Should you attend Hadoop World? Yes.

I received this e-mail via my contact form:

I just discovered you via a Google search because I’m highly considering attending this year’s upcoming Hadoop World in NYC. I appreciate your page that you wrote up after attending last year’s event. I’m wondering if you feel that Hadoop has enough momentum and support to be a “here to stay” technology worth investing one’s time and education into, or is it possible it might fade and be deprecated by something else as the need for big data analysis continues to grow? …

I’ve had a few similar conversations with people lately, and I thought posting my response might help others making similar decisions. The e-mail references my post from last year’s Hadoop World NYC.

Thanks for reaching out. There are several questions in your message
and I hope that this will address them all.

This IS an extremely exciting time to be alive and working with data.
We now have the capacity to learn things about our systems, people in
general, and the world that we simply couldn’t know before, and the
field is only going to keep growing.

Hadoop is currently the primary framework for this kind of data
analysis. It’s certainly worth learning now, especially since Amazon’s
Elastic MapReduce makes it very easy for individuals and small teams to
get started without a large investment.

I’m also not a huge fan of Java and I wish more resources were going
into non-Java alternatives. Fortunately, you can use Hadoop via the
streaming API in almost any language you choose.
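
To make the streaming idea concrete, here is a minimal word-count sketch in Python. It assumes the usual streaming convention: the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, and the framework sorts the mapper output by key before the reducer sees it. The function names (map_lines, reduce_lines, run_locally, main) are made up for this illustration and are not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Imitate the pipeline on one machine: map, sort by key, reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


def main(stage, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: stage is 'map' or 'reduce'."""
    step = map_lines if stage == "map" else reduce_lines
    for record in step(stdin):
        stdout.write(record + "\n")
```

Calling run_locally(["the cat", "The dog"]) lets you check the logic without a cluster. For a real job you would wrap main("map") and main("reduce") in two small scripts and hand them to the streaming jar through its -mapper and -reducer options.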

I do think it’s important to separate the discussion of tools from the
larger philosophical discussion of open problems, algorithms, and
techniques. You can learn the tools from books and blogs. The real
reason to go to a conference like Hadoop World is to meet the people
who go to conferences like Hadoop World and get into those deeper
conversations.

I do hope this year’s conference will highlight the difference between tools and methods, and will also provide plenty of space for those casual hallway conversations.


Hadoop World NYC

Yesterday, I attended the first Hadoop World NYC conference. Hadoop is a platform for scalable distributed computing. In essence, it makes analyzing large quantities of data much faster, and analyzing very large quantities of data possible.

Cloudera did a great job organizing the conference, and managed to assemble a diverse set of speakers. The sessions covered everything from academic research to fraud detection to bioinformatics and even helping people fall in love (eHarmony uses Hadoop)!

I’m not going to review every session, but I saw several themes emerging from the content and conversations.

Hadoop is Getting Easier

New integrated UIs like Cloudera Desktop and Karmasphere mean that developers will no longer be required to use a command-line interface to configure and execute Hadoop jobs. IBM’s M2 project hides Hadoop behind a spreadsheet metaphor, making the collection, analysis and visualization of data as easy as using Excel.

This doesn’t just speed up development; it puts the tools for manipulating the data directly in the hands of the people who need the results, without requiring them to talk to a database programmer.

Hadoop is a Utility

The only organizations that talked about building their own Hadoop clusters were those who deal with very sensitive data (VISA) and those who deal with very, very large quantities of data (Yahoo, Facebook, eBay). Organizations with more manageable data sets, such as eHarmony and the New York Times, use EC2 and Amazon’s Elastic MapReduce. Amazon, Rackspace, and Softlayer have offerings in this area and were all event sponsors.

Yes, you can turn on a cluster of nodes from your living room in your PJs!

Hadoop Can Talk to Your Existing Systems

Hadoop has an ecosystem of supporting products that allow organizations to adapt their existing infrastructure. Cloudera’s Sqoop (which is just fun to say out loud) is a tool for importing data from SQL databases, HBase is a Hadoop database, and Pig lets you talk to the system in Pig Latin, a high-level data-flow language with SQL-like operators.

I expect we’ll see more information available in the near future to clarify which systems are more appropriate for which kinds of users (an ecosystem decision tree?).

Hadoop is Changing Things

I heard the phrase “an order of magnitude improvement in speed” so many times that I lost count. Speaking from personal experience, the difference in productivity between waiting minutes or hours for results and waiting days is immense. When you can see the answer to a question shortly after you ask it, you preserve the context you need to act on that answer immediately, without having to spend time figuring out why you were asking the question in the first place.

Most of the projects were doing fairly simple analysis over data like web user sessions or transactions. I was intrigued by Deepak Singh’s talk on bioinformatics and genome sequencing (slides) and Jake Hofman’s talk on social network analysis (slides). More and more massive datasets are becoming available and will drive new analysis techniques. I do wish there had been a talk about Mahout, which is a very promising approach to developing machine learning algorithms on the Hadoop platform.

I left the event more excited about the technology and very enthusiastic about the community. Thanks for a great day!

Update: A few other people have written up their notes and impressions from the event:

http://jakehofman.com/talks/hadoopworld_20091002.pdf

http://www.slideshare.net/mndoci/hadoop-for-bioinformatics