Hadoop World NYC

Yesterday, I attended the first Hadoop World NYC conference. Hadoop is a platform for scalable distributed computing. In essence, it makes analyzing large quantities of data much faster, and analyzing very large quantities of data possible.

Cloudera did a great job organizing the conference, and managed to assemble a diverse set of speakers. The sessions covered everything from academic research to fraud detection to bioinformatics and even helping people fall in love (eHarmony uses Hadoop)!

I’m not going to review every session, but I saw several themes emerging from the content and conversations.

Hadoop is Getting Easier

New integrated UIs like Cloudera Desktop and Karmasphere mean that developers will no longer be required to use a command-line interface to configure and execute Hadoop jobs. IBM’s M2 project hides Hadoop behind a spreadsheet metaphor, making the collection, analysis and visualization of data as easy as using Excel.

This doesn’t just speed up development; it puts the tools for manipulating the data directly in the hands of the people who need the results, without requiring them to talk to a database programmer.

Hadoop is a Utility

The only organizations that talked about building their own Hadoop clusters were those that deal with very sensitive data (VISA) and those that deal with extremely large quantities of data (Yahoo, Facebook, eBay). Organizations with more manageable data sets, such as eHarmony and the New York Times, use EC2 and Amazon’s Elastic MapReduce. Amazon, Rackspace, and Softlayer have offerings in this area and were all event sponsors.

Yes, you can turn on a cluster of nodes from your living room in your PJs!

Hadoop Can Talk to Your Existing Systems

Hadoop has an ecosystem of supporting products that allow organizations to adapt their existing infrastructure. Cloudera’s Sqoop (which is just fun to say out loud) is a tool for importing data from SQL databases, HBase is a database that runs on top of Hadoop, and Pig lets you talk to the system in a SQL-like language (Pig Latin).

I expect we’ll see more information available in the near future to clarify which systems are more appropriate for which kinds of users (an ecosystem decision tree?).

Hadoop is Changing Things

I heard the phrase “an order of magnitude improvement in speed” so many times that I lost count. Speaking from personal experience, the difference in productivity between waiting minutes or hours for results and waiting days is immense. When you can see the answer to a question shortly after you ask it, you preserve the context you need to act on that answer immediately, without having to spend time figuring out why you were asking the question in the first place.

Most of the projects were doing fairly simple analysis over data like web user sessions or transactions. I was intrigued by Deepak Singh’s talk on bioinformatics and genome sequencing (slides) and Jake Hofman’s talk on social network analysis (slides). More and more massive datasets are becoming available and will drive new techniques for analysis. I do wish there had been a talk about Mahout, which is a very promising approach to developing machine learning algorithms on the Hadoop platform.
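To give a rough sense of what “fairly simple analysis” over user sessions can look like in the map/reduce style, here is a minimal sketch in plain Python. It does not use Hadoop at all; it only imitates the map, shuffle, and reduce phases in a single process. The log records and the function names (map_record, shuffle, reduce_counts, count_page_views) are invented for illustration and do not come from any of the talks.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log records: (user_id, page) pairs, invented for illustration.
LOG = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/search"),
    ("alice", "/profile"),
    ("bob", "/search"),
]


def map_record(record):
    """Map step: emit one (user_id, 1) pair per page view."""
    user_id, _page = record
    yield (user_id, 1)


def shuffle(pairs):
    """Shuffle step: group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return (key, sum(values))


def count_page_views(records):
    """Run the three phases in order and return page views per user."""
    mapped = chain.from_iterable(map_record(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(count_page_views(LOG))  # {'alice': 3, 'bob': 2}
```

On a real cluster the map and reduce functions would run in parallel across many machines, with the framework handling the grouping step, which is where the speedups come from.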

I left the event more excited about the technology and very enthusiastic about the community. Thanks for a great day!

Update: A few other people have written up their notes and impressions from the event:

http://jakehofman.com/talks/hadoopworld_20091002.pdf

http://www.slideshare.net/mndoci/hadoop-for-bioinformatics


9 Comments on “Hadoop World NYC”

  1. […] This post was mentioned on Twitter by jake hofman. jake hofman said: RT @hmason: My observations of trends at Hadoop World NYC: http://bit.ly/vwDC8 #hadoopworld […]

  2. […] A couple of people I got to meet all too briefly or not at all were the aforementioned Jake Hofman and Hilary Mason (many Friendfeeders will really appreciate her blog). Hilary also blogged about her post-Hadoop World thoughts. […]

  3. Richard Zak says:

    Might you have notes from the “Hadoop for Bio-Informatics” presentation by Deepak Singh?

  4. Hilary Mason says:

    Richard,

    I did take notes and I’m happy to share (please just send an e-mail), but I think most of the content was summarized in his slides:

    http://bit.ly/3gXMOn

    Unfortunately, nothing can capture the dynamic of the presentation. It was one of the best of the day.

  5. […] conference in general, there is some good commentary out there, from Dan Milstein, Steve Laniel, Hilary Mason, and no doubt […]

  6. mohd.nishat akhtar says:

    I have already installed hadoop on 8 nodes…Please can u give me any idea how to run BLAST programs on 8 node cluster…

  7. […] Hadoop World NYC (hilarymason.com) […]

  8. […] Hadoop World NYC – May 24th (tags: hadoop mapreduce conference hadoopworld aws distributed ec2 data) […]

  9. Ricardo says:

    Hilary: what was your impression with Pig? Have you tried it? I think it has a lot of potential but unfortunately the documentation is very poor and appears to me that still has many limitations. If Yahoo is using it in production now, I imagine they use it for simple tasks (aggregation, groupings, etc).