Hadoop World NYC

Yesterday, I attended the first Hadoop World NYC conference. Hadoop is a platform for scalable distributed computing. In essence, it makes analyzing large quantities of data much faster, and analyzing very large quantities of data possible.
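For readers new to the model behind Hadoop, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the example. It illustrates the idea only and is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are made up for this example. On a real cluster the framework runs these steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, invented for this example.
counts = word_count(["hadoop makes big data small", "big data big clusters"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across as many machines as the data requires.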

Cloudera did a great job organizing the conference, and managed to assemble a diverse set of speakers. The sessions covered everything from academic research to fraud detection to bioinformatics and even helping people fall in love (eHarmony uses Hadoop)!

I’m not going to review every session, but I saw several themes emerging from the content and conversations.

Hadoop is Getting Easier

New integrated UIs like Cloudera Desktop and Karmasphere mean that developers will no longer be required to use a command-line interface to configure and execute Hadoop jobs. IBM’s M2 project hides Hadoop behind a spreadsheet metaphor, making the collection, analysis and visualization of data as easy as using Excel.

This doesn’t just speed up development; it puts the tools for manipulating the data directly in the hands of the people who need the results, without requiring them to go through a database programmer.

Hadoop is a Utility

The only organizations that talked about building their own Hadoop clusters were those that deal with very sensitive data (Visa) and those that deal with extremely large quantities of data (Yahoo, Facebook, eBay). Organizations with more manageable data sets, such as eHarmony and the New York Times, use EC2 and Amazon’s Elastic MapReduce. Amazon, Rackspace, and SoftLayer have offerings in this area and were all event sponsors.

Yes, you can turn on a cluster of nodes from your living room in your PJs!

Hadoop Can Talk to Your Existing Systems

Hadoop has an ecosystem of supporting products that allow organizations to adapt their existing infrastructure. Cloudera’s Sqoop (which is just fun to say out loud) is a tool for importing data from SQL databases, HBase is a database built on top of Hadoop, and Pig lets you describe an analysis in a high-level data-flow language (Pig Latin) instead of hand-written MapReduce code.

I expect we’ll see more information available in the near future to clarify which systems are more appropriate for which kinds of users (an ecosystem decision tree?).

Hadoop is Changing Things

I heard the phrase “an order of magnitude improvement in speed” so many times that I lost count. Speaking from personal experience, the difference in productivity between waiting minutes or hours for results and waiting days is immense. When you can see the answer to a question shortly after you ask it, you still have the context you need to act on that answer immediately, without having to spend time figuring out why you asked the question in the first place.

Most of the projects were doing fairly simple analysis over data like web user sessions or transactions. I was intrigued by Deepak Singh’s talk on bioinformatics and genome sequencing (slides) and Jake Hofman’s talk on social network analysis (slides). More and more massive datasets are becoming available, and they will drive new analysis techniques. I do wish there had been a talk about Mahout, which is a very promising approach to developing machine learning algorithms on the Hadoop platform.

I left the event more excited about the technology and very enthusiastic about the community. Thanks for a great day!

Update: A few other people have written up their notes and impressions from the event:



My NYC Python Meetup Presentation: Practical Data Analysis in Python

I gave a talk at the NYC Python Meetup on July 29 on Practical Data Analysis in Python.

I tend to use my slides for visual representations of the concepts I’m discussing, so a lot of what I covered in the presentation unfortunately isn’t captured here.

The talk starts with the immense opportunities for knowledge derived from data. I spent some time showing data systems ‘in the wild’ along with the appropriate algorithmic vocabulary (for example, Amazon.com’s ‘books you might like’ feature is a recommender system).

Once we can describe the problems properly, we can look for tools, and Python has many! Finally, in the fun part of the presentation, I demoed working code that uses NLTK to build a Twitter spam filter with 90% accuracy*.
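For a sense of how a filter like this works, here is a minimal sketch of a naive Bayes text classifier. It is not the demo code from the talk: it uses only the standard library rather than NLTK, naive Bayes is just one common choice for this kind of problem, and the class name, the crude tokenizer, and the six training messages are all invented for illustration. It says nothing about the 90% figure above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; deliberately crude."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical bag-of-words naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of messages
        self.vocabulary = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # Start from the log prior, then add a log likelihood per word.
            score = math.log(self.label_counts[label] / total)
            n_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (n_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data; a real filter needs far more labeled messages.
TRAINING = [
    ("win a free iphone click here", "spam"),
    ("free followers click this link now", "spam"),
    ("click here to win free cash", "spam"),
    ("heading to the python meetup tonight", "ham"),
    ("great talk on data analysis today", "ham"),
    ("reading about hadoop and python tonight", "ham"),
]

spam_filter = NaiveBayesSpamFilter()
spam_filter.train(TRAINING)
print(spam_filter.classify("click here for free followers"))
print(spam_filter.classify("python talk tonight"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a single unseen word from forcing a score to zero.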

Please let me know if you have questions or comments.

* I’ll post the code and training data shortly.