I just got home from the Web 2.0 Summit, a three-day conference that was packed with announcements, interesting ideas, and good conversations.
My short talk, The Secrets of our Data Subconscious, touches on how the data we generate online intersects with the physical world, spatially and through time, and on the relationships between the things we consume (in private) and the things we broadcast (in public).
The first Strata Conference in New York just wrapped up. It was a five-day expo of business, data, and tech, and it brought a ton of great people in the data community to New York.
Thanks so much to Edd and Alistair and everyone whose hard work made this possible!
My talk, Short URLs, Big Data: Learning in Realtime, is already online:
And the slides are up on Slideshare:
I gave the opening keynote this morning at PyCon.
The one thing that everyone in the room at PyCon has in common is that we all love to code. I used that as the central theme of the talk: I spoke about the constructs that give us joy, traced the history of some of our favorite patterns (they date back as far as the 1960s!), and proposed that we think about the way we’ll compute fifty years into the future. There’s also a bit of fun data hacking, of course.
Enjoy the slides. The video is up!
Please let me know here or on Twitter if you have any questions or comments.
You can catch an interview (or see a writeup) that I did live from the Strata Conference on Silicon Angle TV! We talk about bit.ly data and politics, and touch briefly on some of the interesting problems that we’re working on.
[I removed the video embed because it was annoyingly auto-playing in some browsers. You can still see the video here.]
I gave a talk called A Data-driven Look at the Realtime Web Ecosystem at the Web2Expo conference in San Francisco in May. I attempted to highlight some of the interesting facets of the bit.ly data set, and it appeared to be well-received (showing up on TechCrunch, ZDNet, and a few other places).
I attended the full conference, and it was great. The attendees were extremely international and I met a ton of fascinating people.
I’m still getting a couple of e-mail requests per week for my slides and materials, so they’re posted below for posterity.
And the video:
As always, I welcome your questions or comments.
I recently attended the Third Annual Workshop on Search and Social Media, an academic workshop with very strong industry participation. The workshop was packed, and had some of the most informative and interesting panel discussions I’ve seen (not counting the one I spoke on!).
Daniel Tunkelang did a great job of writing up the specific presentations on his site and on the ACM blog, so I won’t attempt to re-create the presentations line by line at this late date. Rather, I’d like to highlight a few open problems and research questions that came out of the discussions that I hope to see developed in the next year.
Social search consists of a set of problems including (but hardly limited to): search of social content like status updates; real-time search; generating, labeling, and finding user-generated content; ‘long-tail’ events and interests; finding vs re-finding; and trend identification.
What data is available to social search? There are many kinds of social data, from e-mail (private) to blogs (public) and tweets (mostly public) — what is and should be searchable? How do we handle issues of privacy and identity management?
How do we compute relevance, taking into account freshness, accuracy, and degrees of social separation?
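One common way to sketch the freshness part of that question is an exponential decay on a base relevance score; the half-life below is an arbitrary illustrative value, not anything proposed at the workshop:

```python
def freshness_score(base_relevance, age_hours, half_life_hours=24.0):
    """Decay a base relevance score so a document loses half its weight
    every half_life_hours; fresher content ranks higher, all else equal."""
    return base_relevance * 0.5 ** (age_hours / half_life_hours)
```

The open question is what else belongs in `base_relevance` (accuracy, social distance) and how the decay rate should vary by query type.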
Will the architecture of these search engines look like the search engines we’re currently familiar with?
How do we evaluate accuracy and truthiness of social data?
How do we characterize social connections, around concepts like strong vs weak ties, and friend-of-a-friend vs friend-of-a-friend’s-friend? Can we converge on a single social graph representation?
How do we best filter social data to lead to accurate recommendations for content discovery? How do we accommodate the fact that as we move beyond static factual data, two people using the same query may be looking for very different results?
Finally, how do we deal with the chasm between the industry participants (who have LOTS of data) and the academic participants, who suffer from a lack of public (and publishable) data?
I was invited to speak on a panel on semantic metadata, moderated by Paul Ford (harpers.org) along with Marco Neumann (KONA) and Paul Tarjan (Yahoo/Search Monkey). The panel was a lively discussion, and we got some great questions from the audience.
After the panel, I stayed around to participate in the hack competition. Yahoo! provided a fantastic space, with free-flowing coffee, snacks, comfy chairs and plenty of Yahoo folks and other hackers around to give advice and play foosball with. I teamed up with Diana Eng, Alicia Gibb, and Bill Ward to create the Del.icio.us Cake!
The cake is attached to a laptop via USB. A program running on the laptop accepts a delicious tag and retrieves a list of recent popular sites for that tag from the delicious API. Finally, it iterates through each URL, downloads the page, and computes the sentiment of that page relative to the tag — basically, is the content of the page positive, neutral or negative?
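A minimal sketch of that laptop-side loop; the three helpers are hypothetical stand-ins (the delicious feed endpoint we used is long gone), injected here so the flow is clear:

```python
def page_sentiments(tag, fetch_popular_urls, download, classify):
    """For each recent popular URL bookmarked under `tag`, download the
    page and classify its sentiment relative to the tag. The injected
    helpers are assumptions: fetch_popular_urls(tag) -> list of URLs
    (wrapping the delicious API), download(url) -> page text, and
    classify(text) -> 'positive' | 'neutral' | 'negative'."""
    results = {}
    for url in fetch_popular_urls(tag):
        results[url] = classify(download(url))
    return results
```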
The signal is output to an Arduino (hidden in the middle of the cake), which turns on the appropriate set of LEDs. There are four sets of LEDs on the cake, one in each quadrant of the delicious logo: one each for positive, neutral-or-inconclusive, and negative sentiment, and, of course, one to let us know that the cake is turned on.
I wrote the sentiment classifiers between around 3am and 6am Saturday morning, so they really were a hack! I trained them on movie reviews data, working with the assumption that 5-star reviews contain positive terms and 1-star reviews contain negative terms. I wouldn’t recommend this approach for a serious attempt at sentiment analysis, but it worked well enough.
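A rough sketch of that kind of quick-and-dirty classifier, assuming labeled (text, stars) review pairs as training data; this is the naive count-based hack described above, not production-quality sentiment analysis:

```python
import re
from collections import Counter

def train(reviews):
    """Count term frequencies in positive (5-star) and negative (1-star)
    reviews; everything in between is ignored in this crude scheme."""
    pos, neg = Counter(), Counter()
    for text, stars in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        (pos if stars == 5 else neg).update(words)
    return pos, neg

def sentiment(text, pos, neg):
    """Score text by the difference in positive vs negative term counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(pos[w] - neg[w] for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Common words appear in both counters and roughly cancel out, which is the only thing keeping this 3am hack honest.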
We won the food/hardware hack prize, shared with the awesome MakerBot team!
We had a great time creating and presenting the hack. Thanks, Yahoo, and most of all, thanks to Alicia, Bill, and Diana for a really fantastic, silly weekend.
- Yahoo’s summary of the Open Hack NYC event
- Diana’s writeup for Eyebeam
- CNN.com: Hackers Take Over Times Square
Yesterday, I attended the first Hadoop World NYC conference. Hadoop is a platform for scalable distributed computing. In essence, it makes analyzing large quantities of data much faster, and analyzing very large quantities of data possible.
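The canonical illustration (mine, not from the conference) is word count: with Hadoop Streaming you can write the map and reduce steps as plain scripts, which might look like this in Python:

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word. Under Hadoop
    Streaming this would read lines from stdin and print tab-separated
    key/value pairs to stdout."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts for each word. Hadoop hands each
    reducer its pairs sorted by key, so grouping consecutive equal keys
    is enough; we sort here only to simulate the shuffle phase."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)
```

The win is that Hadoop runs many of these mappers in parallel across the cluster and handles the sort-and-shuffle between the two steps for you.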
Cloudera did a great job organizing the conference, and managed to assemble a diverse set of speakers. The sessions covered everything from academic research to fraud detection to bioinformatics and even helping people fall in love (eHarmony uses Hadoop)!
I’m not going to review every session, but I saw several themes emerging from the content and conversations.
Hadoop is Getting Easier
New integrated UIs like Cloudera Desktop and Karmasphere mean that developers will no longer be required to use a command-line interface to configure and execute Hadoop jobs. IBM’s M2 project hides Hadoop behind a spreadsheet metaphor, making the collection, analysis and visualization of data as easy as using Excel.
This doesn’t just speed up development time, it puts the tools for manipulating the data directly in the hands of the people who need the results, without requiring them to talk to a database programmer.
Hadoop is a Utility
The only organizations that talked about building their own Hadoop clusters are those who deal with very sensitive data (VISA) and those who deal with very, very large quantities of data (Yahoo, Facebook, eBay). Organizations with more manageable data sets, such as eHarmony and the New York Times, use EC2 and Amazon’s Elastic MapReduce. Amazon, Rackspace, and Softlayer have offerings in this area and were all event sponsors.
Yes, you can turn on a cluster of nodes from your living room in your PJs!
Hadoop Can Talk to Your Existing Systems
Hadoop has an ecosystem of supporting products that let organizations adapt their existing infrastructure. Cloudera’s Sqoop (which is just fun to say out loud) is a tool for importing data from SQL databases, HBase is a database built on Hadoop, and Pig lets you script jobs in a high-level data-flow language instead of writing raw MapReduce code.
I expect we’ll see more information available in the near future to clarify which systems are more appropriate for which kinds of users (an ecosystem decision tree?).
Hadoop is Changing Things
I heard the phrase “an order of magnitude improvement in speed” so many times that I lost count. Speaking from personal experience, the productivity difference between waiting minutes or hours for results and waiting days is immense. When you can see the answer to a question shortly after you ask it, you preserve the context you need to act on that answer immediately, without having to reconstruct why you were asking in the first place.
Most of the projects were doing fairly simple analysis over data like web user sessions or transactions. I was intrigued by Deepak Singh’s talk on bioinformatics and genome sequencing (slides) and Jake Hofman’s talk on social network analysis (slides). More and more massive datasets are becoming available, and they will drive new analysis techniques. I do wish there had been a talk about Mahout, which is a very promising approach to developing machine learning algorithms on the Hadoop platform.
I left the event more excited about the technology and very enthusiastic about the community. Thanks for a great day!
Update: A few other people have written up their notes and impressions from the event:
- Stephen O’grady posted The View from HadoopWorld
- Deepak Singh’s Post-HadoopWorld Thoughts
- HubSpot Dev Blog has two write-ups, by Dan and Steve
- Atbrox has notes from the morning session and the application session
- Alexander Sicular’s Are You New to Hadoop? Settle in…
- Pete Skomoroch posted his slides and thoughts
I gave a talk at BarCampNYC4 on Saturday on common data problems and a very light overview of algorithms that address them.
I delivered the majority of the content verbally, by talking through examples of problems and how to solve them, so there’s no guarantee that these slides will make sense, but they might be funny!
Sanford took some excellent notes during the presentation.
The discussion was so lively and engaging that I’m planning to expand on this content — I really welcome your suggestions and comments!