I spoke at devs love bacon back in April on Everything You Need to Know about Machine Learning in 30 Minutes or Less. The talk is geared toward engineers with no prior knowledge of machine learning. It lays out the basic vocabulary and the way we think about the world, as an amusing foundation that gives attendees a head start in deciding which techniques they might want to learn more about or implement. This talk is not an in-depth tutorial.
I just got home from the Web 2.0 Summit, a three-day conference that was packed with announcements, interesting ideas, and good conversations.
My short talk, The Secrets of our Data Subconscious, touches on how the data we generate online interacts with the physical world spatially and through time, and on the relationships between the things we consume (in private) and the things we broadcast (in public).
The first Strata Conference in New York just wound up. It was a five-day expo of business, data, and tech, and brought a ton of great people in the data community to New York.
Thanks so much to Edd and Alistair and everyone whose hard work made this possible!
My talk, Short URLs, Big Data: Learning in Realtime, is already online:
And the slides are up on Slideshare:
I’m really excited that An Introduction to Machine Learning with Web Data is now available for purchase!
This is a 2 hour and 43 minute instructional video that walks you through basic machine learning algorithms, first theoretically and mathematically, and then with Python example code (which is available here).
This video is an instructional take on, and builds on, the material I covered in my Strange Loop 2010 keynote Machine Learning: A Love Story and the Data Bootcamp I did with Joe Adler, Drew Conway, and Jake Hofman at the Strata Conference in February.
I’d also like to acknowledge the many collaborators, colleagues, and friends who have made definite contributions to my thinking about this material and how best to present it, particularly Chris Wiggins who co-authored A Taxonomy of Data Science and Andrew, Dennis, Jan, Jesse, and Julie, the members of the studio audience for the class (who were amazing).
If you like it, please leave it a good review! As always, questions and comments are welcome here or by e-mail.
I gave the opening keynote this morning at PyCon.
The one thing that everyone in the room at PyCon has in common is that we all love to code. I used that as the central theme of the talk, spoke about the constructs that give us joy, the history of some of our favorite patterns (they date as far back as the 60s!) and proposed that we think about the way we’ll compute fifty years into the future. There’s also a bit of fun data hacking, of course.
Enjoy the slides. The video is up!
Please let me know here or on Twitter if you have any questions or comments.
The video from my keynote at Strange Loop 2010 is up!
You can watch the video here: Machine Learning: A Love Story
The original abstract:
Machine learning has come a long way in recent years — from a long-marginalized field so old it still has the word “machine” in the name, to the last, best hope for making sense of our massive flows of data.
The art of ‘data science’ is asking the right questions; the answers are generally trivial or impossible. This talk will focus more on questions than on answers. I’ll give a brief history of the field with a focus on the fundamental math and algorithmic tools that we use to address these kinds of problems, then walk through several descriptive and predictive scenarios.
Finally, I’ll show one example system using bit.ly data in-depth, from the backend infrastructure through the algorithms and data processing layer to show a functioning product.
Attendees should expect to hear some good stories of data gone right and data gone awry, and walk away with a few new clever tricks.
The presentation was calibrated for the audience in the room, but I’ll be happy to answer any questions in the comments below!
I gave a talk called A Data-driven Look at the Realtime Web Ecosystem at the Web2Expo SF conference in May in San Francisco. I attempted to highlight some of the interesting facets of the bit.ly data set, and it appeared to be well-received (showing up on TechCrunch, ZDNet, and a few other places).
I attended the full conference, and it was great. The attendees were extremely international and I met a ton of fascinating people.
I’m still getting a couple of e-mail requests per week for my slides and materials, so they’re posted below for posterity.
And the video:
As always, I welcome your questions or comments.
I’m honored and excited to be participating in Rhizome’s new conference Seven on Seven, where technologists and artists are paired up to create a completely new project in 24 hours.
The formal description:
Seven on Seven will pair seven leading artists with seven game-changing technologists in teams of two, and challenge them to develop something new –be it an application, social media, artwork, product, or whatever they imagine– over the course of a single day. The seven teams will unveil their ideas at a one-day event at the New Museum on April 17th.
I really love this idea because the time constraints and the inherent discomfort of the situation (working in an unfamiliar space with an unfamiliar person) make it likely that we’ll be able to accomplish something creative and unexpected. Or else it will go completely awry, which will still be amusing for the audience.
I’ve had a lot of fun and been able to work on some interesting projects at hackathons in the past, and I hope this one will be even better.
I recently attended the Third Annual Workshop on Search and Social Media, an academic workshop with very strong industry participation. The workshop was packed, and had some of the most informative and interesting panel discussions I’ve seen (not counting the one I spoke on!).
Daniel Tunkelang did a great job of writing up the specific presentations on his site and on the ACM blog, so I won’t attempt to re-create the presentations line by line at this late date. Rather, I’d like to highlight a few open problems and research questions that came out of the discussions that I hope to see developed in the next year.
Social search consists of a set of problems including (but hardly limited to): search of social content like status updates; real-time search; generating, labeling, and finding user-generated content; ‘long-tail’ events and interests; finding vs re-finding; and trend identification.
What data is available to social search? There are many kinds of social data, from e-mail (private) to blogs (public) and tweets (mostly public) — what is and should be searchable? How do we handle issues of privacy and identity management?
How do we compute relevance, taking into account freshness, accuracy, and degrees of social separation?
Will the architecture of these search engines look like the search engines we’re currently familiar with?
How do we evaluate accuracy and truthiness of social data?
How do we characterize social connections, around concepts like strong vs weak ties, and friend-of-a-friend vs friend-of-a-friend’s-friend? Can we converge on a single social graph representation?
How do we best filter social data to lead to accurate recommendations for content discovery? How do we accommodate the fact that as we move beyond static factual data, two people using the same query may be looking for very different results?
Finally, how do we deal with the chasm between the industry participants (who have LOTS of data) and the academic participants, who suffer from a lack of public (and publishable) data?
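To make the relevance question above a little more concrete, here is a toy Python sketch. It is not something presented at the workshop. It assumes a hypothetical scoring rule that multiplies a text-match score by an exponential freshness decay and a social-closeness factor derived from degrees of separation. The function names, the half-life parameter, the floor for unconnected authors, and the example graph are all illustrative assumptions.

```python
from collections import deque


def degrees_of_separation(graph, source, target):
    """Shortest hop count between two users via breadth-first search.

    `graph` maps each user to a list of direct connections.
    Returns None when no path exists.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        user, dist = queue.popleft()
        for friend in graph.get(user, ()):
            if friend == target:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None


def relevance(text_score, age_hours, separation, half_life_hours=6.0):
    """Hypothetical relevance: text match x freshness x social closeness."""
    # Freshness halves every `half_life_hours` (an assumed tuning knob).
    freshness = 0.5 ** (age_hours / half_life_hours)
    # Closer authors count for more; unconnected authors get a small floor
    # (an arbitrary choice made for this sketch).
    social = 0.1 if separation is None else 1.0 / (1.0 + separation)
    return text_score * freshness * social


# Made-up friendship graph for illustration only.
graph = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
}

hops = degrees_of_separation(graph, "ann", "dan")
score = relevance(text_score=1.0, age_hours=6.0, separation=1)
```

Even a toy like this shows why the question is open: the half-life, the shape of the social factor, and how accuracy should enter the score are all judgment calls that different searchers would weigh differently.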