Conference: PyCon 2011 Keynote!

I gave the opening keynote this morning at PyCon.

The one thing that everyone in the room at PyCon has in common is that we all love to code. I used that as the central theme of the talk, spoke about the constructs that give us joy, the history of some of our favorite patterns (they date as far back as the 60s!) and proposed that we think about the way we’ll compute fifty years into the future. There’s also a bit of fun data hacking, of course.

Enjoy the slides. The video is up!

Please let me know here or on Twitter if you have any questions or comments.

Betaworks Builds a MakerBot

A few weeks ago, a bunch of us spent two long evenings in the office assembling a MakerBot. Hudson Lines made an awesome timelapse video of it.


Special thanks to the always awesome Adam (from MakerBot) for helping us breeze through the final configuration and calibration steps.

NPR: Interview on Science Friday

On Friday, January 28th, I hopped in a cab and went up to NPR’s Bryant Park recording studio for a fifteen-minute chat with Ira Flatow, host of Science Friday. I’ve been a big fan of Ira and Science Friday since I discovered the show years ago, and it was a very exciting honor to be a guest.

The image at the right is a snapshot I took with my phone while nervously waiting outside the studio.

The title of the segment is the rather dramatic Privacy At Stake As Sites Track Online Preferences. Our conversation wound around the issues of tracking user data online, and the potential opportunities and dangers that all users of online services face.

NPR has the full broadcast and transcript online.

By far the most fun and unexpected aspect of this was the number of people who wrote to me to ask questions or say that they appreciated my perspective. Many of them don’t typically follow technology news or startups, and it’s exciting to hear from people who heard the interview and were intrigued.

Machine Learning: A Love Story

The video from my keynote at Strange Loop 2010 is up!

You can watch the video here: Machine Learning: A Love Story

The original abstract:

Machine learning has come a long way in recent years — from a long-marginalized field so old it still has the word “machine” in the name, to the last, best hope for making sense of our massive flows of data.

The art of ‘data science’ is asking the right questions; the answers are generally trivial or impossible. This talk will focus more on questions than on answers. I’ll give a brief history of the field with a focus on the fundamental math and algorithmic tools that we use to address these kinds of problems, then walk through several descriptive and predictive scenarios.

Finally, I’ll show one example system using data in-depth, from the backend infrastructure through the algorithms and data processing layer to show a functioning product.

Attendees should expect to hear some good stories of data gone right and data gone awry, and walk away with a few new clever tricks.

The presentation was calibrated for the audience in the room, but I’ll be happy to answer any questions in the comments below!

Twitter Succeeds Because it Fails

How can Twitter be so popular and successful if it’s down all the time?

We base statements like this on the assumption that the quality of a web application maps linearly to the application’s stability. This is obviously true for most sites most of the time, but things get interesting at the edge where rare, unpredictable failure actually enables more complex human interactions around the service.

Unlike e-mail, Twitter etiquette doesn’t demand that you read or reply to every message from every person you follow (or who follows you). Combine that lightweight social touch with occasional technical issues and human communication patterns, and we start to see some interesting behavior.

Twitter’s lack of reliability as a platform allows us to use the technical failings to mask our own social imperfections. How often have you heard or said something like “I was sure I was following you” or “I must not have gotten that DM” or even “I think I tweeted that…”? Even just a small percent of users behaving this way changes the social expectations.

I’d love to construct an experiment to figure out whether this idea has merit, and if so, what failure rate is optimal for social deniability. Should 1 in 100 actions fail? 1 in 10,000? 1 in 1,000,000? Does it matter if any fail, as long as we believe that every so often failure occurs? (How often do things really get lost in the mail, anyway?)

It’s amusing to conceive of a system that succeeds socially because it often fails technically.

Should you attend Hadoop World? Yes.

I received this e-mail via my contact form:

I just discovered you via a Google search because I’m highly considering attending this year’s upcoming Hadoop World in NYC. I appreciate your page that you wrote up after attending last year’s event. I’m wondering if you feel that Hadoop has enough momentum and support to be a “here to stay” technology worth investing one’s time and education into, or is it possible it might fade and be deprecated by something else as the need for big data analysis continues to grow? …

I’ve had a few similar conversations with people lately, and I thought posting my response might help others making similar decisions. The e-mail is referencing my post from last year’s Hadoop World NYC.

Thanks for reaching out. There are several questions in your message
and I hope that this will address them all.

This IS an extremely exciting time to be alive and working with data.
We now have the capacity to learn things about our systems, people in
general, and the world that we simply couldn’t know before — the
field is only going to keep growing.

Hadoop is currently the primary tool framework for this kind of data
analysis. It’s certainly worth learning now, especially since Amazon’s
Elastic MapReduce makes it very easy for individuals and small teams to
get started without a large investment.

I’m also not a huge fan of Java and I wish more resources were going
into non-Java alternatives. Fortunately, you can use Hadoop via the
streaming API in most any language you choose.

I do think it’s important to separate the discussion of tools from the
larger philosophical discussion of open problems, algorithms, and
techniques. You can learn the tools from books and blogs. The real
reason to go to a conference like Hadoop World is to meet the people
who go to conferences like Hadoop World and get into those deeper discussions.

I do hope this year’s conference will highlight the difference between tools and methods, and will also provide plenty of space for those casual hallway conversations.
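
To make the streaming point from the reply concrete, here is a minimal sketch of a word-count job in Python instead of Java. Hadoop streaming hands each task its input as lines on standard input and expects tab-separated key/value lines on standard output, with the reducer's input already sorted by key. The word-count task, the function names, and the "map"/"reduce" command-line convention are illustrative assumptions, not anything taken from the conference or the original e-mail.

```python
import sys
from itertools import groupby


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(lines):
    """Reducer: sum the counts for each word.

    Assumes the input is already sorted by key, which is what the
    streaming framework provides between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: run as "wordcount.py map" or "wordcount.py reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_words if sys.argv[1] == "map" else reduce_counts
    for out in step(sys.stdin):
        print(out)
```

Saved as a hypothetical wordcount.py, the same script can be checked locally by piping a text file through "python wordcount.py map", then sort, then "python wordcount.py reduce", before handing it to the streaming jar through its -mapper and -reducer options.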

A quick Twitter bot, @bc_l

Several months ago, on a whim inspired by an off-hand comment from Chris, I created a bot to bring the wonders of the Unix bc language to Twitter.

bc is a command-line calculator that’s fast and has the capacity to do some fairly complex math.

Try it out on the command line:

echo '100 / 10' | bc -l

…Or by sending a direct message to bc_l (if you follow bc_l it will follow you back within a few hours).

I released the code under the GPL, and it’s available on GitHub:

John Cook mentions the bot and makes some great observations in his post three surprises with bc.
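
For readers curious how a program can hand an expression to bc, here is a minimal sketch in Python. It is not the bot's released code; the function name bc_eval and the five-second timeout are assumptions made for illustration. It passes the expression on standard input, the same way the echo pipeline above does, and it requires bc to be installed.

```python
import shutil
import subprocess


def bc_eval(expression, timeout=5):
    """Evaluate one expression with `bc -l` and return its output as text."""
    if shutil.which("bc") is None:
        raise RuntimeError("bc is not installed on this machine")
    result = subprocess.run(
        ["bc", "-l"],             # argument list, so no shell is involved
        input=expression + "\n",  # bc reads the expression from stdin
        capture_output=True,
        text=True,
        timeout=timeout,          # guard against expressions that never finish
    )
    return result.stdout.strip()
```

Calling bc_eval("100 / 10") mirrors the command-line example. Passing the command as a list rather than a shell string keeps user-supplied text away from the shell, which matters when the input comes from strangers on the internet.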

Conference: Web2 Expo SF

I gave a talk called A Data-driven Look at the Realtime Web Ecosystem at the Web2Expo SF conference in May in San Francisco. I attempted to highlight some of the interesting facets of the data set, and it appeared to be well-received (showing up on TechCrunch, ZDNet, and a few other places).

I attended the full conference, and it was great. The attendees were extremely international and I met a ton of fascinating people.

I’m still getting a couple of e-mail requests per week for my slides and materials, so they’re posted below for posterity.

The slides:

And the video:

As always, I welcome your questions or comments.

E-mail automation, questions and answers

Welcome! I’ve gotten several hundred e-mails about my e-mail management code. I do want to share it as soon as possible. Here are the answers to the most common questions.

Why separate scripts?

My philosophy is based on the Unix command-line tool model: each script should be simple and useful alone, but when combined they become extremely powerful.

Why don’t we have the code yet?!

I had no idea the talk would be shared beyond the couple hundred people in the audience or that it would be so popular! I started my position at the same day I gave that IgniteNYC presentation, and I also have some other awesome projects that are competing for time.

I have to admit that the trained classifiers are all based on my personal data and were also trained mostly through tweaking in ipython. I need to finish a generic framework for people to train their own filters before I can publish that piece of the system. I promise, I’m working on it.

Keep nagging me — nagging works!

Are you going to commercialize your scripts / can I invest?

I have certainly thought about commercializing the application, but I’m uncomfortable asking people to give me access to their personal e-mail data (even if there are very interesting things to be learned by aggregate analysis).

Just imagine how much more creative, interesting work could be done if we could partially free the world from the e-mail workload… that alone is worth making the code open.

How does it work? What tech are you using?

The scripts run on my gmail account through IMAP (and should work with any IMAP interface, though I’m sure there is debugging to be done). They live on a Linode VPS and run individually via cron jobs.

Most of the scripts are in Python. I use NLTK and libsvm (in addition to my own code) for the data analysis.

I primarily use the gmail web interface (though I’ve flipflopped between and Thunderbird for a while), and the only cost is that I have to manually reload the page to see new labels and new drafts appear.

Do your scripts go mad with power and e-mail inappropriately? Are you some kinda robot?

I have all of the scripts deposit suggested responses in the draft folder, and then I use the gmail “multiple inboxes” feature to keep the draft folder up in the UI. It’s very easy to go through and modify or delete responses before they are sent.

Of course, I only thought of that after one of the scripts DID go a bit mad. I’m still sorry about that, Mom.

I’m not a robot, though of course I would say that anyway! The point of the automation is to remove the stupid parts of e-mail and leave me free to personally address the interesting messages.
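
The draft-folder safeguard can be sketched with Python's standard library. This is not the code from my scripts, which isn't published yet; the function names are hypothetical, and the mailbox name "[Gmail]/Drafts" is an assumption about how an English-language Gmail account exposes its drafts over IMAP.

```python
import imaplib
import time
from email.message import EmailMessage


def build_reply(to_addr, subject, body):
    """Build a plain-text reply message (nothing is sent)."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = subject if subject.lower().startswith("re:") else f"Re: {subject}"
    msg.set_content(body)
    return msg


def deposit_draft(conn, msg, mailbox="[Gmail]/Drafts"):
    """Append msg to the drafts mailbox over an authenticated IMAP connection.

    The message is only stored as a draft; a human still has to
    review it and press send.
    """
    return conn.append(
        mailbox,
        "\\Draft",                               # mark the stored copy as a draft
        imaplib.Time2Internaldate(time.time()),  # timestamp for the stored message
        msg.as_bytes(),
    )
```

In use, conn would be an imaplib.IMAP4_SSL("imap.gmail.com") object after a successful login. Because the script only ever appends to the drafts mailbox, the worst it can do is leave behind a bad suggestion for me to delete.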

If you’ve read this far, there are a few things I would love your feedback on:

What’s a kickass name for this project?

More important, which features/scripts are you most interested in seeing first? The nag script is about ready to go, but I’d like to know where to focus my time.


Stop talking, start coding

I read Out of the Loop in Silicon Valley in the NYTimes today, which explores how and why women are underrepresented in tech startups. From the number of retweets I saw and the clicks through links (12,579 at the time of this posting), it’s been getting a lot of attention.

There are some very strong, compelling themes in this article. Computer science and engineering do have an “image problem”; the way we teach math to elementary school students is horrible and turns way too many away.

I don’t want to nitpick the article, but there are a few statements that reinforce the very damaging stereotypes that the article sets out to dispel.

“When women take on the challenges of an engineering or computer science education in college, some studies suggest that they struggle against a distinct set of personal, psycho-social issues… Even women who soldier though [sic] demanding computer science and engineering programs in college…”

I’ve been both a computer science student and a computer science professor. I have not seen any evidence that the average undergraduate computer science education is harder than physics, math, chemistry, biology, or many other ‘hard’ disciplines with a much stronger gender balance. Implying that women are unwilling to meet the intellectual challenges of the discipline is bullshit.

“Girls have certain family goals they want to accomplish,” she says. “Working 60 hours a week is difficult because it requires a life sacrifice.”

The men that I know and work with also have wonderful personal lives. Working 60-hour weeks is a sacrifice for them, too.

Please read the whole article. Let me know what you think when you see the material in context.

I’m going to make the assumption that we all believe that having more women in technology is a Good Thing™.

Many groups have popped up that support women in technology, like Girls in Tech, She’s Geeky, and many others (enumerated in Digiphile’s thoughtful post Why Including women matters for the future of technology and society). More often than not, these groups are the canned food drives of the women in technology movement. They make you feel better, they might do a little good, but they offer no fundamental change to the system that created the problem in the first place.

The Grace Hopper Celebration of Women in Computing does this well. GHC invites women to come to one place, be together, and do science together.

We don’t need affirmative action for women in tech. We need to create experiences that nurture women and men so that more people are inspired to create beautiful, technical things together.