Conference: Strata NY 2011

The first Strata Conference in New York just wound up. It was a five-day expo of business, data, and tech, and it brought a ton of great people in the data community to New York.

Thanks so much to Edd and Alistair and everyone whose hard work made this possible!

My talk, Short URLs, Big Data: Learning in Realtime, is already online:

And the slides are up on Slideshare:


bash: get http response codes for a list of URLs

I had a file with a list of URLs, and I wanted to grab the HTTP response codes for each of them. I’m sure this quick bash script isn’t the best way to do it, but it works, and I’ll probably want to do this again someday, so here it is!

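#!/bin/bash
# Usage: pass a file with one URL per line as the first argument.

while IFS= read -r line
do
    # Print only the status code; the response body is discarded.
    curl --write-out '%{http_code}\n' --silent --output /dev/null "$line"
done < "$1"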
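
The script above prints bare codes, which gets hard to read on a long list. Here is a small variation that prints each code next to its URL and gives up on slow hosts. It's only a sketch: the function name status_codes and the 10 second timeout are my own choices for illustration, not part of the original script.

```bash
#!/bin/bash
# Hypothetical helper: print "CODE URL" for every URL listed in the file "$1".
status_codes() {
    local url
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue            # skip blank lines
        # --max-time keeps one slow host from stalling the whole run;
        # curl reports 000 when it never received an HTTP response.
        printf '%s %s\n' \
            "$(curl --silent --output /dev/null --max-time 10 \
                    --write-out '%{http_code}' "$url")" \
            "$url"
    done < "$1"
}
```

Call it as `status_codes urls.txt`, and pipe the result through `sort | uniq -c` or `grep -v '^200'` if you only care about the failures.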

What do you read that changes the way you think?

A friend asked me which of three startup business books she should read. Obama’s reading list since entering office has nothing surprising on it.

The most valuable books I read this year have been stories of things very different from what I spend most of my time thinking about.

One of my favorites was China Miéville’s The City & The City, which I loved for the ambition and artistry, and another was Simon Winchester’s The Meaning of Everything: The Story of the Oxford English Dictionary, which I loved for the descriptions of creating an analog, scalable information system.

What have you read recently that was really great?

Edit: Thanks for the recommendations! There are also a bunch over on Google Plus.


Uses This

I’m honored to have my tools of choice featured on Uses This!


My Head is Open Source!

Last night I visited friends at MakerBot, where artist-in-residence Jonathan Monaghan scanned my head with a high-resolution laser scanner.

The model is available on Thingiverse and can be printed on your friendly neighborhood MakerBot or other 3D printer.

There are lots of other awesome models of people and things to play with, including Stephen Colbert’s head.

I look forward to the emergence of plastic clone head armies!

Edit: Please note: thanks for asking, but brains are not included.


An Introduction to Machine Learning with Web Data is now available!

I’m really excited that An Introduction to Machine Learning with Web Data is now available for purchase!

This is a 2 hour and 43 minute instructional video that walks you through basic machine learning algorithms, first theoretically and mathematically, and then with Python example code (which is available here).

This video takes a more instructional approach and builds on the material I covered in my Strange Loop 2010 keynote Machine Learning: A Love Story and the Data Bootcamp I did with Joe Adler, Drew Conway, and Jake Hofman at the Strata Conference in February.

I’d also like to acknowledge the many collaborators, colleagues, and friends who have made definite contributions to my thinking about this material and how best to present it, particularly Chris Wiggins who co-authored A Taxonomy of Data Science and Andrew, Dennis, Jan, Jesse, and Julie, the members of the studio audience for the class (who were amazing).

If you like it, please leave it a good review! As always, questions and comments are welcome here or by e-mail.


How to get a random line from a file in bash.

I work with a lot of data, and while I’d like to pretend it’s all in upside-down quasi-indexed b-tree rocket ships or some other advanced database, the truth is that much of it is in text files. I often find myself wanting to see a random line from one of these files, just to get a sense of what the data looks like.

I thought there must be an easy bash way to do this, but I couldn’t find it (‘shuf’ isn’t installed on my server), so I turned to Twitter, and now I’m pleased to present more methods for finding a random line than you ever expected!

sort -R | head -n 1

If you can use this, do so! If it isn’t available, consider one of the following commands:

@andrewgilmartin suggests using awk:

awk 'BEGIN { srand() } rand() >= 0.5 { print; exit }'
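
One caveat on the awk version: it prints the first line whose draw comes up 0.5 or higher, so the first line wins about half the time and lines deep in the file are almost never chosen. If a uniform pick matters, reservoir sampling is one option. This is a sketch, not @andrewgilmartin's suggestion, and the function name random_line is made up for illustration.

```bash
# Hypothetical helper: keep line number NR with probability 1/NR, which leaves
# every line equally likely to be the one still held when the input ends.
random_line() {
    awk 'BEGIN { srand() }
         rand() * NR < 1 { pick = $0 }
         END { if (NR) print pick }' "$@"
}
```

It works on a file (`random_line data.txt`) or on a pipe. Note that awk's srand() seeds from the clock, so two calls within the same second will usually return the same line.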

@devinteske offered one of the easiest solutions to read:

tail -$((RANDOM/(32767/`wc -l</etc/group|tr -d ' '`))) /etc/group|head -1

@terrycojones piped up with this gem:

split -l 1 < file; cat `for i in x*; do echo $RANDOM $i; done | sort -n | cut -f2 -d' ' | head -n 1`; rm x*

@FirefighterBlu3 does sed++:

file=/etc/passwd; lc="$(($RANDOM % $(wc -l $file|awk '{print $1}') + 1))"; sed -n "${lc}p" $file

@burleyarch collects the whole set:

f=YOUR_FILE; n=$(expr $RANDOM \* `cat $f | wc -l` \/ 32768 + 1); head -n $n $f | tail -1

All of the options using $RANDOM should be used with the understanding that its maximum value is 32767, so on files with more than 32,768 lines they cannot reach every line.
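
One way around that ceiling, offered here as my own sketch rather than something from the thread, is to glue two $RANDOM draws together into a 30-bit number before taking the remainder. The function name random_line_big is hypothetical.

```bash
# Hypothetical helper: pick a random line from a file of any reasonable length.
random_line_big() {
    local file=$1 lines n
    # The outer $(( )) strips the padding some wc implementations add.
    lines=$(( $(wc -l < "$file") ))
    [ "$lines" -gt 0 ] || return 1          # nothing to pick from
    # Two 15-bit draws combined give a value between 0 and 1073741823.
    n=$(( ((RANDOM << 15) | RANDOM) % lines + 1 ))
    # Print line n and stop reading right away.
    sed -n "${n}{p;q;}" "$file"
}
```

The remainder still leans very slightly toward earlier lines, which is fine for eyeballing data but not for anything statistical.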

@xn with an excellent use of cut:

awk 'BEGIN { OFS="\t"; srand() } { print rand(), $0 }' | sort -n | cut -f2- | head -1

@paulrbrown with a badass example of od:

echo `cat /dev/urandom | od -N4 -An -i `' % '`wc -l < file` | bc | sed 's/-//g' | xargs -I % head -n % file | tail -n 1

And finally, from @alexlines, who actually developed his solution into a blog post:

dd if=file skip=$(expr $(date +%N) \% $(stat -c "%s" file)) ibs=1 count=200 2>/dev/null|sed -n '2{p;q;}'

And, of course, @ceonyc brought some comic relief:

@hmason Good bash one-liner? Take my code, please.


Gitmarks: a peer-to-peer bookmarking system

Several months ago I was looking for a command-line solution for group bookmark sharing. I couldn’t find one, so I coded up a quick Python script that runs on top of git. It’s very much a hack that takes advantage of git to manage users and to preserve the URL, the tags, and the description of the URL (in the commit message), and it also stores the content itself (so it’s grep-able later). If you put it on GitHub, you get the additional commenting and collaboration features. You can check out my original code here.
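
To make the idea concrete, here is a rough sketch of the approach in bash. It is not the actual gitmarks code (which is Python); the function name, the file layout, and the commit-message format are all assumptions chosen for illustration.

```bash
# Hypothetical sketch of a git-backed bookmark, not the real gitmarks script.
# Usage: gitmark REPO_DIR URL "description" [tag ...]
gitmark() {
    local repo=$1 url=$2 desc=$3
    shift 3
    local tags="$*" name
    # Name the saved copy after a checksum of the URL (an arbitrary choice).
    name=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
    mkdir -p "$repo/content" || return 1
    # Keep the page itself so it can be grepped later.
    curl --silent --location --fail --output "$repo/content/$name" "$url" || return 1
    # One line per bookmark: the URL, a tab, then the tags.
    printf '%s\t%s\n' "$url" "$tags" >> "$repo/bookmarks.tsv"
    # The description becomes the commit subject; URL and tags go in the body.
    git -C "$repo" add bookmarks.tsv "content/$name" &&
        git -C "$repo" commit --quiet -m "$desc" -m "URL: $url" -m "Tags: $tags"
}
```

With something like this, `git log --grep` searches descriptions, `grep -r` over content/ searches the saved pages, and sharing with a group is just a matter of pushing to and pulling from a common remote.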

I’m very excited that Far McKon has picked up the project and has a great vision for where it can go. If you’re interested in hacking on it with him, let him know!


Folks: I’m working on a p2p bookmark sharing based on @hmason’s code. Python/git based. Want to help? #opensource @openhatch


Be Ballsy.

The things that are hardest to make yourself do are often the ones that end up being the most rewarding.

(By rewarding I mean they lead to the kinds of experiences where you learn something new, get to meet amazing people, and generally have opportunities to do things you never would have imagined doing before. Rewarding does not only mean money.)

I was thinking about this at PyCon, after someone asked me how it felt, as a woman, to get up and give a technical talk to approximately 1,400 men. Five years ago I couldn’t have imagined myself doing something like that, but a series of small chances, risks, and experiences have led me here. And it was a ton of fun!

I thought about this again when Bryce sent me a link to this post. OATV is hiring an analyst, and they haven’t received applications from women. The OATV team is a wonderful, smart, and energetic group of people and I’m certain that working with them would be a mind- and life-changing experience. To anyone who would be interested, but won’t apply because it’s a reach — forget about that, and apply now.

My mother always asked me, “What’s the worst that could happen?” It turns out that the worst possible outcome is better than doing nothing risky at all.


Interview on Silicon Angle TV

You can catch an interview (or see a writeup) that I did live from the Strata Conference on Silicon Angle TV! We talk about bit.ly data, politics, and touch briefly on some of the interesting problems that we’re working on.

Full video:

[I removed the video embed because it was annoyingly auto-playing in some browsers. You can still see the video here.]