Data Driven: Creating a Data Culture


I’m excited that my short book, Data Driven: Creating a Data Culture, co-authored with DJ Patil, is out in the world!

We talk about processes and qualities of strong data teams and how to design for these cultural practices in an organization.

The book is available for free on O’Reilly’s site, and soon on Amazon. I hope you enjoy it.

Need actual random numbers? Meet the NIST randomness beacon.

I wrote a Python module that wraps the NIST Randomness Beacon, making it simple to get truly random numbers in Python.

It’s easy to use:

b = Beacon()  # Beacon is the class the module provides
print(b.last_record())      # most recent beacon record
print(b.previous_record())  # the record before that
# and so on

There’s also a handy generator for getting a set of n random numbers.

(One of the best gifts I ever got was a copy of 1,000,000 Random Numbers, and I’ve been intrigued ever since.)

Please note that the randomness beacon is not intended to be a source of cryptographic keys. Indeed, it’s a public set of numbers, so I wouldn’t recommend doing anything that could be compromised by someone else having access to the exact same set of numbers. Rather, this is interesting precisely for the scientific opportunities that are possible when you have a random but public set of inputs.
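As one illustration of what a random but public input makes possible, here is a minimal sketch of reproducible sampling: anyone holding the same public value can re-derive the same selection and check it. The function name `sample_from_public_value` and the placeholder hex string are assumptions made for this example. They are not part of the module above, and the value is not real beacon output.

```python
import hashlib
import random


def sample_from_public_value(public_hex, population, k):
    """Pick k items so that anyone with the same public value gets the same pick."""
    # Hash the public value so the seed has a fixed size regardless of input length.
    digest = hashlib.sha256(bytes.fromhex(public_hex)).digest()
    seed = int.from_bytes(digest, "big")
    # A dedicated Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return rng.sample(population, k)


# Placeholder value for illustration only (NOT real beacon output).
example_value = "00ff" * 32
items = list(range(100))

first = sample_from_public_value(example_value, items, 5)
second = sample_from_public_value(example_value, items, 5)
print(first == second)  # both runs select the same five items
```

Because the input is public, this is useful for verifiability, not secrecy, which matches the caveat above about cryptographic keys.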

One Random Tweet, please.

One random tweet.

It’s easy to believe that other people use social networks in the same way that you do. Your friends largely do use them the same way, which makes your perspective even more biased.

Unfortunately, most networks don’t provide a way to explore a representative sample of communications from people you’re not connected to.

Well, now you can! One random tweet, please.

Update: There were some slight technical difficulties due to hitting Twitter’s oembed rate limit. They should be repaired now.

(Note: between this and I’m on a bit of a random kick lately. There’s a method to this madness!)

Introbot: A Script to Ease the Process of Writing Introductory E-mails

The introductory e-mail is a message where I introduce two (or more) people who have yet to meet each other. It generally takes the highly structured form, “Salutation A and B! A does X. B does Y. You should meet for reason Z. Valediction.”

I almost always do opt-in intros, where I’ll write to each party separately and make sure it’s okay if I share their information, and explain why I think it’s worth their time. I find this approach to be more respectful of people’s privacy and busy schedules.

That means that by the time they get the formal introduction, they generally know what’s going on. Still, I find these messages peculiarly stressful.

Stressful task? Check. Highly structured output? Check. Repeating the same information over and over? Check. This calls for … a script!

You can grab the code on GitHub here.

The first step is to set your valediction and names in the file, then to add people that you want to introduce with a brief description. Finally, you need only type something like:

python alan betty

to generate and copy to your clipboard (on a Mac, anyway):

Alan & Betty, please meet.

Alan is the fake director at fake company, where he does fake things.

Betty is the fake person who does other fake interesting things.

I think you'll find quite a lot to talk about.



Paste it into your favorite e-mail client, send, and relax.

This is how I mentally organize introductions, but I have no idea if it’ll work for anyone else. Would you ever use something like this? What does it need to be useful for you?
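For readers who want to see the shape of the idea without opening the repository, here is a minimal sketch of the templating step. It is not the actual script: the names `PEOPLE`, `VALEDICTION`, and `build_intro` are assumptions made for this example, and the clipboard step is left out.

```python
# Hypothetical configuration: a valediction plus a short description per person.
VALEDICTION = "I think you'll find quite a lot to talk about."

PEOPLE = {
    "alan": ("Alan", "the fake director at fake company, where he does fake things"),
    "betty": ("Betty", "the fake person who does other fake interesting things"),
}


def build_intro(keys, people=PEOPLE, valediction=VALEDICTION):
    """Assemble the salutation, one sentence per person, and the valediction."""
    names = [people[key][0] for key in keys]
    paragraphs = [" & ".join(names) + ", please meet."]
    for key in keys:
        name, description = people[key]
        paragraphs.append(f"{name} is {description}.")
    paragraphs.append(valediction)
    # Blank lines between paragraphs, as in the generated e-mail above.
    return "\n\n".join(paragraphs)


intro = build_intro(["alan", "betty"])
print(intro)
```

Keeping the descriptions in one dictionary means each person is written up once and reused across introductions.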

Need Data? Start Here

Data scientists need data, and good data is hard to find. I put together this bitly bundle of research-quality data sets to collect as many useful data sets as possible in one place. The list includes such exciting and diverse things as spam, belly buttons, item pricing, social media, and face recognition, so there’s something in it to intrigue anyone.

Have one to add? Let me know!

(I’ve shared the bundle before, but this post can act as an unofficial homepage for it.)

Book Book — Goose!

I like to read. I love bookstores. I like to wander, and to find things that I didn’t know existed. But bookstores don’t have every book that exists. Amazon has most books, but search is a terrible way to discover new things. Amazon’s recommendations most likely maximize purchases, but are a terrible way to find something you didn’t know you were looking for (look at a book like Effective JavaScript, for example, and you get recommendations for Async JavaScript, Building Node Applications with MongoDB and Backbone, and JavaScript Enlightenment). Similarly, top 100 lists are great at showing you popular things that you’re probably more likely to buy, but not very good at helping you find a book with a story or idea that’s unlike anything you’ve read lately.

There must be a better way to explore books.

A random way to explore books would be a good way to start.

Hence, bookbookgoose. Browse randomly. Enjoy!

Hint: use the ‘n’ key to go forward quickly. I find about 0.2% of the books are awesome.

Update: you can now find @bookbookgoose on Twitter, sharing one random book per hour.

Update: Dustin Kurtz at Melville House had an eloquent writeup of the beauty in this random literature.

DataGotham: The Empire State of Data

I’m extremely excited about DataGotham, a conference that I’m co-hosting with friends and fellow New York data nerds Drew, John, and Mike.

DataGotham is a celebration of the NYC data community, and will bring together professionals from all industries in New York that are built around data, from finance to fashion and from startups to the Fortune 500 and government. The event is September 13th – 14th at NYU, with tutorials and The Great Data Extravaganza Show (with cocktails!) at the Tribeca Rooftop Thursday evening, and a single track conference Friday. Our speakers and sponsors are all amazing. You can register now.

While DataGotham is definitely a labor of love, there are numerous reasons to do it. I believe that New York has a distinct data philosophy, centered on the study of human behavior, that should be celebrated. We have a large population of local badass data hackers, and our community will only grow stronger if we can build relationships across the industry divides. Finally, there’s an opportunity for all of us to influence the future of data science, and this event will highlight some voices that might not otherwise be heard.

I hope to see you there!

(Also, anyone who made it this far can register with code “dataGothamist” for 25% off :) )

Hacking the Food System: The Ultimate Chocolate Chip Cookie

Food+Tech Connect is putting together a fun series of essays where technologists and foodies share their opinions on how to hack the food system.

They also had a great party, with liquid nitrogen ice cream and other very cool foods.

I’m honored to have been asked to participate, especially since food and tech are two of my favorite things! I decided to write about a hack that I did about three years ago, where I wrote a parser and built a statistical model of chocolate chip cookie recipes that I crawled off of the web.

I’d like to tell you the story of the Ultimate Chocolate Chip Cookie Recipe.

This isn’t the Neiman Marcus $65,000 cookie recipe. Nor is it the classic Toll House Chocolate Chip Cookie recipe that we all grew up with (and, though the instructions are all the same, my Mom made the best). This is a recipe learned from thousands of bakers around the world, via love and math.

Read the rest of the essay on their site. I’d love to know what you think!

Special thanks to Matt LeMay for language-clarifying edits to the piece.

Gitmarks: a peer-to-peer bookmarking system

Several months ago I was looking for a command-line solution for group bookmark sharing. I couldn’t find one, so I coded up a quick Python script that runs on top of git. It’s very much a hack that takes advantage of git to manage users and to preserve the URL, the tags, and the description of the URL (in the commit message). It also stores the content itself, so it’s grep-able later. If you put it on GitHub, you get the additional commenting and collaboration features. You can check out my original code here.
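The general shape of bookmarking on top of git can be sketched as follows. This is an illustration under stated assumptions, not the original script: the function names, the hashed filename scheme, and the commit message layout are all invented for this example, and the page content is passed in rather than fetched.

```python
import hashlib
import subprocess
from pathlib import Path


def content_filename(url):
    """Derive a stable filename for the saved page from its URL (assumed scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def commit_message(url, tags, description):
    """Put the description, URL, and tags in the commit message (assumed layout)."""
    return f"{description}\n\nurl: {url}\ntags: {', '.join(tags)}"


def save_bookmark(repo_dir, url, tags, description, content):
    """Write the page content into an existing git repository and commit it."""
    repo = Path(repo_dir)
    path = repo / content_filename(url)
    # Storing the content itself is what makes the bookmarks grep-able later.
    path.write_text(content, encoding="utf-8")
    subprocess.run(["git", "add", path.name], cwd=repo, check=True)
    subprocess.run(
        ["git", "commit", "-m", commit_message(url, tags, description)],
        cwd=repo,
        check=True,
    )
    return path


print(commit_message("http://example.com", ["demo", "test"], "An example page"))
```

With this layout, `git log --grep` searches descriptions and tags, while plain `grep` over the working tree searches the saved pages.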

I’m very excited that Far McKon has picked up the project and has a great vision for where it can go. If you’re interested in hacking on it with him, let him know!

Folks: I’m working on a p2p bookmark sharing based on @hmason ‘s code. Python/git based. Want to help? #opensource @openhatch

A quick twitter bot, @bc_l

Several months ago, on a whim inspired by an off-hand comment from Chris, I created a bot to bring the wonders of the Unix bc language to Twitter.

bc is a command-line calculator that’s fast and has the capacity to do some fairly complex math.

Try it out on the command line:

echo '100 / 10' | bc -l

…Or by sending a direct message to bc_l (if you follow bc_l it will follow you back within a few hours).
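The evaluation step that a bot like this needs can be sketched in Python by piping an expression to `bc -l`, just as the shell example does. This is a hedged sketch, not the bot’s actual code: the helper name `run_bc` and the timeout value are assumptions, and the Twitter side is omitted entirely.

```python
import shutil
import subprocess


def run_bc(expression, timeout=5):
    """Evaluate an expression with `bc -l` and return its output as a string."""
    result = subprocess.run(
        ["bc", "-l"],
        input=expression + "\n",  # bc reads the expression from standard input
        capture_output=True,
        text=True,
        timeout=timeout,  # guard against expressions that never terminate
    )
    return result.stdout.strip()


# Only call bc if it is actually installed on this machine.
if shutil.which("bc"):
    print(run_bc("100 / 10"))
```

The timeout matters for anything that accepts input from strangers, since bc will happily run a loop that never ends.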

I released the code under GPL, and it’s available on GitHub.

John Cook mentions the bot and makes some great observations in his post three surprises with bc.