I’m excited that my short book, Data Driven: Creating a Data Culture, co-authored with DJ Patil, is out in the world!
We talk about processes and qualities of strong data teams and how to design for these cultural practices in an organization.
The book is available for free on O’Reilly’s site, and soon on Amazon. I hope you enjoy it.
Something fun to start off the new year — I found myself in Google Street View!
I spent a few minutes this week putting together a quick script to pull data from the Locu API. Locu has done the hard work of gathering and parsing menus from around the US and has a lot of interesting data (and a good data team).
The API is easy to query by menu item (like “cheeseburger”, my favorite) and by running my little script I quickly had data for the prices of cheeseburgers in my set of zip codes (the 100 most populated metro areas in the US).
I’m a big fan of Pete Warden’s OpenHeatMap tool for making quick map visualizations, and was able to come up with the following:
The blue map is the average price of a cheeseburger by zip code, with the red one showing the average price of pizza. The most expensive average cheeseburger can be found in Santa Clara, CA, ironically the city hosting the Strata data science conference this week. Have fun with those $18 cheeseburgers, colleagues!
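The averaging step behind these maps can be sketched roughly as follows. This is not the original script: it assumes the API results have already been flattened into (zip code, price) pairs, and the function name and sample prices are made up for illustration.

```python
from collections import defaultdict


def average_price_by_zip(items):
    """Average price per zip code from (zip_code, price) pairs.

    The pair format is an assumption for this sketch; the real API
    response would need to be flattened into this shape first.
    """
    totals = defaultdict(lambda: [0.0, 0])  # zip -> [running sum, count]
    for zip_code, price in items:
        totals[zip_code][0] += price
        totals[zip_code][1] += 1
    return {z: total / count for z, (total, count) in totals.items()}


# Hypothetical sample data, not real menu prices.
sample = [("95050", 18.00), ("95050", 16.00), ("10001", 9.50)]
averages = average_price_by_zip(sample)
```

The resulting dictionary maps each zip code to one number, which is the shape a mapping tool needs for a per-zip heat map.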
You can also see some fun words in the pizza topping options:
In this plot, the x-axis is roughly geographic (ordered by zip code) and the y-axis is in order of popularity, with pepperoni being the most popular common pizza topping, and anchovies among the least.
This is just a quick look at some data, but hopefully it’ll encourage you to play with your food (data)!
It turns out that it’s pretty easy to co-opt Twitter’s Lead Generation card for anything where you want to gather a bunch of e-mail addresses from your Twitter community. I was looking for people willing to alpha test a little side project of mine, and it worked great and didn’t cost anything.
The tweet itself:
Love tech discussion but looking for a better community? Help me beta test a side project! https://t.co/H3DYjbCy19
— Hilary Mason (@hmason) December 12, 2013
I created it pretty easily:
- First, go to ads.twitter.com, log in, and go to “creatives”, then “cards”.
- Click “Create Lead Generation Card”. It’s a big blue button.
- You can include a title and a short description. Curiously, you can also include a 600px by 150px image. This seems like an opportunity to say a bit more about what you’re doing.
- You also need to configure a fallback URL, which is where people will go if they don’t have a Twitter client capable of the one-click signup. I used a Google form, which let people give me their e-mail addresses directly.
And that’s it! Tweet enthusiastically, then wait patiently, because if you don’t integrate your Twitter card with your CRM, you have to wait ~24 hours for the download link to appear in the Twitter cards manager. The resulting CSV looks like this:
Timestamp,User id,Name,Twitter handle,Email
2013-12-12T23:36:05,774485611,Robots Rule,RobotzRule,email@example.com
A little bit of awk later and I had a list of e-mails ready to go. I ended up getting 49 responses through the Google form and 197 through the Twitter card. It was easy and I’ll definitely do this next time I need to collect people’s e-mail addresses for a project.
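I used awk, but the same extraction is easy to sketch in Python. The column names below come from the CSV header shown above; the function name is just illustrative, and this assumes the export is a plain comma-separated file with a header row.

```python
import csv
import io


def extract_emails(csv_text):
    """Return the non-empty values of the Email column from the CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Email"].strip() for row in reader if row.get("Email")]


# The sample row from the export format shown in the post.
sample_csv = (
    "Timestamp,User id,Name,Twitter handle,Email\n"
    "2013-12-12T23:36:05,774485611,Robots Rule,RobotzRule,email@example.com\n"
)
emails = extract_emails(sample_csv)
```

Using a CSV parser rather than splitting on commas also copes with names that are quoted because they contain commas themselves.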
If you’ve had a talk proposal accepted or been invited to speak at an event, you’ll usually get a chance to chat with the organizers before you show up to give your talk.
While you probably have a good idea of the topic of your talk (if you don’t, that’s a post for another day!), event organizers can be invaluable in helping you frame a talk that will succeed with their audience. They are on your side and they want you to do great, or they wouldn’t be hosting you at their event.
These are two questions that I always ask the organizers before I speak.
Question 1: Who will be in the audience?
Knowing the basic demographics of the audience is necessary to make sure you’re speaking at the right level and tuning the cultural references and humor for the room. I often speak to audiences of highly technical engineers and to audiences of business folks about the same topics. These are very different talks.
You may already have a good sense of who will be at the event, but getting the organizer to tell you explicitly also tells you which population they are crafting the event to serve. It’s helpful to know who they consider to be the most important people in the room.
Question 2: What does a win for my talk look like to you?
This question prompts the organizers to tell you what they are hoping people in the audience will take away from your talk. Their response gives you more information about how you can successfully fit your talk into the overall event and specific goals.
For example, responses I’ve gotten have ranged from “I want people to feel inspired”, which tells me to emphasize the forward-looking optimistic topics that I plan to talk about, to “I hope they learn one practical trick they can use in their work immediately”, which tells me to focus on clarifying specific techniques, and so on.
The event organizers know their event better than you do, so anything you can learn from them ahead of time will be useful.
Yesterday I asked people on Twitter for recommendations for things to read to improve as a programmer. I’m looking mainly for things on the philosophy side of software engineering. I do realize that practice is the most important thing, but sometimes you run into a design question and it’s always helpful to realize that very smart people have, indeed, thought about these things before.
If you see something that you think should be included, please do let me know in the comments and I’ll add it to the list.
The talks offer a wide perspective on the interesting work happening around data in New York, and I believe you’ll enjoy all of them!
The New York Times has a story this morning on the growing use of mugshot data for, essentially, extortion. These sites scrape mugshots off of public records databases, use SEO techniques to rank highly in Google searches for people’s names, and then charge those featured in the image to have the pages removed. Many of the people featured were never even convicted of a crime.
What the mugshot story demonstrates but never says explicitly is that data is no longer just private or public, but often exists in an in-between state, where the public-ness of the data is a function of how much work is required to find it.
Let’s say you’re actually doing a background check on someone you are going on a date with (one of the use cases the operators of these sites claim is common). Before online systems, you could physically go to the various records offices, sometimes in each town, to request information about them. Given that there are ~20,000 municipalities in the United States, just doing a check would take the unreasonable investment of days.
Before mugshot sites, you had to actually visit each state’s database, figure out how to query it, and assemble the results. Now we’re looking at an investment of hours, instead of days. It’s possible, but you must be highly motivated.
Now you just search, and this information is there. It is just as public as it was before, but the cost to access has become a matter of seconds, not hours or days, and we could imagine that you might be googling your date to find something else about him and instead stumble on the mugshot image. The cost of accessing the data is so trivial that it can come up as part of an adjacent task.
The debate around fixing this problem has focused on whether the data should be removed from the public entirely. I’d like to see this conversation reframed around how we maintain the friction and cost to access technically public data such that it is no longer economically feasible to run these sorts of aggregated extortion sites while still maintaining the ability of journalists and concerned citizens to explore the records as necessary for their work.
It’s easy to use:
b = Beacon()
print(b.last_record())
print(b.previous_record())
# and so on
There’s also a handy generator for getting a set of n random numbers.
(One of the best gifts I ever got was a copy of 1,000,000 Random Numbers, and I’ve been intrigued ever since.)
Please note that the randomness beacon is not intended to be a source of cryptographic keys. Indeed, it’s a public set of numbers, so I wouldn’t recommend doing anything that could be compromised by someone else having access to the exact same set of numbers. Rather, this is interesting precisely for the scientific opportunities that are possible when you have a random but public set of inputs.
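As one illustration of what a random but public input allows, here is a minimal sketch of a selection that anyone can re-run and verify. It uses only Python's standard library rather than the beacon wrapper above, and the hex string is a made-up placeholder, not a real beacon value.

```python
import random


def public_sample(population, k, beacon_hex):
    """Pick k items using a published hex value as the seed.

    Anyone holding the same population and the same public value can
    reproduce the selection, which is the point. Not for secrets.
    """
    seed = int(beacon_hex, 16)
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(population), k)


# Placeholder value for illustration only; not an actual beacon output.
beacon_value = "0123456789ABCDEF"
participants = ["ana", "bo", "cy", "dee", "eli"]
chosen = public_sample(participants, 2, beacon_value)
```

Because the seed is public, this is the opposite of a secret draw: the value lies in everyone being able to check the result.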
Everyone does realize that it's not about teaching people to CODE as much as it is about teaching people to THINK … right?
— Hilary Mason (@hmason) September 17, 2013
I’m a huge fan of the movement to teach people, especially kids, to code.
When you learn to code, you’re learning to think precisely and analytically about a quirky world. It doesn’t really matter which particular technology you learn, as long as you are learning to solve the underlying logical problems. If a student becomes a professional engineer, their programming ability will rise above the details of the language, anyway. And if they don’t, they will have learned to reason logically, a skill that’s invaluable no matter what they end up doing.
That you can apparently complete a three-month Ruby bootcamp and get a job today is an artifact of a bizarre employment market, and likely unsustainable. But by dedicating three months to learning to think in a logical framework, you’ll also gain an ability that will open opportunities for you for the rest of your life.