Please join us for DataGotham 2014! We’ll be hosting the conference Friday, Sep 26th at the Broad Street Ballroom in the financial district in New York City. We look forward to getting New York’s data community together and having a great time.
The Call for Proposals is now open, and we want you to speak. Our speakers come from every industry and type of background where people use data, and the best talks are ones where you tell us your real experience with a data project. What problem were you trying to solve? What went wrong? What went right? How does the story end? Or did it?
If you think you might have a story to tell but aren’t sure, drop me a note. I’m happy to help you frame a successful talk, and we welcome speakers at all levels of experience. Some of our most popular talks have been from first-time speakers!
I spent a few minutes this week putting together a quick script to pull data from the Locu API. Locu has done the hard work of gathering and parsing menus from around the US and has a lot of interesting data (and a good data team).
The API is easy to query by menu item (like “cheeseburger”, my favorite) and by running my little script I quickly had data for the prices of cheeseburgers in my set of zip codes (the 100 most populated metro areas in the US).
I’m a big fan of Pete Warden’s OpenHeatMap tool for making quick map visualizations, and was able to come up with the following:
The blue map is the average price of a cheeseburger by zip, with the red one showing the average price of pizza. The most expensive average cheeseburger can be found in Santa Clara, CA, ironically the city currently hosting the Strata data science conference this week. Have fun with those $18 cheeseburgers, colleagues!
You can also see some fun words in the pizza topping options:
In this plot, the x-axis is roughly geographic (ordered by zip code) and the y-axis is in order of popularity, with pepperoni being the most popular common pizza topping, and anchovies among the least.
This is just a quick look at some data, but hopefully it’ll encourage you to play with your food (data)!
It turns out that it’s pretty easy to co-opt Twitter’s Lead Generation card for anything where you want to gather a bunch of e-mail addresses from your Twitter community. I was looking for people willing to alpha test a little side project of mine, and it worked great and didn’t cost anything.
The tweet itself:
Love tech discussion but looking for a better community? Help me beta test a side project! https://t.co/H3DYjbCy19
— Hilary Mason (@hmason) December 12, 2013
I created it pretty easily:
- First, go to ads.twitter.com, log in, and go to “creatives”, then “cards”.
- Click “Create Lead Generation Card”. It’s a big blue button.
- You can include a title and a short description. Curiously, you can also include a 600px by 150px image. This seems like an opportunity to say a bit more about what you’re doing.
- You also need to configure a fallback URL, which is where people will go if they don’t have a Twitter client capable of the one-click signup. I used a Google form, which let people give me their e-mail addresses directly.
And that’s it! Tweet enthusiastically, then wait patiently, because if you don’t integrate your Twitter card with your CRM, you have to wait ~24 hours for the download link to appear in the Twitter cards manager. The resulting CSV looks like this:
Timestamp,User id,Name,Twitter handle,Email 2013-12-12T23:36:05,774485611,Robots Rule,RobotzRule,firstname.lastname@example.org
A little bit of awk later and I had a list of e-mails ready to go. I ended up getting 49 responses through the Google form and 197 through the Twitter card. It was easy and I’ll definitely do this next time I need to collect people’s e-mail addresses for a project.
Yesterday I asked people on twitter for recommendations for things to read to improve as a programmer. I’m looking mainly for things on the philosophy side of software engineering. I do realize that practice is the most important thing, but sometimes you run into a design question and it’s always helpful to realize that very smart people have, indeed, thought about these things before.
If you see something that you think should be included, please do let me know in the comments and I’ll add it to the list.
The talks are a wide perspective on the interesting work happening around data in New York, and I believe you’ll enjoy all of them!
The New York Times has a story this morning on the growing use of mugshot data for, essentially, extortion. These sites scrape mugshots off of public records databases, use SEO techniques to rank highly in Google searches for people’s names, and then charge those featured in the image to have the pages removed. Many of the people featured were never even convicted of a crime.
What the mugshot story demonstrates but never says explicitly is that data is no longer just private or public, but often exists in an in-between state, where the public-ness of the data is a function of how much work is required to find it.
Let’s say you’re actually doing a background check on someone you are going on a date with (one of the use cases the operators of these sites claim is common). Before online systems, you could physically go to the various records offices, sometimes in each town, to request information about them. Given that there are ~20,000 municipalities in the United States, just doing a check would take the unreasonable investment of days.
Before mugshot sites, you had to actually visit each state’s database, figure out how to query it, and assemble the results. Now we’re looking at an investment of hours, instead of days. It’s possible, but you must be highly motivated.
Now you just search, and this information is there. It is just as public as it was before, but the cost to access has become a matter of seconds, not hours or days, and we could imagine that you might be googling your date to find something else about him and instead stumble on the mugshot image. The cost for accessing the data is so trivial that can come up as part of an adjacent task.
The debate around fixing this problem has focused on whether the data should be removed from the public entirely. I’d like to see this conversation reframed around how we maintain the friction and cost to access technically public data such that it is no longer economically feasible to run these sorts of aggregated extortion sites while still maintaining the ability of journalists and concerned citizens to explore the records as necessary for their work.
Everyone does realize that it's not about teaching people to CODE as much as it is about teaching people to THINK … right?
— Hilary Mason (@hmason) September 17, 2013
I’m a huge fan of the movement to teach people, especially kids, to code.
When you learn to code, you’re learning to think precisely and analytically about a quirky world. It doesn’t really matter which particular technology you learn, as long as you are learning to solve the underlying logical problems. If a student becomes a professional engineer, their programming ability will rise above the details of the language, anyway. And if they don’t, they will have learned to reason logically, a skill that’s invaluable no matter what they end up doing.
That you can apparently complete a three month Ruby bootcamp and get a job today is an artifact of a bizarre employment market, and likely unsustainable. But by dedicating three months to learning to think in a logical framework, you’ll also gain an ability that will open opportunities for you for the rest of your life.
Registration is open for DataGotham 2013, our second annual New York data community conference, September 12th and 13th. The core of the conference is a series of brilliant data practitioners telling the stories about what they work on. The content is technically-oriented but not all deeply technical, and we really welcome anyone curious about how New York companies and institutions are pushing the boundaries on data to attend.
We have two goals for the conference. The primary goal is to connect people in the greater New York data community who are working on interesting things. If our community is strong and supportive, we will all do better work.
Our second goal is to highlight the amazing working happening here, so that people near and far will realize that New York is the best place in the world to do data science.
Come join us to hear these stories firsthand and meet fellow data-minded practitioners! Register now:
(Readers of this blog can use discount code “IheartNYC” for 10% off, and I hope to see you there!)
In 2008, cuil, a search engine startup, displayed my bio alongside a photo of deceased actress Hilary Mason. In January 2013, Bing confused us, this time putting my photo next to her bio (they fixed it after a suitable amount of mocking on Twitter).
Today I win the internet?
If you zoom in on the bio section, you can clearly see that it’s her bio with a photo of me (originally from Crain’s New York 40 under Forty). Further, if you go into her filmography, you continue to see my photo.
I’m most proud of my starring role in the amazing film Robot Jox. (bottom right of the image below)
I know that entity disambiguation is a hard problem. I’ve worked on it, though never with the kind of resources that I imagine Google can bring to it. And yet, this is absurd!
Data engineering is when the architecture of your system is dependent on characteristics of the data flowing through that system.
It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it.
Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data, before you design the system (and perhaps even requiring an active experimental process where you try multiple infrastructure options in the wild before making a final decision). I’ve seen numerous people run straight into walls when they ignore this research requirement.
Forget Table is one example of a data engineering project from our work at bitly. It’s a database for storing non-stationary categorical distributions. We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on github).