The New York Times has a story this morning on the growing use of mugshot data for, essentially, extortion. These sites scrape mugshots off of public records databases, use SEO techniques to rank highly in Google searches for people’s names, and then charge those featured in the image to have the pages removed. Many of the people featured were never even convicted of a crime.
What the mugshot story demonstrates but never says explicitly is that data is no longer just private or public, but often exists in an in-between state, where the public-ness of the data is a function of how much work is required to find it.
Let’s say you’re actually doing a background check on someone you are going on a date with (one of the use cases the operators of these sites claim is common). Before online systems, you had to physically go to the various records offices, sometimes in each town, to request information about them. Given that there are ~20,000 municipalities in the United States, a thorough check would take an unreasonable investment of days.
Before mugshot sites, you had to actually visit each state’s database, figure out how to query it, and assemble the results. Now we’re looking at an investment of hours, instead of days. It’s possible, but you must be highly motivated.
Now you just search, and this information is there. It is just as public as it was before, but the cost to access has become a matter of seconds, not hours or days, and we could imagine that you might be googling your date to find something else about him and instead stumble on the mugshot image. The cost of accessing the data is so trivial that it can come up as part of an adjacent task.
The debate around fixing this problem has focused on whether the data should be removed from the public entirely. I’d like to see this conversation reframed around friction: how do we keep the cost of accessing technically public data high enough that it is no longer economically feasible to run these sorts of aggregated extortion sites, while still allowing journalists and concerned citizens to explore the records as necessary for their work?
Data engineering is when the architecture of your system is dependent on characteristics of the data flowing through that system.
It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it.
Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data before you design the system (and perhaps even an active experimental process, where you try multiple infrastructure options in the wild before making a final decision). I’ve seen numerous people run straight into walls when they ignore this research requirement.
Forget Table is one example of a data engineering project from our work at bitly. It’s a database for storing non-stationary categorical distributions. We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on github).
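To make the idea of a drifting categorical distribution concrete, here is a minimal Python sketch. It is not Forget Table's implementation; it only illustrates the general approach of letting old observations fade so that the distribution tracks recent data. The class name, the `half_life` parameter, and the use of deterministic exponential decay are all assumptions made for this example.

```python
import math
from collections import defaultdict


class DecayingDistribution:
    """Categorical counts that fade over time (illustrative, not Forget Table itself)."""

    def __init__(self, half_life):
        # half_life (hypothetical parameter): time, in the caller's units,
        # for an observation to lose half of its weight.
        self.rate = math.log(2) / half_life
        self.counts = defaultdict(float)
        self.last_update = 0.0

    def _decay_to(self, now):
        # Scale every stored count by exp(-rate * elapsed time).
        elapsed = now - self.last_update
        if elapsed > 0:
            factor = math.exp(-self.rate * elapsed)
            for key in self.counts:
                self.counts[key] *= factor
            self.last_update = now

    def observe(self, category, now):
        # Fade existing counts first, then record the new observation at full weight.
        self._decay_to(now)
        self.counts[category] += 1.0

    def distribution(self, now):
        # Normalize the decayed counts into probabilities.
        self._decay_to(now)
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {key: value / total for key, value in self.counts.items()}


tracker = DecayingDistribution(half_life=10)
tracker.observe("a", now=0)
tracker.observe("b", now=10)  # "a" has lost half its weight by this point
current = tracker.distribution(now=10)
```

A shorter half-life makes the distribution forget faster, which is the kind of knob the post describes as configuring the rate of change for a particular dataset.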
This post assumes that you want to share data. If you’re not convinced, don’t worry — that’s next on my list.
You and your academic colleagues will benefit from having at least a quick chat about the research questions they want to address. I’ve read every paper I’ve been able to find that uses bitly data and all of the ones that acquired the data without our assistance had serious flaws, generally based on incorrect assumptions about the data they had acquired (this, unfortunately, makes me question the validity of most research done on commercial social data without cooperation from the subject company).
The easiest way to share data is through your own API. Set generous rate limits where possible. Most projects are not realtime and they can gather the data (or, more likely, have a grad student gather the data) over the course of weeks. Using the API also has the side-effect of having the researchers bound only by your ToS.
When it’s not practical to use your API, create a data dump. At bitly, we have one open dataset, the 1.usa.gov data (clicks on 1.usa.gov links, stripped of all personally identifiable data). If I’m approached by students who are looking for a dataset for a homework project or someone who just wants to play, I point them there.
Often, though, researchers are investigating specific questions that require a specific sample of the data. We’ve worked with people investigating the spread of malware, studying the use of social networks during the Arab Spring, and looking at how effective the media was during the Fukushima crisis in Japan, to name just a few examples.
(Academics: every so often someone asks for “a copy of your database”. That’s not feasible for technical or business reasons. It’s more productive to describe your work and how you think the data would help. We’ll figure out what’s possible.)
When we do create and share a dataset, the data we share is always stripped of any potentially identifying information. Still, we require researchers to sign an NDA. The NDA basically states:
- You will not share this data with anyone.
This is not ideal for the research community, but necessary, according to our lawyers. We simply ask anyone who wants a copy of the same dataset for academic pursuits to sign the same NDA.
- You may publish whatever you like.
Academic freedom FTW.
- We reserve the right to use anything that you invent while using our data.
This is actually not because we expect people to come up with anything commercially useful to us that we haven’t already thought of, but to protect us should someone assert later that we used one of their ideas in a product.
And that’s it! To summarize, how to share data, in order of preference:
- Your API
- A public dataset
- A not-public custom-built and legally protected dataset
Now, go out, share, and research.
Data scientists need data, and good data is hard to find. I put together this bitly bundle of research quality data sets to collect as many useful data sets as possible in one place. The list includes such exciting and diverse things as spam, belly buttons, item pricing, social media, and face recognition, so you know there’s something that will intrigue anyone.
Have one to add? Let me know!
(I’ve shared the bundle before, but this post can act as unofficial homepage for it.)
We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.
First, we share the analysis that we do at the link level. Every developer using data from the web has the same set of problems — what are the topics of those URLs? What are their keywords? Why should you rebuild this infrastructure when we’ve done it already? We’ve also added in a few bits of bitly magic — for example, you can use the /v3/link/location endpoint to see where in the world people are consuming that information from.
Second, we’ve opened up access to a realtime search engine. That’s an actual search engine that returns results ranked by current attention and popularity. Links are only retained for 24 hours, so you know that anything you see is actively receiving attention. If you think of bitly as a stream of stories that people are paying attention to, this search API offers you the ability to filter the stream by criteria like domain, topic, or location (“food” links from Brooklyn is one of my favorites) and pull out the content, in realtime, that meets your criteria. You can test it out with a human-friendly interface at rt.ly.
Finally, we asked the question — what is the world paying attention to right now? We have a system that tracks the rate of clicks – a proxy for attention – on phrases contained within the URLs being clicked through bitly. Then we can look and see which phrases are currently receiving a disproportionate amount of attention. We call these “bursting phrases”, and you can access them with the /v3/realtime/bursting_phrases endpoint. It’s analogous to Twitter’s trending topics, but based on attention (what people do), not shares (what they say), and across the entire social web.
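As a rough illustration of what "a disproportionate amount of attention" could mean, the Python sketch below compares each phrase's recent click rate with its longer-term baseline rate and keeps the phrases whose ratio is high. This is not bitly's algorithm: the function name, the thresholds, the smoothing term, and the sample counts are all invented for the example.

```python
def bursting_phrases(recent, baseline, recent_hours, baseline_hours,
                     min_ratio=3.0, smoothing=1.0):
    """Return (phrase, ratio) pairs whose recent click rate far exceeds the baseline rate.

    recent / baseline: dicts mapping phrase -> click count in each window.
    All parameter names and defaults are assumptions for this sketch.
    """
    results = []
    for phrase, clicks in recent.items():
        recent_rate = clicks / recent_hours
        # Smoothing keeps never-before-seen phrases from dividing by zero.
        baseline_rate = (baseline.get(phrase, 0) + smoothing) / baseline_hours
        ratio = recent_rate / baseline_rate
        if ratio >= min_ratio:
            results.append((phrase, ratio))
    # Most disproportionate phrases first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Made-up counts: clicks in the last hour vs. the previous 24 hours.
last_hour = {"earthquake": 120, "weather": 50}
previous_day = {"earthquake": 47, "weather": 1199}
bursts = bursting_phrases(last_hour, previous_day, recent_hours=1, baseline_hours=24)
```

With these toy numbers, "weather" is clicked often but at its usual rate, so only "earthquake" is reported as bursting.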
I’m extremely excited to see what people build with these tools.
I was visiting my grandparents yesterday, and my grandfather asked for help e-mailing an article to some of his friends. I asked him to show me how he normally writes an e-mail, and taught him the magic of copy and paste (it is amazing if you haven’t seen it before). I noticed, though, that in the course of sending an e-mail and checking his inbox, he clicked on this ad three times.
When I asked about it, he didn’t realize he had clicked the ad — he just thought these screens popped up randomly — because he didn’t realize that his hands were shaking on the trackpad.
I’m sure the data says that that’s the optimal place on the screen for the ad. I’m sure tons of people ‘click’ on it. I’m also sure it’s wrong, and it results in a terrible experience.
It’s common sense, but experiences like this are great reminders that data only takes us so far, and creativity and clear thinking are always required to find the best solutions.
Yahoo, please fix this!
I just got home from the Web 2.0 Summit, a three-day conference that was packed with announcements, interesting ideas, and good conversations.
My short talk, The Secrets of our Data Subconscious, touches on how the data we generate online interacts with the physical world spatially and through time, and on the relationships between the things we consume (in private) and the things we broadcast (in public).
I gave a talk called A Data-driven Look at the Realtime Web Ecosystem at the Web2Expo SF conference in May in San Francisco. I attempted to highlight some of the interesting facets of the bit.ly data set, and it appeared to be well-received (showing up on TechCrunch, ZDNet, and a few other places).
I attended the full conference, and it was great. The attendees were extremely international and I met a ton of fascinating people.
I’m still getting a couple of e-mail requests per week for my slides and materials, so they’re posted below for posterity.
And the video:
As always, I welcome your questions or comments.
I’ve found myself in need of a name distribution for a few projects recently, so I thought I would post it here so I won’t have to go looking for it again.
The data is available from the US Census Bureau (from the 1990 census) here, and I have it here in a friendly MySQL *.sql format (it will create the tables and insert the data). There are three tables: male first names, female first names, and surnames.
I’ve noted several issues in the data that are likely the result of typos, so make sure to do your own validation if your application requires it.
The format is simple:
- the name
- frequency (percentage of people in the sampled population with that name)
- cumulative frequency (as you read down the list, the percentage of total population covered)
If you want to use this to generate a random name, you can do so very easily with a query like this:
SELECT name FROM ref_census_surnames n ORDER BY (RAND() * (n.freq + .01)) LIMIT 0,1;
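If you would rather do the sampling in application code, and assuming the goal is to draw names in proportion to how common they are, the cumulative frequency column makes that straightforward. The Python sketch below is one way to do it; the function name, the row layout, and the sample rows are assumptions for illustration and are not taken from the census data.

```python
import bisect
import random


def pick_name(rows, rng=random):
    """Pick a name with probability proportional to its frequency.

    rows: list of (name, cumulative_frequency) tuples, sorted by
    cumulative_frequency in ascending order (hypothetical layout).
    rng: any object with a random() method returning a float in [0, 1).
    """
    cumulative = [cum for _, cum in rows]
    # Scale by the final cumulative value, since the list may not reach exactly 100.
    target = rng.random() * cumulative[-1]
    # The first row whose cumulative frequency exceeds the target is the pick.
    index = bisect.bisect_right(cumulative, target)
    return rows[min(index, len(rows) - 1)][0]


# Illustrative rows only: "A" covers 50% of people, "B" 30%, "C" 20%.
sample_rows = [("A", 50.0), ("B", 80.0), ("C", 100.0)]
name = pick_name(sample_rows)
```

Because each name owns a slice of the cumulative range as wide as its own frequency, common names come up proportionally more often than rare ones.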
Download it here: census_names.tar.gz
I gave a talk at the NYC Python Meetup on July 29 on Practical Data Analysis in Python.
I tend to use my slides for visual representations of the concepts I’m discussing, so there’s a lot of content that was in the presentation that you unfortunately won’t see here.
The talk starts with the immense opportunities for knowledge derived from data. I spent some time showing data systems ‘in the wild’ along with the appropriate algorithmic vocabulary (for example, amazon.com’s ‘books you might like’ feature is a recommender system).
Once we can describe the problems properly, we can look for tools, and Python has many! Finally, in the fun part of the presentation, I demoed working code that uses NLTK to build a Twitter spam filter with 90% accuracy*.
Please let me know if you have questions or comments.
* I’ll post the code and training data shortly