Startups: How to Share Data with Academics

This post assumes that you want to share data. If you’re not convinced, don’t worry — that’s next on my list. :)

You and your academic colleagues will benefit from having at least a quick chat about the research questions they want to address. I’ve read every paper I’ve been able to find that uses bitly data, and all of the ones whose authors acquired the data without our assistance had serious flaws, generally stemming from incorrect assumptions about the data they had acquired. (This, unfortunately, makes me question the validity of most research done on commercial social data without cooperation from the subject company.)

The easiest way to share data is through your own API. Set generous rate limits where possible. Most projects are not realtime, so researchers can gather the data (or, more likely, have a grad student gather the data) over the course of weeks. A side effect of using the API is that the researchers are bound only by your ToS.
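For the researcher's side of this, here is a minimal sketch of a slow, polite collector that pages through an API without ever exceeding a fixed request pace. Everything in it is an assumption for illustration: `fetch_page` is a hypothetical callable you would write around whatever API you are using, and the cursor-based paging is just one common pattern, not a description of bitly's API.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical shape: given a cursor (None for the first page), return
# (records, next_cursor), with next_cursor None when there are no more pages.
FetchPage = Callable[[Optional[str]], Tuple[List[Dict], Optional[str]]]


def gather_pages(
    fetch_page: FetchPage,
    min_interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Dict]:
    """Yield records page by page, calling fetch_page at most once
    every min_interval_seconds."""
    cursor: Optional[str] = None
    last_call: Optional[float] = None
    while True:
        if last_call is not None:
            # Wait out whatever is left of the interval since the last request.
            wait = min_interval_seconds - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break
```

Because the pace is a parameter, a grad student can set it comfortably below the published rate limit and let the job run for days. The `sleep` and `clock` arguments exist only so the pacing can be tested without actually waiting.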

When it’s not practical to use your API, create a data dump. At bitly, we have one open dataset: clicks on links, stripped of all personally identifiable data. If I’m approached by students who are looking for a dataset for a homework project or someone who just wants to play, I point them there.
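As a rough sketch of the "stripped of all personally identifiable data" step, one option is an allowlist: name the fields that are safe to publish and drop everything else. The field names below are hypothetical and are not bitly's actual schema. Which fields count as safe is a decision for you and your lawyers, not something this code can settle.

```python
import json
from typing import Dict, Iterable, TextIO

# Hypothetical allowlist: only these fields ever reach the public dump.
SAFE_FIELDS = ("link_id", "clicked_at", "country", "referrer_domain")


def strip_record(record: Dict) -> Dict:
    """Keep only allowlisted fields; anything else (IP, cookie, ...) is dropped."""
    return {key: record[key] for key in SAFE_FIELDS if key in record}


def write_dump(records: Iterable[Dict], out: TextIO) -> int:
    """Write stripped records as JSON lines and return how many were written."""
    count = 0
    for record in records:
        out.write(json.dumps(strip_record(record), sort_keys=True) + "\n")
        count += 1
    return count
```

An allowlist fails safe: if someone adds a new field to the click log next year, it stays out of the dump until a person decides it belongs there.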

Often, though, researchers are investigating specific questions that require a specific sample of the data. We’ve worked with people investigating the spread of malware, studying the use of social networks during the Arab Spring, and looking at how effective the media was during the Fukushima crisis in Japan, to name just a few examples.

(Academics: every so often someone asks for “a copy of your database”. That’s not feasible for technical or business reasons. It’s more productive to describe your work and how you think the data would help. We’ll figure out what’s possible.)

When we do create and share a dataset, the data we share is always stripped of any potentially identifying information. Still, we require researchers to sign an NDA. The NDA basically states:

  1. You will not share this data with anyone.

    This is not ideal for the research community, but necessary, according to our lawyers. We simply ask anyone who wants a copy of the same dataset for academic pursuits to sign the same NDA.

  2. You may publish whatever you like.

    Academic freedom FTW.

  3. We reserve the right to use anything that you invent while using our data.

    This is actually not because we expect people to come up with anything commercially useful to us that we haven’t already thought of, but to protect us should someone assert later that we used one of their ideas in a product.

And that’s it! To summarize, how to share data, in order of preference:

  1. Your API
  2. A public dataset
  3. A non-public, custom-built, legally protected dataset

Now, go out, share, and research.