Startups: How to Share Data with Academics

This post assumes that you want to share data. If you’re not convinced, don’t worry — that’s next on my list. :)

You and your academic colleagues will benefit from having at least a quick chat about the research questions they want to address. I’ve read every paper I’ve been able to find that uses bitly data, and all of the ones that acquired the data without our assistance had serious flaws, generally stemming from incorrect assumptions about the data they had acquired. (This, unfortunately, makes me question the validity of most research done on commercial social data without cooperation from the subject company.)

The easiest way to share data is through your own API. Set generous rate limits where possible. Most projects are not realtime, so researchers can gather the data (or, more likely, have a grad student gather it) over the course of weeks. Using the API also has the side effect that the researchers are bound only by your ToS.
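For the researcher's side of this, here is a minimal sketch of what slow, polite collection through an API can look like. It is not bitly's client or anyone's official tooling: `paced_fetch`, its parameters, and the injectable `clock` and `sleep` hooks are all names made up for illustration, and the actual request function is left to the caller.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def paced_fetch(
    items: Iterable[T],
    fetch: Callable[[T], R],
    min_interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[R]:
    """Call fetch(item) for each item, leaving at least min_interval
    seconds between the start of consecutive calls."""
    last_call: Optional[float] = None
    for item in items:
        if last_call is not None:
            # Wait out whatever remains of the interval since the last call.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        yield fetch(item)
```

A caller would pass its own request function, for example `paced_fetch(short_urls, lookup_clicks, min_interval=1.0)`, where `lookup_clicks` is a hypothetical function that makes one API request. Because the generator is lazy, a collection job can be stopped and resumed without any requests being made ahead of time.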

When it’s not practical to use your API, create a data dump. At bitly, we have one open dataset, the 1.usa.gov data (clicks on 1.usa.gov links, stripped of all personally identifiable data). If I’m approached by students who are looking for a dataset for a homework project or someone who just wants to play, I point them there.
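As a rough sketch of how a dump like this could be prepared, the snippet below keeps only an explicit allowlist of fields from each click record and writes one JSON object per line. The field names in `SAFE_FIELDS` are hypothetical, not the actual 1.usa.gov schema, and `strip_record` and `export_dump` are illustrative names rather than anything we run in production.

```python
import json
from typing import Iterable, Iterator

# Hypothetical field names; substitute whatever your own click records contain.
SAFE_FIELDS = {"short_url", "long_url", "timestamp", "country"}


def strip_record(record: dict) -> dict:
    """Keep only allowlisted fields, so any field added to the
    schema later is excluded from the dump by default."""
    return {key: value for key, value in record.items() if key in SAFE_FIELDS}


def export_dump(records: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON line per record with non-allowlisted fields removed."""
    for record in records:
        yield json.dumps(strip_record(record), sort_keys=True)
```

An allowlist is used instead of a blocklist so that forgetting to update the list fails safe. Dropping fields is only a first step, though: values in the fields you keep (a long URL with a username in the query string, say) can still identify someone and need their own review.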

Often, though, researchers are investigating specific questions that require a specific sample of the data. We’ve worked with people investigating the spread of malware, studying the use of social networks during the Arab Spring, and looking at how effective the media was during the Fukushima crisis in Japan, to name just a few examples.

(Academics: every so often someone asks for “a copy of your database”. That’s not feasible for technical or business reasons. It’s more productive to describe your work and how you think the data would help. We’ll figure out what’s possible.)

When we do create and share a dataset, it is always stripped of any potentially identifying information. Still, we require researchers to sign an NDA. The NDA basically states:

  1. You will not share this data with anyone.

    This is not ideal for the research community, but necessary, according to our lawyers. We simply ask anyone who wants a copy of the same dataset for academic pursuits to sign the same NDA.

  2. You may publish whatever you like.

    Academic freedom FTW.

  3. We reserve the right to use anything that you invent while using our data.

    This is actually not because we expect people to come up with anything commercially useful to us that we haven’t already thought of, but to protect us should someone assert later that we used one of their ideas in a product.

And that’s it! To summarize, here is how to share data, in order of preference:

  1. Your API
  2. A public dataset
  3. A not-public custom-built and legally protected dataset

Now, go out, share, and research.


  • http://twitter.com/edwelker Eddie Welker

    That’s great. I really appreciate your hard work being open and allowing access for reasons such as this. Thanks!

  • http://twitter.com/kdnuggets Gregory Piatetsky

    Great post! Is 1.usa.gov data suitable for a first course in data mining?

    • http://www.hilarymason.com Hilary Mason

      Yes, definitely, especially if there’s a focus on streaming data (there’s both a dump and a realtime stream available) or time-series analysis. The data is one record per click on 1.usa.gov URLs, including short and long URLs, timestamp, geo-location, user-agent and more.

  • http://pigsonthewing.org.uk Andy Mabbett / @pigsonthewing

    It would be great if you could stipulate that the results of any research conducted with your data must be published in an open access manner (i.e. not behind a paywall).

    • http://www.hilarymason.com Hilary Mason

      That’s a great idea.

  • David Y

    I’d love to hear your thoughts on how to convince higher-ups at your company (especially legal and other non-data folks) that it’s a good idea, or at least not a bad one, to share similar kinds of data.

    I’ve had to turn away academics even though I’ve made impassioned requests to share some of our data that wouldn’t have given anyone, even if it were leaked, anything secret or proprietary.

    Hoping that future post can help.
