Startups: How to Share Data with Academics

This post assumes that you want to share data. If you’re not convinced, don’t worry — that’s next on my list. :)

You and your academic colleagues will benefit from having at least a quick chat about the research questions they want to address. I’ve read every paper I’ve been able to find that uses bitly data and all of the ones that acquired the data without our assistance had serious flaws, generally based on incorrect assumptions about the data they had acquired (this, unfortunately, makes me question the validity of most research done on commercial social data without cooperation from the subject company).

The easiest way to share data is through your own API. Set generous rate limits where possible. Most projects are not realtime and they can gather the data (or, more likely, have a grad student gather the data) over the course of weeks. Using the API also has the side-effect of having the researchers bound only by your ToS.

When it’s not practical to use your API, create a data dump. At bitly, we have one open dataset, the data (clicks on links, stripped of all personally identifiable data). If I’m approached by students who are looking for a dataset for a homework project or someone who just wants to play, I point them there.

Often, though, researchers are investigating specific questions that require a specific sample of the data. We’ve worked with people investigating the spread of malware, studying the use of social networks during the Arab Spring, looking at how effective the media was during the Fukushima crisis in Japan, just for a few examples.

(Academics: every so often someone asks for “a copy of your database”. That’s not feasible for technical or business reasons. It’s more productive to describe your work and how you think the data would help. We’ll figure out what’s possible.)

When we do create and share a dataset, the data we share is always stripped of any potentially identifying information. Still, we require researchers to sign an NDA. The NDA basically states:

  1. You will not share this data with anyone.

    This is not ideal for the research community, but necessary, according to our lawyers. We simply ask anyone who wants a copy of the same dataset for academic pursuits to sign the same NDA.

  2. You may publish whatever you like.

    Academic freedom FTW.

  3. We reserve the right to use anything that you invent while using our data.

    This is actually not because we expect people to come up with anything commercially useful to us that we haven’t already thought of, but to protect us should someone assert later that we used one of their ideas in a product.

And that’s it! To summarize, how to share data, in order of preference:

  1. Your API
  2. A public dataset
  3. A not-public custom-built and legally protected dataset

Now, go out, share, and research.

Conference: Search and Social Media 2010

I recently attended the Third Annual Workshop on Search and Social Media, an academic workshop with very strong industry participation. The workshop was packed, and had some of the most informative and interesting panel discussions I’ve seen (not counting the one I spoke on!).

Daniel Tunkelang did a great job of writing up the specific presentations on his site and on the ACM blog, so I won’t attempt to re-create the presentations line by line at this late date. Rather, I’d like to highlight a few open problems and research questions that came out of the discussions that I hope to see developed in the next year.

Social search consists of a set of problems including (but hardly limited to) search of social content like status updates, real-time search, generating, labeling, and finding user-generated content, ‘long-tail’ events and interests, finding vs re-finding, and trend identification.

What data is available to social search? There are many kinds of social data, from e-mail (private) to blogs (public) and tweets (mostly public) — what is and should be searchable? How do we handle issues of privacy and identity management?

How do we compute relevance, taking into account freshness, accuracy, and degrees of social separation?

Will the architecture of these search engines look like the search engines we’re currently familiar with?

How do we evaluate accuracy and truthiness of social data?

How do we characterize social connections, around concepts like strong vs weak ties, and friend-of-a-friend vs friend-of-a-friend’s-friend? Can we converge on a single social graph representation?

How do we best filter social data to lead to accurate recommendations for content discovery? How do we accommodate the fact that as we move beyond static factual data, two people using the same query may be looking for very different results?

Finally, how do we deal with the chasm between the industry participants (who have LOTS of data) and the academic participants, who suffer from a lack of public (and publishable) data?

Thanks again to the organizers – Eugene Agichtein, Marti Hearst, Ian Soboroff, and Daniel Tunkelang – who put together a fantastic event.

For more on this and a cool demo, check out Gene Golovchinsky‘s look at the SSM2010 twitter coverage.