Need Data? Start Here

Data scientists need data, and good data is hard to find. I put together this bitly bundle of research-quality data sets to collect as many useful data sets as possible in one place. The list includes such exciting and diverse things as spam, belly buttons, item pricing, social media, and face recognition, so there’s something in it to intrigue everyone.

Have one to add? Let me know!

(I’ve shared the bundle before, but this post can act as an unofficial homepage for it.)


  • http://twitter.com/dip4fish Jean-Patrick Pommier

    Thank you,
    Multispectral cytogenetic images (MFISH) are available here: https://github.com/jeanpat/MFISH

  • http://blog.grapesmoker.com/ Jerry Vinokurov

    Nice!

  • http://twitter.com/amy8492 Amy

    Yeah, thanks. I’ll probably go for the 2 GB of cats when I have time :-).

  • what_i_am_thinking_rightnow

    Hello Miss Hilary Mason, what book would you recommend for complete newbies to get started in Data Science and Predictive Analytics?

    Thank You!

  • http://pafnuty.wordpress.com/ Aman

    Thanks for this list, Hilary. 

    I started my own recently (http://eda.fenristech.com/PublicDataSets) and will have to add to that a couple from your list that are especially interesting to me. :) 

  • Jonathan Cachat

    I am curious what you would do with a massive collection of heterogeneous scientific data, say http://www.neuinfo.org

  • http://partiallattice.wordpress.com/ Daniel Smith

    A source of health expenditure data is the MEPS survey:

    http://meps.ahrq.gov/mepsweb/

  • Arturo

    The main source of data on microfinance (financial and social indicators) is http://www.mixmarket.org. No password is required.

  • datapants

    Hi Hilary. How ’bout collecting all your datasets and wearing them in your datapants? Would you be interested in the domain name datapants.com, databra.com or databadge.com? alan@nothing.con

  • Guest

    I am going to use that belly button biodiversity dataset!

  • http://www.johnyetter.com/ John Yetter

    I am going to use that belly button biodiversity dataset! I will probably take a look at the loan dataset, too, but it is not as fun.

  • kekline

    Very nice! I also point people to this post on Quora where a ton of public, governmental and NGO data sets are cataloged: http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public.