Data scientists need data, and good data is hard to find. I put together this bitly bundle of research quality data sets to collect as many useful data sets as possible in one place. The list includes such exciting and diverse things as spam, belly buttons, item pricing, social media, and face recognition, so you know there’s something that will intrigue anyone.
Have one to add? Let me know!
(I’ve shared the bundle before, but this post can act as unofficial homepage for it.)
We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.
First, we share the analysis that we do at the link level. Every developer using data from the web has the same set of problems — what are the topics of those URLs? What are their keywords? Why should you rebuild this infrastructure when we’ve done it already? We’ve also added in a few bits of bitly magic — for example, you can use the /v3/link/location endpoint to see where in the world people are consuming that information from.
Second, we’ve opened up access to a realtime search engine. That’s an actual search engine that returns results ranked by current attention and popularity. Links are only retained for 24 hours, so you know that anything you see is actively receiving attention. If you think of bitly as a stream of stories that people are paying attention to, this search API offers you the ability to filter the stream by criteria like domain, topic, or location (“food” links from Brooklyn is one of my favorites) and pull out the content, in realtime, that meets your criteria. You can test it out with a human-friendly interface at rt.ly.
Finally, we asked the question — what is the world paying attention to right now? We have a system that tracks the rate of clicks – a proxy for attention – on phrases contained within the URLs being clicked through bitly. Then we can look and see which phrases are currently receiving a disproportionate amount of attention. We call these “bursting phrases”, and you can access them with the /v3/realtime/bursting_phrases endpoint. It’s analogous to Twitter’s trending topics, but based on attention (what people do), not shares (what they say), and across the entire social web.
I’m extremely excited to see what people build with these tools.
I’ve found myself in need of a name distribution for a few projects recently, so I thought I would post it here so I won’t have to go looking for it again.
The data is available from the US Census Bureau (from 1990 census) here, and I have it here in a friendly MySQL *.sql format (it will create the tables and insert the data). There are three tables: male first names, female first names, and surnames.
I’ve noted several issues in the data that are likely the result of typos, so make sure to do your own validation if your application requires it.
The format is simple:
- the name
- frequency (percentage of people in the sampled population with that name)
- cumulative frequency (as you read down the list, the percentage of total population covered)
If you want to use this to generate a random name, you can do so very easily with a query like this:
SELECT name FROM ref_census_surnames n ORDER BY (RAND() * (n.freq + .01)) LIMIT 0,1;
Download it here: census_names.tar.gz