Data: first and last names from the US CensusPosted: October 16, 2009 | Author: hilary | Filed under: blog | Tags: data, dataset, mysql, sql | 1 Comment »
I’ve found myself in need of a name distribution for a few projects recently, so I thought I would post it here so I won’t have to go looking for it again.
The data is available from the US Census Bureau (from 1990 census) here, and I have it here in a friendly MySQL *.sql format (it will create the tables and insert the data). There are three tables: male first names, female first names, and surnames.
I’ve noted several issues in the data that are likely the result of typos, so make sure to do your own validation if your application requires it.
The format is simple:
- the name
- frequency (percentage of people in the sampled population with that name)
- cumulative frequency (as you read down the list, the percentage of total population covered)
If you want to use this to generate a random name, you can do so very easily with a query like this:
SELECT name FROM ref_census_surnames n ORDER BY (RAND() * (n.freq + .01)) LIMIT 0,1;
Download it here: census_names.tar.gz