I’m gathering a bundle of data science blogs to share. I’m looking to include blogs that update regularly and are neither personal opinion and project blogs (like this one) nor primarily about marketing any particular company. Let me know if you have a favorite that I’ve forgotten.
If you’re just looking for one place to start, hop on over to Simply Statistics.
You should be speaking at conferences.
Not an extrovert? Great. Speaking is for introverts!
We go to conferences to meet people (and learn things from people and find opportunities… from people). Meeting people at events takes a lot of energy, especially if you don’t look like the average dude at a conference. You have to explain your story to every single person you talk to, listen to theirs, and try to see if you have overlapping interests. It’s inefficient and takes a lot of time.
By being a speaker, you can tell your story just once, to everyone, and the people who are excited about what you have to say will come find you. You will actually save energy if you get up on stage.
It’s a great hack.
Before you say, “fine, but I’m not good at speaking”, please take a look at this:
People who are way less intelligent than you give excellent talks every day (you might not agree with what they say, but do try to appreciate the skilled delivery). If they can learn to do it, you can learn to do it.
A few years ago, I decided to learn how to speak. I started by studying people whose techniques I admired, and distilling their techniques down into algorithms that I can understand and try to apply to my own presentations. I’m very much a student but have really enjoyed talking to people about giving talks, so I’m going to do an experiment and post one speaking hack per week here on my blog on Fridays. Let me know what you think.
It’s easy to believe that other people use social networks in the same way that you do. Your friends largely do use them the same way, which gives you an even more biased perspective.
Unfortunately, most networks don’t provide a way to explore representative communications that you’re not connected to.
Well, now you can! One random tweet, please.
Update: There were some slight technical difficulties due to hitting Twitter’s oembed rate limit. They should be repaired now.
(Note: between this and bookbookgoose.com I’m on a bit of a random kick lately. There’s a method to this madness!)
I ended up at NYC Resistor on Sunday, and decided to experiment with physical visualization of some data. I grabbed the clicks per second on keyphrases including my name (“hilary mason”) over the last six months, aggregated them by day, and made this graph:
This is easy enough to construct for any phrase using the clickrate data that we’re calculating at bitly. I exported it from matplotlib in svg, added a label, and used the laser-cutter to create this out of plywood:
…which will shortly be adorning my desk at work. This is very simple, but there’s a lot of fun to be had with the physical manifestation of patterns we see in large amounts of ephemeral data.
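For anyone who wants to try something similar, the aggregate-and-export step might look roughly like the sketch below. It is not the code I ran: the sample data is made up, the function names are hypothetical, and it assumes you already have (unix timestamp, click count) pairs from somewhere. Only the plotting function needs matplotlib.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_by_day(samples):
    """Sum (unix_timestamp, clicks) samples into a {date: total_clicks} dict."""
    totals = defaultdict(int)
    for timestamp, clicks in samples:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        totals[day] += clicks
    # Sort by date so the plot runs left to right in time order.
    return dict(sorted(totals.items()))


def save_svg(daily, path, label):
    """Plot daily totals and write them to an SVG file."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    days = list(daily)
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.plot(days, [daily[day] for day in days])
    ax.set_title(label)
    fig.savefig(path, format="svg")
    plt.close(fig)


# Made-up sample data: (unix timestamp, clicks seen in that second).
samples = [(0, 2), (60, 3), (86400, 5), (86410, 1)]
daily = aggregate_by_day(samples)

# To write the file (requires matplotlib):
# save_svg(daily, "clicks.svg", "hilary mason")
```

From there the SVG can be opened in a vector editor to add a label before it goes to the laser cutter.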
I have a Google alert set up for my name, and over the weekend it sent me here.
Update: Bing has removed the page and now redirects to a regular search.
It’s a page on Bing Celebrities, merging my information with information about Hilary Mason, the (now deceased) British actress. According to this page, I have starred in movies before I was born and made videos after I died. It’s my photo and her filmography.
It’s creepy, but it’s also intriguing. How does this happen?
The data is credited to AMG and inbaseline, whose domain, though linked directly from Bing, does not resolve. Entity disambiguation is certainly a challenge, but I expect more from Microsoft, with so much data and so many brains.
This kind of error makes it extremely clear that identity is not a solved problem, and that people are especially sensitive to errors about themselves. I’ve written a bit about identity slippage before.
This isn’t the first time a search engine has confused me with the other Hilary Mason, except the first time was cuil (remember that?) and it was her photo as Ugly Hag and my bio. I’ll take it Bing’s way, thank you!
Last week I wrote a bit about how to share data with academics. This is the complementary piece, on why you should invest the time and energy in sharing your data with the academic community.
As I was talking to people about this topic it became clear that there are really two different questions people ask. First, why do this at all? And second, what do I tell my boss?
Let’s start with the second one. This is what you should tell your boss:
- Academic research based on our work is a great press opportunity and demonstrates that credible people outside of our company find our work interesting.
- Having researchers work on our data is an easy way to access highly educated brainpower, for free, that in no way competes with us. Who knows what interesting stuff they’ll come up with?
- Personal relationships with university faculty are the absolute best way to recruit talent. If we invest a little bit of time in building a strong relationship with this professor, she’ll know the kind of people we’re looking for and send us her best students.
All of these points are valid, but they aren’t complete. As a startup, you’re most likely building a product at the intersection of a just-now-possible technology and a most-likely-ready market. The further the research in your field moves, the greater the number of possible futures for your company. Further, the greater the awareness of your type of technology in the community, the larger the market is likely to actually be.
Your company is one piece of a complex system, and the more robust that system becomes, the more possibilities there are for you. Share data, and you make the world a more interesting place in a direction that you’re interested in.
The introductory e-mail is a message where I introduce two (or more) people who have yet to meet each other. It generally takes the highly structured form, “Salutation A and B! A does X. B does Y. You should meet for reason Z. Valediction.”
I almost always do opt-in intros, where I’ll write to each party separately and make sure it’s okay if I share their information, and explain why I think it’s worth their time. I find this approach to be more respectful of people’s privacy and busy schedules.
That means that by the time they get the formal introduction, they generally know what’s going on. Still, I find these messages peculiarly stressful.
Stressful task? Check. Highly-structured output? Check. Repeating the same information over and over? Check. This calls for … a script!
You can grab the code on GitHub here.
The first step is to set your valediction and names in the settings.py file, then to add people that you want to introduce with a brief description. Finally, you need only type something like:
python intro.py alan betty
to generate and copy to your clipboard (on a Mac, anyway):
Alan & Betty, please meet. Alan is the fake director at fake company, where he does fake things. Betty is the fake person who does other fake interesting things. I think you'll find quite a lot to talk about. Cheers, Hilary
Paste it into your favorite e-mail client, send, and relax.
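If you’re curious how little is involved, here is a minimal sketch of the idea. It is not the code in the repository: the constants stand in for whatever settings.py actually holds, the function names are hypothetical, and the line breaks in the output are my assumption. The clipboard step shells out to pbcopy, which is why it only works on a Mac.

```python
import shutil
import subprocess

# Hypothetical stand-ins for the values the post says live in settings.py.
VALEDICTION = "Cheers,"
SENDER = "Hilary"
CLOSING = "I think you'll find quite a lot to talk about."
PEOPLE = {
    "alan": ("Alan", "the fake director at fake company, where he does fake things"),
    "betty": ("Betty", "the fake person who does other fake interesting things"),
}


def build_intro(keys):
    """Fill the 'salutation, A does X, B does Y, reason, valediction' template."""
    entries = [PEOPLE[key] for key in keys]
    lines = [" & ".join(name for name, _ in entries) + ", please meet."]
    lines += [f"{name} is {description}." for name, description in entries]
    lines += [CLOSING, f"{VALEDICTION} {SENDER}"]
    return "\n".join(lines)


def copy_to_clipboard(text):
    """Copy text using macOS's pbcopy; return False where pbcopy is unavailable."""
    if shutil.which("pbcopy") is None:
        return False
    subprocess.run(["pbcopy"], input=text.encode("utf-8"), check=True)
    return True


intro = build_intro(["alan", "betty"])
print(intro)
```

Keeping the people in a dictionary keyed by short names is what makes the command line so terse: each argument is just a lookup.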
This is how I mentally organize introductions, but I have no idea if it’ll work for anyone else. Would you ever use something like this? What does it need to be useful for you?
This post assumes that you want to share data. If you’re not convinced, don’t worry — that’s next on my list.
You and your academic colleagues will benefit from having at least a quick chat about the research questions they want to address. I’ve read every paper I’ve been able to find that uses bitly data and all of the ones that acquired the data without our assistance had serious flaws, generally based on incorrect assumptions about the data they had acquired (this, unfortunately, makes me question the validity of most research done on commercial social data without cooperation from the subject company).
The easiest way to share data is through your own API. Set generous rate limits where possible. Most projects are not realtime and they can gather the data (or, more likely, have a grad student gather the data) over the course of weeks. Using the API also has the side-effect of having the researchers bound only by your ToS.
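On the researcher’s side, staying under a rate limit can be as simple as spacing requests out. The sketch below is only an illustration under stated assumptions: `fetch` is a hypothetical stand-in for whatever API call a project needs (no real endpoint is used here), and the limit is a number you would take from the provider’s documentation.

```python
import time


def collect_slowly(keys, fetch, max_calls_per_hour, sleep=time.sleep):
    """Call fetch(key) for each key, pausing between calls to respect a rate limit."""
    if max_calls_per_hour <= 0:
        raise ValueError("max_calls_per_hour must be positive")
    pause = 3600.0 / max_calls_per_hour
    results = {}
    for index, key in enumerate(keys):
        if index > 0:
            sleep(pause)  # wait before every call except the first
        results[key] = fetch(key)
    return results


# Demo with a stand-in fetch and a fake sleep, so it finishes instantly.
pauses = []
demo = collect_slowly(
    ["a", "b", "c"],
    fetch=str.upper,
    max_calls_per_hour=1800,
    sleep=pauses.append,
)
```

Passing `sleep` in as a parameter keeps the function testable; in a real run you would leave the default and let the job take its weeks.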
When it’s not practical to use your API, create a data dump. At bitly, we have one open dataset, the 1.usa.gov data (clicks on 1.usa.gov links, stripped of all personally identifiable data). If I’m approached by students who are looking for a dataset for a homework project or someone who just wants to play, I point them there.
Often, though, researchers are investigating specific questions that require a specific sample of the data. We’ve worked with people investigating the spread of malware, studying the use of social networks during the Arab Spring, and looking at how effective the media was during the Fukushima crisis in Japan, to name just a few examples.
(Academics: every so often someone asks for “a copy of your database”. That’s not feasible for technical or business reasons. It’s more productive to describe your work and how you think the data would help. We’ll figure out what’s possible.)
When we do create and share a dataset, the data we share is always stripped of any potentially identifying information. Still, we require researchers to sign an NDA. The NDA basically states:
- You will not share this data with anyone.
This is not ideal for the research community, but necessary, according to our lawyers. We simply ask anyone who wants a copy of the same dataset for academic pursuits to sign the same NDA.
- You may publish whatever you like.
Academic freedom FTW.
- We reserve the right to use anything that you invent while using our data.
This is actually not because we expect people to come up with anything commercially useful to us that we haven’t already thought of, but to protect us should someone assert later that we used one of their ideas in a product.
And that’s it! To summarize, how to share data, in order of preference:
- Your API
- A public dataset
- A not-public custom-built and legally protected dataset
Now, go out, share, and research.
Data scientists need data, and good data is hard to find. I put together this bitly bundle of research quality data sets to collect as many useful data sets as possible in one place. The list includes such exciting and diverse things as spam, belly buttons, item pricing, social media, and face recognition, so you know there’s something that will intrigue anyone.
Have one to add? Let me know!
(I’ve shared the bundle before, but this post can act as unofficial homepage for it.)
There must be a better way to explore books.
A random way to explore books would be a good way to start.
Hence, bookbookgoose. Browse randomly. Enjoy!
Hint: use the ‘n’ key to go forward quickly. I find about 0.2% of the books are awesome.
Update: you can now find @bookbookgoose on Twitter, sharing one random book per hour.
Update: Dustin Kurtz at Melville House had an eloquent writeup of the beauty in this random literature.