Getting Started with Data Science
Posted: December 28, 2012 | Author: Hilary Mason | Filed under: blog | Tags: advice, datascience, hacking, learning | 16 Comments
I get quite a few e-mail messages from very smart people who are looking to get started in data science. Here’s what I usually tell them:
The best way to get started in data science is to DO data science!
First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.
Second, get to know other data scientists! If you’re in New York, try the DataGotham events list to find some meetups, and make sure to stay for the beers. Look for groups, like DataKind, that need data skills put to work for good. No matter how much of a beginner you might be, your enthusiasm will be appreciated, you’ll learn things, and you’ll meet great people. And if you can’t find a physical meetup close to you, start one, or join the twitter discussion.
Third, put your projects out in public. Share them on GitHub, your blog, and Twitter. Explain why you thought the question was interesting, where you got the data (and good data is everywhere), and how you came to a conclusion. It doesn’t have to be perfect. A couple of examples of data projects motivated by nothing more than the author’s curiosity are Yvo’s TechCrunch analysis and Drew and John’s Ranking the Popularity of Programming Languages.
Finally, you can start right here. What advice do you give? What great projects have you seen lately? Share them in the comments.
Where’s the API that can tell me that this photo contains a puppy and a can of Coke?
Posted: November 5, 2012 | Author: Hilary Mason | Filed under: blog | Tags: api | 18 Comments
We’ve gotten very good at extracting and disambiguating entities from text data. You can license a commodity system, and there are APIs and even open source tools that work fairly well.
However, a large percentage of content that people share is not primarily text (a back-of-the-envelope guess says around 18%), and we currently have very little automated insight into that content.
I know this is a very hard problem, but I’m continuously surprised by how few people seem to be working on it. Any ideas?
Help, I’m the first data scientist at my company!
Posted: September 21, 2012 | Author: Hilary Mason | Filed under: blog | Tags: datagotham, datascience, panel, presentations | 2 Comments
I moderated a panel at DataGotham with Adam Laiacano from Tumblr, Fred Benenson from Kickstarter, and Roberto Medri from Etsy about being the first data scientist at a company. We covered everything from what people’s job responsibilities are, the tools they use, successes, failures, how they are integrated into an organization, and how they have hired other data scientists to join them. The panelists were concise, articulate, and intelligent. Watch it below!
Hey Yahoo, You’re Optimizing the Wrong Thing
Posted: September 18, 2012 | Author: Hilary Mason | Filed under: blog | Tags: data, design, product, yahoo | 26 Comments
I was visiting my grandparents yesterday, and my grandfather asked for help e-mailing an article to some of his friends. I asked him to show me how he normally writes an e-mail, and taught him the magic of copy and paste (it is amazing if you haven’t seen it before), but I noticed that in the course of sending an e-mail and checking on his inbox, he clicked on this ad three times.

When I asked about it, he didn’t realize he had clicked the ad — he just thought these screens popped up randomly — because he didn’t realize that his hands were shaking on the trackpad.
I’m sure the data says that that’s the optimal place on the screen for the ad. I’m sure tons of people ‘click’ on it. I’m also sure it’s wrong, and it results in a terrible experience.
It’s common sense, but experiences like this are great reminders that data only takes us so far, and creativity and clear thinking are always required to find the best solutions.
Yahoo, please fix this!
How do you prioritize research?
Posted: August 28, 2012 | Author: Hilary Mason | Filed under: blog | Tags: datascience, startups | 14 Comments
One of the most fun and challenging parts of my job is setting bitly’s research agenda. We’re a startup, so this means prioritizing the set of questions we look into in the context of what will be most beneficial for the rest of the business, in both the short and long term, by creating opportunity and opening up potential futures. We work on a wide variety of projects, from pure research to press collaborations to infrastructure and experimental products.
We always have a list of research questions way longer than we have time and resources to pursue, so we developed a process for evaluating whether a given question is worth pursuing at a particular time.
This is the kind of process that I’ve only discussed with several people over whisky (thanks!), but not seen written up. I initially had a much longer list of questions but have decided to keep it as simple as possible, to frame a discussion but not dictate or burden it. I hope it’s helpful and I would love to hear about other approaches.
For each research question that we might look into, we ask the following:
- State the research question.
- How do we know when we’ve won?
- Assume we’ve solved this question perfectly. What are the first things that we’ll build with it?
- If everyone in the world uses this, how does it change human behavior?
- What’s the most evil thing that can be done with this?
State the research question.
It’s important to state the question in language that everyone can understand. The bitly team comes from a variety of scientific and business backgrounds, and we’ve developed some of our own common vocabulary, but it still takes a bit of effort to make sure that everyone understands the fundamental challenge and why it’s interesting.
How do we know when we’ve won?
Here we define the metrics that we’ll use to measure our success. For some questions, this is obvious, and for others it’s impossible to define — we can at least acknowledge that ahead of time.
Assume we’ve solved this question perfectly. What are the first things that we’ll build with it?
This question allows us to assess the potential business and product impact. What capabilities will we have with this that we don’t have now? It allows us to keep the long-term research vision in mind while still optimizing for shorter-term opportunities.
If everyone in the world uses this, how does it change human behavior?
What’s the maximum potential impact of this work? If it’s not inspiring, is it worth pursuing at all?
What’s the most evil thing that can be done with this?
I don’t ask this question to encourage evil (>:]) but as a creative tool for expanding how we think about validity, impact, and potential applications of the research. The label evil is so ridiculous that it permits people to share their craziest ideas. Plus, it’s always a fun conversation to have.
Finally
I’m always revising this list, and I would love to hear how you think about prioritizing your work.
DataGotham: The Empire State of Data
Posted: August 22, 2012 | Author: Hilary Mason | Filed under: blog, projects | 2 Comments
I’m extremely excited about DataGotham, a conference that I’m co-hosting with friends and fellow New York data nerds Drew, John, and Mike.
DataGotham is a celebration of the NYC data community, and will bring together professionals from all industries in New York that are built around data, from finance to fashion and from startups to the Fortune 500 and government. The event is September 13th – 14th at NYU, with tutorials and The Great Data Extravaganza Show (with cocktails!) at the Tribeca Rooftop Thursday evening, and a single track conference Friday. Our speakers and sponsors are all amazing. You can register now.
While DataGotham is definitely a labor of love, there are numerous reasons to do it. I believe that New York has a distinct data philosophy, the study of human behavior, that is unique and should be celebrated. We have a large population of local badass data hackers, and our community will only grow stronger if we can build relationships across the industry divides. Finally, there’s an opportunity for all of us to influence the future of data science, and this event will highlight some voices that might not otherwise be heard.
I hope to see you there!
(Also, anyone who made it this far through can register with code “dataGothamist” for 25% off.)
Why I love New York City
Posted: August 19, 2012 | Author: Hilary Mason | Filed under: blog | 14 Comments
New York is infinite.
A human can only explore a place at a particular speed. The rate of change in New York exceeds the rate at which a person can possibly experience the city, and so it is impossible to run out of city to experience.
New York is a neighborhood.
At the same time, New York is a mosaic of wonderful little neighborhoods. What many visitors miss and all residents know is that you rarely have to walk more than a few blocks from home for any of life’s essentials, and enough people do the same that you find yourself saying hello.
New York is chaotic.
You are never the weirdest thing you see. The city will give you things to think about, and more. It is never boring, and you cannot take it for granted.
New York is opportunity.
Everyone comes through New York, eventually. Whatever food, material goods, or unexpected experiences you look for can be found or made here, including Japanese Hotdogs, Cuban Chinese restaurants, salons only for curly-haired girls, and spontaneous art and even opera.
But what, no tech scene?
New York does have a thriving tech community full of wonderful people that I’m excited to work alongside for many years, but I fell in love with the city long before it belonged to us.
BTW, if you’ve seen Avengers or Batman, you know that no city is destroyed so beautifully as ours.
Identity Slippage, and what’s the weirdest thing you’ve been e-mailed by accident?
Posted: January 26, 2012 | Author: Hilary Mason | Filed under: blog | 31 Comments
I have an old, short, and concise gmail address (my first initial and last name at gmail.com). There are many other hmasons in the world who have since signed up for gmail, with variations on the “hmason” theme. Every so often, they mistype the address, or someone mishears it. I now receive between four and ten pieces of e-mail per week meant for other hmasons. This was pretty amusing until someone opened an Amazon account on that address (which I had to shut down). Poor Holly has never seen a single Citibank credit card statement (and Citibank won’t remove the e-mail address from the account when I call, since I’m not the account holder). Heidi hasn’t linked her PayPal account to her bank account, but I’m waiting for someone to send her money.
This sort of unwitting misattribution results in an identity slippage that could actually have some fairly interesting consequences. We’ve settled on e-mail as a unique identifier across platforms, but we increasingly cannot rely on that assumption.
I saw Chris Adam’s comment on Twitter this morning and can’t agree more: it should become standard practice to confirm an e-mail address before sending personally identifiable or sensitive data. Now, please.
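For the curious, here is a minimal sketch of what "confirm before you send" could look like, assuming a service that e-mails a signed link and only treats the address as verified once the recipient clicks it. The names here (SECRET_KEY, TOKEN_TTL_SECONDS, make_confirmation_token, is_confirmed) are hypothetical and made up for illustration; this is not how any of the companies mentioned above actually work.

```python
import hashlib
import hmac
import time

# Hypothetical secret; a real service would load this from secure configuration.
SECRET_KEY = b"replace-with-a-real-secret"
# Assumed validity window for a confirmation link (24 hours).
TOKEN_TTL_SECONDS = 24 * 60 * 60


def _sign(email, issued_at):
    """HMAC-SHA256 over the lowercased address and the issue time."""
    message = f"{email.lower()}|{issued_at}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_confirmation_token(email, now=None):
    """Return 'timestamp.signature', a token tied to exactly one address."""
    issued_at = int(time.time() if now is None else now)
    return f"{issued_at}.{_sign(email, issued_at)}"


def is_confirmed(email, token, now=None):
    """True only if the token was issued for this address and has not expired."""
    try:
        issued_text, signature = token.split(".", 1)
        issued_at = int(issued_text)
    except ValueError:
        return False  # malformed token
    current = time.time() if now is None else now
    if current - issued_at > TOKEN_TTL_SECONDS:
        return False  # link is too old
    expected = _sign(email, issued_at)
    # Constant-time comparison so the check does not leak timing information.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The idea is that statements and receipts only go out after is_confirmed returns True for the address on file, so a mistyped "hmason" never receives someone else's mail.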
Going to Strata in Feb?
Posted: January 12, 2012 | Author: Hilary Mason | Filed under: blog | Tags: conferences, quick | 1 Comment
Are you planning to attend Strata in Santa Clara at the end of February? Reach out for a discount registration code.
Why do I miss google calendar invites?
Posted: January 2, 2012 | Author: Hilary Mason | Filed under: blog | Tags: calendar, configuration, google | 2 Comments
I keep missing Google calendar invites on both my personal and work accounts. I’ve had my Google account for years (since 2004?) and assumed it was some quirk of how I had configured something along the way.
Today I was following Google’s instructions for syncing calendars with an iOS device and discovered that if you click calendar settings (which means click the gear icon then ‘calendar settings’), then ‘calendar’, then ‘notifications’ next to the calendar that you care about, you can turn on e-mail and SMS notifications for any given calendar.
(I’ll save my ranting about the number of clicks it takes to find and configure anything on Google’s properties for another time.)
I’m sharing this on the theory that I’m not the only one with this particular frustration. I hope it saves someone from missed opportunities and useless rage!



