Data engineering is what you’re doing when the architecture of your system depends on the characteristics of the data flowing through it.
It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it.
Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data before you design the system (and perhaps even an active experimental phase, where you try multiple infrastructure options in the wild before making a final decision). I’ve seen numerous people run straight into walls when they ignore this research requirement.
Forget Table is one example of a data engineering project from our work at bitly. It’s a database for storing non-stationary categorical distributions. We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on github).
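The core idea — a categorical distribution whose old observations fade away at a configurable rate — can be sketched with simple exponential decay. This is an illustrative toy, not Forget Table’s actual implementation, and the names here are made up:

```python
import math
import time

class ForgettingCounter:
    """A toy categorical counter whose counts decay exponentially,
    so the distribution tracks recent data rather than all history.
    (Illustrative only -- not Forget Table's actual algorithm.)"""

    def __init__(self, half_life_seconds):
        # The "rate of change" knob: after one half-life, an old
        # observation contributes half as much as a fresh one.
        self.decay = math.log(2) / half_life_seconds
        self.counts = {}  # category -> (count, last_update_time)

    def _decayed(self, category, now):
        count, last = self.counts.get(category, (0.0, now))
        return count * math.exp(-self.decay * (now - last))

    def observe(self, category, now=None):
        now = time.time() if now is None else now
        self.counts[category] = (self._decayed(category, now) + 1.0, now)

    def distribution(self, now=None):
        """Normalize the decayed counts into a categorical distribution."""
        now = time.time() if now is None else now
        totals = {c: self._decayed(c, now) for c in self.counts}
        z = sum(totals.values()) or 1.0
        return {c: v / z for c, v in totals.items()}
```

Because every count decays toward zero, the distribution you read back reflects what the stream looks like *now*, which is exactly what you want when you know the underlying distribution drifts.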
Google Now is an extension to Google’s Android search app that uses all of the data that Google has about you along with what it can guess about your current context to present the information it thinks you need when it thinks you need it.
It’ll tell you to leave a bit early to make your next calendar event because of heavy traffic, or that it’s a friend’s birthday, or that there’s a cool cafe nearby where you are.
I think it’s amazing.
It’s amazing because this is the first Google product that takes ALL OF THE DATA that they have about us and actually makes it useful for us. Not for advertisers.
I’m gathering a bundle of data science blogs to share. I’m looking to include blogs that update regularly and aren’t either personal opinion and project blogs (like this one) or primarily about marketing any particular company. Let me know if you have a favorite that I’ve forgotten.
If you’re just looking for one place to start, hop on over to Simply Statistics.
I ended up at NYC Resistor on Sunday, and decided to experiment with physical visualization of some data. I grabbed the clicks per second on keyphrases including my name (“hilary mason”) over the last six months, aggregated them by day, and made this graph:
This is easy enough to construct for any phrase using the clickrate data that we’re calculating at bitly. I exported it from matplotlib in svg, added a label, and used the laser-cutter to create this out of plywood:
…which will shortly be adorning my desk at work. This is very simple, but there’s a lot of fun to be had with the physical manifestation of patterns we see in large amounts of ephemeral data.
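The aggregation step — rolling per-second click counts up to daily totals before plotting — is simple enough to sketch with the standard library (the data here is made up, not the actual bitly clickrate feed):

```python
from collections import defaultdict
from datetime import datetime, timezone

def clicks_per_day(events):
    """Aggregate (unix_timestamp, click_count) pairs into daily totals,
    returned in chronological order."""
    daily = defaultdict(int)
    for ts, clicks in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        daily[day] += clicks
    return dict(sorted(daily.items()))
```

From there it’s one line of matplotlib to plot the daily series, and `savefig("clicks.svg")` to get the SVG for the laser cutter.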
I have a Google alert set up for my name, and over the weekend it sent me here.
Update: Bing has removed the page and now redirects to a regular search.
It’s a page on Bing Celebrities, merging my information with information about Hilary Mason, the (now deceased) British actress. According to this page, I have starred in movies before I was born and made videos after I died. It’s my photo and her filmography.
It’s creepy, but it’s also intriguing. How does this happen?
The data is credited to AMG and inbaseline, whose domain, though linked directly from Bing, does not resolve. Entity disambiguation is certainly a challenge, but I expect more from Microsoft, with so much data and so many brains.
This kind of error makes it extremely clear that identity is not a solved problem. I’ve written a bit about identity slippage before. And that people are especially sensitive to errors about themselves.
This isn’t the first time a search engine has confused me with the other Hilary Mason, though the first time it was cuil (remember that?), and that time it was her photo as Ugly Hag and my bio. I’ll take it Bing’s way, thank you!
Last week I wrote a bit about how to share data with academics. This is the complementary piece, on why you should invest the time and energy in sharing your data with the academic community.
As I was talking to people about this topic it became clear that there are really two different questions people ask. First, why do this at all? And second, what do I tell my boss?
Let’s start with the second one. This is what you should tell your boss:
- Academic research based on our work is a great press opportunity and demonstrates that credible people outside of our company find our work interesting.
- Having researchers work on our data is an easy way to access highly educated brainpower, for free, that in no way competes with us. Who knows what interesting stuff they’ll come up with?
- Personal relationships with university faculty are the absolute best way to recruit talent. If we invest a little bit of time in building a strong relationship with this professor, she’ll know the kind of people we’re looking for and send us her best students.
All of these points are valid, but they aren’t complete. As a startup, you’re most likely building a product at the intersection of a just-now-possible technology and a most-likely-ready market. The further the research in your field moves, the greater the number of possible futures for your company. And the greater the awareness of your type of technology in the community, the larger the market is likely to actually be.
Your company is one piece of a complex system, and the more robust that system becomes, the more possibilities there are for you. Share data, and you make the world a more interesting place in a direction that you’re interested in.
This post assumes that you want to share data. If you’re not convinced, don’t worry — that’s next on my list.
You and your academic colleagues will benefit from having at least a quick chat about the research questions they want to address. I’ve read every paper I’ve been able to find that uses bitly data and all of the ones that acquired the data without our assistance had serious flaws, generally based on incorrect assumptions about the data they had acquired (this, unfortunately, makes me question the validity of most research done on commercial social data without cooperation from the subject company).
The easiest way to share data is through your own API. Set generous rate limits where possible. Most projects are not realtime and they can gather the data (or, more likely, have a grad student gather the data) over the course of weeks. Using the API also has the side-effect of having the researchers bound only by your ToS.
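From the researcher’s side, gathering a dataset through an API over weeks looks roughly like this — a minimal sketch where `fetch_page` stands in for whatever API client they’re actually using:

```python
import time

def gather_pages(fetch_page, n_pages, max_calls_per_sec=5):
    """Pull a dataset page by page through an API, pausing between
    calls to stay politely under the rate limit. `fetch_page` is a
    stand-in for a real API call and is a hypothetical name."""
    pause = 1.0 / max_calls_per_sec
    results = []
    for page in range(n_pages):
        results.extend(fetch_page(page))
        time.sleep(pause)
    return results
```

Most academic projects aren’t realtime, so a loop like this running for a few weeks (usually on a grad student’s machine) is entirely adequate.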
When it’s not practical to use your API, create a data dump. At bitly, we have one open dataset, the 1.usa.gov data (clicks on 1.usa.gov links, stripped of all personally identifiable data). If I’m approached by students who are looking for a dataset for a homework project or someone who just wants to play, I point them there.
Often, though, researchers are investigating specific questions that require a specific sample of the data. We’ve worked with people investigating the spread of malware, studying the use of social networks during the Arab Spring, looking at how effective the media was during the Fukushima crisis in Japan, just for a few examples.
(Academics: every so often someone asks for “a copy of your database”. That’s not feasible for technical or business reasons. It’s more productive to describe your work and how you think the data would help. We’ll figure out what’s possible.)
When we do create and share a dataset, the data we share is always stripped of any potentially identifying information. Still, we require researchers to sign an NDA. The NDA basically states:
- You will not share this data with anyone.
This is not ideal for the research community, but necessary, according to our lawyers. We simply ask anyone who wants a copy of the same dataset for academic pursuits to sign the same NDA.
- You may publish whatever you like.
Academic freedom FTW.
- We reserve the right to use anything that you invent while using our data.
This is actually not because we expect people to come up with anything commercially useful to us that we haven’t already thought of, but to protect us should someone assert later that we used one of their ideas in a product.
And that’s it! To summarize, how to share data, in order of preference:
- Your API
- A public dataset
- A not-public custom-built and legally protected dataset
Now, go out, share, and research.
We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.
First, we share the analysis that we do at the link level. Every developer using data from the web has the same set of problems — what are the topics of those URLs? What are their keywords? Why should you rebuild this infrastructure when we’ve done it already? We’ve also added in a few bits of bitly magic — for example, you can use the /v3/link/location endpoint to see where in the world people are consuming that information from.
Second, we’ve opened up access to a realtime search engine. That’s an actual search engine that returns results ranked by current attention and popularity. Links are only retained for 24 hours, so you know that anything you see is actively receiving attention. If you think of bitly as a stream of stories that people are paying attention to, this search API offers you the ability to filter the stream by criteria like domain, topic, or location (“food” links from Brooklyn is one of my favorites) and pull out the content, in realtime, that meets your criteria. You can test it out with a human-friendly interface at rt.ly.
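Conceptually, the filtering that search does over the stream looks like this — a local sketch over made-up link records, not the API itself:

```python
def filter_stream(links, domain=None, topic=None, city=None):
    """Yield only the link records matching every criterion that was
    given. Record fields here are illustrative, not the API's schema."""
    for link in links:
        if domain and link["domain"] != domain:
            continue
        if topic and topic not in link["topics"]:
            continue
        if city and link["city"] != city:
            continue
        yield link
```

The real endpoint does this server-side against the last 24 hours of attention, so what comes back is always current.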
Finally, we asked the question — what is the world paying attention to right now? We have a system that tracks the rate of clicks (a proxy for attention) on phrases contained within the URLs being clicked through bitly. Then we can look and see which phrases are currently receiving a disproportionate amount of attention. We call these “bursting phrases”, and you can access them with the /v3/realtime/bursting_phrases endpoint. It’s analogous to Twitter’s trending topics, but based on attention (what people do), not shares (what they say), and across the entire social web.
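The “disproportionate attention” idea can be sketched as a ratio of each phrase’s current click rate to its historical baseline. The threshold and function names here are hypothetical, not the production algorithm:

```python
def bursting_phrases(current_rate, baseline_rate, min_ratio=3.0):
    """Return (phrase, burst_ratio) pairs for phrases whose current
    click rate is at least `min_ratio` times their baseline rate,
    sorted by how hard they're bursting."""
    bursts = []
    for phrase, rate in current_rate.items():
        baseline = baseline_rate.get(phrase, 0.0)
        # A phrase with no baseline at all is maximally surprising.
        ratio = rate / baseline if baseline > 0 else float("inf")
        if ratio >= min_ratio:
            bursts.append((phrase, ratio))
    return sorted(bursts, key=lambda pr: pr[1], reverse=True)
```

The key property is that a steadily popular phrase never bursts, no matter how high its absolute rate — only a sudden departure from its own baseline registers.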
I’m extremely excited to see what people build with these tools.
Great data scientists come from such diverse backgrounds that it can be difficult to get a sense of whether someone is up to the job in just a short interview. In addition to the technical questions, I find it useful to have a few questions that draw out the more creative and less discrete elements of a candidate’s personality. Here are a few of my favorite questions.
- What was the last thing that you made for fun?
This is my favorite question by far — I want to work with the kind of people who don’t turn their brains off when they go home. It’s also a great way to learn what gets people excited.
- What’s your favorite algorithm? Can you explain it to me?
I don’t know any data scientists who haven’t fallen in love with an algorithm, and I want to see both that enthusiasm and that the candidate can explain it to a knowledgeable audience.
Update: As Drew pointed out on Twitter, do be aware of hammer syndrome: when someone falls so in love with one algorithm that they try to apply it to everything, even when better choices are available.
- Tell me about a data project you’ve done that was successful. How did you add unique value?
This is a chance for the candidate to walk us through a success and show off a bit. It’s also a great gateway into talking about their process and preferred tools and experience.
- Tell me about something that failed. What would you change if you had to do it over again?
This is a tricky question, and sometimes it takes people a few tries to get to a complete answer. It’s worth asking, though, to see that people have the confidence to talk about something that went awry, and the wisdom to have recognized when something they did was not optimal.
- You clearly know a bit about our data and our work. When you look around, what’s the first “why haven’t you done X?” that comes to mind?
Technical competence is useless without the creativity to know where to focus it. I love when people come in with questions and ideas.
- What’s the best interview question anyone has ever asked you?
I’d like to wish for more wishes, please.
I’m always looking for new and interesting things to add to my list, and I’d love to hear your suggestions.
I get quite a few e-mail messages from very smart people who are looking to get started in data science. Here’s what I usually tell them:
The best way to get started in data science is to DO data science!
First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.
Second, get to know other data scientists! If you’re in New York, try the DataGotham events list to find some meetups, and make sure to stay for the beers. Look for groups, like DataKind, that need data skills put to work for good. No matter how much of a beginner you might be, your enthusiasm will be appreciated, you’ll learn things, and you’ll meet great people. And if you can’t find a physical meetup close to you, start one, or join the twitter discussion.
Third, put your projects out in public. Share them on Github, your blog, and Twitter. Explain why you thought the question was interesting, where you got the data (and good data is everywhere), and how you came to a conclusion. It doesn’t have to be perfect. A couple examples of data projects motivated by nothing more than the author’s curiosity are Yvo’s TechCrunch analysis and Drew and John’s Ranking the Popularity of Programming Languages.
Finally, you can start right here. What advice do you give? What great projects have you seen lately? Share them in the comments.