Data engineering is when the architecture of your system is dependent on characteristics of the data flowing through that system.
It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it.
Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data, before you design the system (and perhaps even requiring an active experimental process where you try multiple infrastructure options in the wild before making a final decision). I’ve seen numerous people run straight into walls when they ignore this research requirement.
Forget Table is one example of a data engineering project from our work at bitly. It’s a database for storing non-stationary categorical distributions. We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on github).
We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.
First, we share the analysis that we do at the link level. Every developer using data from the web has the same set of problems — what are the topics of those URLs? What are their keywords? Why should you rebuild this infrastructure when we’ve done it already? We’ve also added in a few bits of bitly magic — for example, you can use the /v3/link/location endpoint to see where in the world people are consuming that information from.
Second, we’ve opened up access to a realtime search engine. That’s an actual search engine that returns results ranked by current attention and popularity. Links are only retained for 24 hours, so you know that anything you see is actively receiving attention. If you think of bitly as a stream of stories that people are paying attention to, this search API offers you the ability to filter the stream by criteria like domain, topic, or location (“food” links from Brooklyn is one of my favorites) and pull out the content, in realtime, that meets your criteria. You can test it out with a human-friendly interface at rt.ly.
Finally, we asked the question — what is the world paying attention to right now? We have a system that tracks the rate of clicks – a proxy for attention – on phrases contained within the URLs being clicked through bitly. Then we can look and see which phrases are currently receiving a disproportionate amount of attention. We call these “bursting phrases”, and you can access them with the /v3/realtime/bursting_phrases endpoint. It’s analogous to Twitter’s trending topics, but based on attention (what people do), not shares (what they say), and across the entire social web.
I’m extremely excited to see what people build with these tools.
I gave a talk called A Data-driven Look at the Realtime Web Ecosystem at the Web2Expo SF conference in May in San Francisco. I attempted to highlight some of the interesting facets of the bit.ly data set, and it appeared to be well-received (showing up on TechCrunch, ZDNet, and a few other places).
I attended the full conference, and it was great. The attendees were extremely international and I met a ton of fascinating people.
I’m still getting a couple of e-mail requests per week for my slides and materials, so they’re posted below for posterity.
And the video:
As always, I welcome your questions or comments.