Data EngineeringPosted: April 1, 2013 | Author: Hilary Mason | Filed under: blog | Tags: bitly, data, engineering, infrastructure | 5 Comments »
Data engineering is when the architecture of your system is dependent on characteristics of the data flowing through that system.
It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it.
Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data, before you design the system (and perhaps even requiring an active experimental process where you try multiple infrastructure options in the wild before making a final decision). I’ve seen numerous people run straight into walls when they ignore this research requirement.
Forget Table is one example of a data engineering project from our work at bitly. It’s a database for storing non-stationary categorical distributions. We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on github).