Data Engineering

Data engineering is when the architecture of your system depends on the characteristics of the data flowing through it.

It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it.

Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data, before you design the system. It may even require an active experimental process, where you try multiple infrastructure options in the wild before making a final decision. I’ve seen numerous people run straight into walls when they ignore this research requirement.

Forget Table is one example of a data engineering project from our work at bitly. It’s a database for storing non-stationary categorical distributions. We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on GitHub).
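
To make "non-stationary categorical distribution" concrete, here is a minimal sketch of the general idea. It is not Forget Table's actual implementation or API: the class name, the half_life_seconds parameter, and the plain exponential decay are all assumptions chosen for illustration. Each category's count shrinks as time passes, so recent observations dominate the estimated distribution, and the half-life plays the role of the configurable rate of change.

```python
import math
from collections import defaultdict


class DecayingCategoricalCounter:
    """Toy categorical distribution whose counts decay exponentially.

    Hypothetical illustration only; not the Forget Table API.
    """

    def __init__(self, half_life_seconds):
        # Assumed parameter: time after which an unrefreshed count halves.
        self.decay_rate = math.log(2) / half_life_seconds
        self.counts = defaultdict(float)
        self.last_update = None

    def _decay_to(self, now):
        # Shrink every count by the time elapsed since the last update.
        if self.last_update is None:
            self.last_update = now
            return
        elapsed = now - self.last_update
        if elapsed > 0:
            factor = math.exp(-self.decay_rate * elapsed)
            for category in self.counts:
                self.counts[category] *= factor
            self.last_update = now

    def observe(self, category, now):
        # Decay existing counts first, then record the new observation.
        self._decay_to(now)
        self.counts[category] += 1.0

    def distribution(self, now):
        # Normalize the decayed counts into probabilities.
        self._decay_to(now)
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {c: v / total for c, v in self.counts.items()}


table = DecayingCategoricalCounter(half_life_seconds=60)
table.observe("a", now=0)
table.observe("b", now=60)
print(table.distribution(now=60))
```

In this run, "a" was seen one half-life before "b", so it ends up with about a third of the probability mass and "b" with about two thirds. A shorter half-life forgets faster, which suits data that drifts quickly; picking that value is exactly the kind of thing you need to research in your own data first.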


  • Alan Klement

    Could you talk a bit about the differences between Data Engineering and Data Science? I imagine it has to do with one being more about knowledge exploration through hypothesis testing and the other about figuring out how to create something from that knowledge.

    I’d like to know more about how you see the two.

    • Harsha Srivatsa (http://www.facebook.com/hsrivatsa)

      I suppose that Data Engineering comes in for designing, engineering and implementing a Big Data processing platform that depends on the type of data being handled. Data Science would then be the analysis of such data to generate insights, validate hypotheses, and support use cases such as sentiment analysis, recommendation engines etc. An analogy I can think of is that of building a dam, which depends on the landscape, the amount of water flowing and the flow characteristics. Once the water is stored and managed, you can do neat things with it, such as generating electricity, irrigation etc.

      • Alan Klement

        Doesn’t seem to match what Hilary is saying:

        “Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data, before you design the system”

        She puts the analysis before the engineering.

  • C

    Forget Table looks like a really great concept. Simple and yet effective!

  • Kevin Bretonnel Cohen

    I’ve been running into this recently in working with an ex-physicist who’s trying to get into natural language processing. None of his previous experience with data exploration really works for linguistic data, and it frustrates him and slows him down.