Data science and data engineering (should) go together like peas and carrots


Image source: BuckandLibby

Wait—data science and data engineering aren’t the same?

We get this question a lot, both from data-heavy organizations and in casual conversation. And it makes some sense, until you start replacing “data” with other words. No one would mistake a chemist (someone with chemistry degrees) for a chemical engineer, or, say, an electrical engineer for either a physicist or an electrician! And no construction company would have the architect rather than a structural engineer finalize its blueprints.

In every other context, engineer connotes something very specific, so why is data engineering still being conflated with data science? It may be that data, as an everyday, cross-industry concern, is still so new and so constantly evolving that we lump all data professionals together for our own mental comfort. That’s understandable, but the price of continuing this way is too high. Simply put, leaving this confusion in place is bad for business.

Understanding and harnessing both the differences and similarities between data science and data engineering can help your organization get ahead instead of just getting by. Because many companies and hiring managers don’t realize just how different these roles are, organizations may hire only data scientists and then wonder why they’re floundering while other data-driven companies easily outpace them.

Having only data science resources may work out in the short term, or for initiatives centered on ad hoc analysis. In the long run, though, your organization won’t thrive, and your data scientists won’t be able to do their jobs properly.

Data science unsupported by data engineering can create a nightmarish situation in which no one can trust the data meant to fuel strategic projects and decision-making. This lack of cooperation and data governance can worsen when organizations adopt new data science and engineering tools without enough staff who can use them effectively. New tools automate many once-manual activities, so data scientists and data engineers should be able to focus more on strategy. But if your data lacks integrity because you don’t have data engineers properly collecting and managing it, your strategy will be misguided at best and dangerously incomplete or inaccurate at worst.

Let’s lay out what data scientists do versus what data engineers do. Then, we’ll outline how they can—and absolutely should—work together for your data-driven organization’s long-term success.

Data science defined

Data science is evolving quickly, so defining it can be tricky, and perhaps not even desirable. Its ongoing development is what makes data science both powerful and vulnerable to misuse or misunderstanding. Misunderstanding is part of the problem, however, so it is only sensible to define what we can:

Basically, data science extracts insights from data. Such insight generally takes three forms: descriptive (what has occurred), predictive (what could or will occur), and prescriptive (how to respond to what is discovered). (This distinguishes data science from business intelligence, which analyzes only past events.) Data science draws knowledge from both structured and unstructured data; it is central to big data and data mining activities, both of which are primary tools for data-driven organizations.
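To make the three forms of insight concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sales figures are invented, the function names (describe, fit_line, predict, prescribe) are hypothetical, and a straight-line trend stands in for whatever model a real team would choose.

```python
from statistics import mean

# Hypothetical monthly unit sales (illustrative numbers, not real data).
monthly_sales = [100, 110, 120, 130, 140, 150]
months = list(range(1, len(monthly_sales) + 1))


def describe(sales):
    """Descriptive: summarize what has already occurred."""
    return {"average": mean(sales), "latest": sales[-1]}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def predict(xs, ys, next_x):
    """Predictive: extrapolate the fitted trend to a future month."""
    slope, intercept = fit_line(xs, ys)
    return slope * next_x + intercept


def prescribe(forecast, stock_on_hand):
    """Prescriptive: turn the forecast into a recommended action."""
    shortfall = forecast - stock_on_hand
    if shortfall > 0:
        return f"order {round(shortfall)} more units"
    return "hold current stock"


summary = describe(monthly_sales)
forecast = predict(months, monthly_sales, next_x=7)
action = prescribe(forecast, stock_on_hand=145)
print(summary, forecast, action)
```

The describe step is roughly where business intelligence stops; the predict and prescribe steps are what the definition above adds on top.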

Data science has become what Turing Award winner Jim Gray called the “fourth paradigm” of science (adding data-driven science to the original three: empirical, theoretical, and computational). In 2015, the American Statistical Association identified machine learning and statistics, database management, and distributed and parallel systems as growing, foundational professional groups, all of them data science sub-disciplines.

Data science’s practical applications

Data mining (turning detected data patterns into business insights) and predictive analytics (determining the statistical chances of certain events occurring) are data scientists’ most important skills. Other crucial operations include:

  • building Python packages
  • placing R in production
  • increasing Spark job efficiency
  • version-controlling data and SQL
  • ensuring models and data can be reproduced
  • storing and maintaining clean data in data lakes and other repositories
  • forecasting time series at scale
  • sharing Jupyter notebooks at scale
  • working with JSON
  • working with SAS
  • storytelling and data visualization
  • working with Hadoop
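
Two items in the list above, version-controlling data and ensuring models and data can be reproduced, can be sketched with nothing but Python's standard library. This is only an illustration under stated assumptions: the rows are invented, and fingerprint and train_test_split are hypothetical helper names, not part of any particular tool.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a SHA-256 digest of the rows in a canonical JSON form.

    Sorting keys and fixing separators means the same data always
    hashes to the same value, regardless of dictionary key order.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def train_test_split(rows, test_fraction, seed):
    """Shuffle a copy with a private, seeded generator, then split it."""
    rng = random.Random(seed)  # does not touch the global random state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset (illustrative values only).
rows = [{"id": i, "value": i * i} for i in range(10)]

# Record the digest alongside the results so a later run can confirm
# it started from exactly the same data.
data_version = fingerprint(rows)

# The same seed yields the same split on every run.
run_a = train_test_split(rows, test_fraction=0.2, seed=42)
run_b = train_test_split(rows, test_fraction=0.2, seed=42)
print(data_version[:12], run_a == run_b)
```

Storing the digest and the seed with each result is a lightweight stand-in for a dedicated data versioning system; it lets someone else check that they are rerunning the same analysis on the same data.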

Today’s data scientists have to be more than technical experts, though. They also need to be able to identify what questions their organizations should be asking, as well as how to source the data required to answer them.

Organizations need to be able to rely on their data scientists to accurately find, manage, and analyze unstructured data en masse, and then to synthesize the findings into easily understood formats for the stakeholders who use them to make strategic business decisions.

Data science continues to change

Data science is already crucial to most businesses; how it matters may change, but that fact will not. Data science is also integrating quickly with artificial intelligence, which detects meaningful patterns in large collections of data and connects related data for later use. But while increasingly useful and complex, AI isn’t yet a real match for human intelligence. AI systems still require massive amounts of data to perform even simple tasks, like editing letters.