
Data science and data engineering (should) go together like peas and carrots


Image source: BuckandLibby

Wait—data science and data engineering aren’t the same?

We get this question a lot, both from data-heavy organizations and in casual conversation. And it makes some sense, until you start replacing “data” with other words. No one would confuse a chemist (someone with chemistry degrees) for a chemical engineer, or, say, an electrical engineer for either a physicist or an electrician! And no construction company would have the architect rather than a structural engineer finalize their blueprints.

In every other context, engineer connotes something very specific, so why is data engineering still being conflated with data science? It may be that data, as an everyday, cross-industry concern, is still so new and evolving so quickly that we lump all data professionals together for our own mental comfort. That’s understandable, but the price is too high to continue this way. Simply put, not clearing up this confusion is bad for business.

Understanding and harnessing both the differences and similarities between data science and data engineering can help your organization get ahead—instead of just get by. Because many companies and hiring managers don’t realize just how different these roles are, organizations may hire only data scientists and then wonder why they’re floundering around while other data-driven companies easily outpace them.

Having only data science resources may work out in the short term or for initiatives centered on ad hoc analysis. But your organization won’t thrive in the long run, and your data scientists won’t be able to do their jobs properly.

Data science unsupported by data engineering can create a nightmarish situation in which no one can trust the data meant to fuel strategic projects and decision-making. This lack of cooperation and data governance can worsen when organizations adopt new data science and engineering tools without enough staff able to use them effectively. New tools automate many once-manual activities, so data scientists and data engineers should be able to focus more on strategy. But if your data lacks integrity because you don’t have data engineers properly collecting and managing it, your strategy will be misguided at best and dangerously incomplete or inaccurate at worst.

Let’s lay out what data scientists do versus what data engineers do. Then, we’ll outline how they can—and absolutely should—work together for your data-driven organization’s long-term success.

Data science defined

Data science is evolving quickly, so defining it can be tricky and perhaps not even desirable. Its ongoing development is what makes data science both powerful and vulnerable to misuse or misunderstanding. Misunderstanding is part of the problem, however, so defining what we can is only sensible:

Basically, data science extracts insights from data. Such insight generally takes three forms: descriptive (what has occurred), predictive (what could or will occur), and prescriptive (how to respond to what is discovered). This distinguishes data science from business intelligence, which analyzes only past events. Data science draws knowledge from both structured and unstructured data; it is central to big data and data mining activities, both of which are primary tools for data-driven organizations.
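
To make the three forms of insight concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the sales figures are invented, the straight-line trend is only one of many possible predictive methods, and the `WAREHOUSE_CAPACITY` threshold is a hypothetical business rule.

```python
# Hypothetical monthly unit sales; the figures are made up for illustration.
monthly_sales = [100.0, 110.0, 120.0, 130.0]
months = list(range(len(monthly_sales)))

# Descriptive: summarize what has already occurred.
average_sales = sum(monthly_sales) / len(monthly_sales)

# Predictive: fit a straight-line trend (ordinary least squares) and extrapolate.
mean_x = sum(months) / len(months)
mean_y = average_sales
slope = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(months, monthly_sales)
) / sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x
next_month_forecast = intercept + slope * len(months)

# Prescriptive: turn the forecast into a recommended action.
WAREHOUSE_CAPACITY = 135.0  # assumed limit, not taken from the article
recommendation = (
    "expand capacity" if next_month_forecast > WAREHOUSE_CAPACITY else "hold steady"
)

print(average_sales, next_month_forecast, recommendation)
```

Real predictive and prescriptive work is far more involved, but the division of labor is the same: describe the past, estimate the future, then decide what to do about it.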

Data science has become what Turing Award winner Jim Gray called the “fourth paradigm” of science (adding data-driven to the original three: empirical, theoretical, and computational). In 2015, the American Statistical Association identified machine learning and statistics, database management, and distributed and parallel systems as growing, foundational professional groups, all of them data science sub-disciplines.

Data science’s practical applications

Data mining (turning detected data patterns into business insights) and predictive analytics (determining the statistical chances of certain events occurring) are data scientists’ most important skills. Other crucial operations include:

  • building Python packages
  • placing R in production
  • increasing Spark job efficiency
  • version-controlling data and SQL
  • ensuring models and data can be reproduced
  • storing and maintaining clean data in data lakes and other repositories
  • forecasting time series at scale
  • sharing Jupyter notebooks at scale
  • working with JSON data
  • programming in SAS
  • storytelling and data visualization
  • working with Hadoop

Today’s data scientists have to be more than technical experts, though. They also need to be able to identify what questions their organizations should be asking, as well as how to source the data required to answer them.

Organizations need to be able to rely on their data scientists to accurately find, manage, and analyze unstructured data en masse; these findings should then be synthesized into easily understood formats for company stakeholders using them to make strategic business decisions.

Data science continues to change

Data science is already crucial to most businesses; how it’s important may change, but this fact will not. Data science is quickly integrating with artificial intelligence, which detects meaningful patterns in large collections of data and connects related data for later use. But while increasingly useful and complex, AI isn’t yet a real match for human intelligence. AI systems still require massive amounts of data to perform even simple tasks, like editing letters.

Data science in your organization

Machine learning-based AI will be an important next step in data science theory and practice. What it won’t change is where in your organization data science work should sit.

[Figure: The data science hierarchy of needs. Source: Hackernoon]

Data science should be at the top of your pyramid of key tasks. But you can’t optimize your business, improve products or services, or make good business decisions without first building complete and robust data foundations.

Any structure set upon shaky foundations will eventually shift, fall, and smash; whether your structure is a new office tower or virtual infrastructure housing organizational data, only one type of professional can ensure your foundational success: a properly trained engineer.
 

Data engineering defined

Data engineering is essential for ensuring your data can support your organization’s strategic goals. How data is made useful depends on the specifics of both the data and the organization to which it belongs. That said, data engineering generally begins at the beginning: creating the storehouses (data warehouses, data lakes, etc.) needed to establish a workable corporate data life cycle.

The data warehouse is where your data engineer should focus their efforts: in building and sustaining a robust, accurate system enabling timely and useful data analysis and reporting. As your organization’s primary repository of integrated data, the data warehouse is where business data for making strategic business decisions comes from—and it should scale easily to support business growth.

Like other branches of engineering, data engineering is highly technical; further, the best data engineers are skilled in myriad related areas, such as programming and mathematics. And like their data scientist counterparts, they require a unique mix of hard and soft skills to succeed. Data engineers should follow data trends and be able to communicate them to colleagues, offer guidance on using organizational data, and:

  • track how data travels within the company’s infrastructure and maintain that infrastructure’s integrity;
  • collect data and determine its best uses (e.g., when and where to automate, developing data set processes, etc.);
  • clean, sort, and organize data;
  • provide data scientists with data ready for running algorithms and queries, including for more complex tasks like machine learning, predictive analytics, and data mining;
  • constantly safeguard and/or improve data reliability, accuracy, and quality; and
  • create analytics and machine-learning programs for data science-focused peers.
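
As a rough illustration of what safeguarding data reliability, accuracy, and quality can look like in practice, the following Python sketch runs three basic checks (missing values, duplicate keys, and out-of-range readings) over a handful of records. The records, the `quality_report` helper, and the valid range are all hypothetical; a production pipeline would typically rely on a dedicated validation tool and far richer rules.

```python
from collections import Counter

# Hypothetical sensor records; the values are invented for illustration.
records = [
    {"sensor_id": "A1", "reading": 20.5},
    {"sensor_id": "A2", "reading": None},    # missing value
    {"sensor_id": "A1", "reading": 20.5},    # duplicate key
    {"sensor_id": "A3", "reading": -999.0},  # placeholder value outside the valid range
]


def quality_report(rows, key, field, valid_range):
    """Count basic data quality problems in a list of dict records."""
    low, high = valid_range
    key_counts = Counter(row[key] for row in rows)
    return {
        "row_count": len(rows),
        "missing": sum(1 for row in rows if row[field] is None),
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        "out_of_range": sum(
            1
            for row in rows
            if row[field] is not None and not low <= row[field] <= high
        ),
    }


report = quality_report(
    records, key="sensor_id", field="reading", valid_range=(-50.0, 150.0)
)
print(report)
```

Checks like these are most useful when they run automatically on every load, so that data scientists learn about a problem before it reaches a model or a report.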

A closer look at required data engineering skills

The above are pretty high-level requirements; let’s look into some important specifics now. It’s a commonplace in the data world that SQL is data’s first language. Any data engineer worth their salt should be expert in SQL/DML/DDL primitives, database execution plans, entity-relationship modeling, dimensional modeling, normalization and denormalization, indices, different join algorithms, and distributed plan dimensions functionality.

Just as importantly, while data engineers have to be able to perform all this without hesitation at any moment, traditional drag-and-drop ETL (extract, transform, and load) data engineering is being supplanted by a more programmatic approach, one that works best when data engineers and data scientists work together. Right now, two key tasks in particular depend on these professionals working in close cooperation:

ETL design

Writing efficient, resilient, scalable ETL will define the future of data engineering—and of data-driven businesses. Knowing how to use all available resources, including databases and related technologies, will mean the difference between success and failure in organizations’ increasingly sophisticated—and dependent—relationship with their own data.

A quick visual reminder of how ETL works:

[Figure: ETL (extract, transform, load). Source: Vineet Goel]

  • Extract. Receive upstream data, then move it to final or interim locations.
  • Transform. Turn raw data into analysis-ready datasets.
  • Load. Send processed data either to a final stop for use or to another interim location for additional ETL treatment.
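
The three steps above can be sketched programmatically in a few lines of Python using only the standard library. This is a minimal illustration, not a production design: the `RAW_CSV` export, the `orders` table, and the cleaning rules are all assumptions invented for the example, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical upstream export; in practice this might be a file, an API, or a queue.
RAW_CSV = """order_id,customer,amount,order_date
1001, Acme Corp ,250.00,2020-01-03
1002,Globex,,2020-01-04
1003,acme corp,99.50,2020-01-04
"""


def extract(raw_text):
    """Extract: read upstream rows as dictionaries, unchanged."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with no amount, normalize names, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # incomplete record; a real pipeline might quarantine it instead
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].strip().title(),
                float(row["amount"]),
                row["order_date"],
            )
        )
    return cleaned


def load(rows, conn):
    """Load: write analysis-ready rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping each step as its own function is one reasonable design choice: every stage can be tested, rerun, or replaced independently, which is much of what makes a pipeline resilient.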

ETL is crucial to successful data warehousing but it’s not the only piece in the data engineering puzzle.

Data modeling

Data modeling involves extracting business information through carefully crafted schemas and data relationships. This is a design-first approach to data engineering, one that (especially when organized as a star schema rather than linearly) provides a more complete and more easily examined view of key business segments, audiences, users, etc.
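
A star schema places a central fact table of measurements in the middle, surrounded by dimension tables that describe each measurement. The Python sketch below builds a tiny one in an in-memory SQLite database. The table names (`fact_sales`, `dim_customer`, `dim_date`), columns, and rows are hypothetical and chosen only to show the shape of the design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe who bought and when.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY, iso_date TEXT, quarter TEXT
    );

    -- The fact table holds measurements plus a key into each dimension.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Acme Corp', 'Enterprise'), (2, 'Globex', 'SMB');
    INSERT INTO dim_date VALUES (1, '2020-01-03', 'Q1'), (2, '2020-04-10', 'Q2');
    INSERT INTO fact_sales VALUES
        (1, 1, 1, 250.0), (2, 2, 1, 100.0), (3, 1, 2, 400.0);
    """
)

# One join per dimension answers a business question about segments and quarters.
sales_by_segment_and_quarter = conn.execute(
    """
    SELECT c.segment, d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY c.segment, d.quarter
    ORDER BY c.segment, d.quarter
    """
).fetchall()
print(sales_by_segment_and_quarter)
```

Because every dimension joins directly to the fact table, adding a new way to slice the data (by region, say) usually means adding one dimension table rather than reworking the whole model.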

Data engineering is changing

Like data science, data engineering is not a static discipline. No longer focused on assembling one-size-fits-all data pipelines or integrating straightforward SQL-sourced data changes, data engineers are indispensable in data-heavy organizations. Data engineers build and manage corporate data infrastructure, made-to-order ingestion pipelines, and non-SQL transformation pipelines for data science counterparts. They also support the whole organization with data resource optimization and their extensive knowledge of both internal and external data trends. Companies basing their success on their data simply can’t thrive without a data engineer (or several) ready to work with a similarly committed data scientist.

Why data scientists and data engineers really need to work together

If you try to run a data-focused company without proper engineering support, your data scientist will suffer; if your data scientist suffers, your whole organization will as well. We can’t emphasize this enough: to do their jobs well and at capacity, data scientists need data engineers on their side.

We conducted an informal LinkedIn survey, asking “How many #datascience people here are regularly tripped up by #dataengineering infrastructure issues? We regularly read about 80/20 issues related to cleaning data vs. analyzing it, but want to see if this is true in practice. Are you really wasting your time working on #ETL, #datawarehouse, and #datalake issues?”

Most respondents asserted that the 80/20 “rule” was accurate; that is, 80% of data work involves preparing the data for use by experts like data scientists. Noting that it can be difficult to convince hiring teams that they really need to staff that 80%, respondents also highlighted the following problems arising from not having data engineers (or “data janitors!”) on staff:

  • reduced efficiency, quality, and reproducibility resulting from lack of proper processes and infrastructure;
  • the “key person risk” problem: depending too much on one employee for mission-critical work;
  • ad hoc approaches to data sometimes work in the short term but always fail down the line; and
  • subpar data engineering is the primary cause of failed data science projects.

These difficulties don’t appear to be improving. The following chart highlights Google searches for data engineering vs. data science vs. machine learning vs. artificial intelligence. Notice which category isn’t growing? Data engineering.

[Figure: Google Trends search interest for data engineering vs. data science vs. machine learning vs. artificial intelligence. Source: Google Trends]

Further, a quick search for Harvard Business Review articles on “data engineering” suggests these Google findings aren’t unusual. Try entering "data engineering site:https://hbr.org" in Google, and compare this with "data science site:https://hbr.org". As of January 2020, only 19 results come up in the former search, and most don’t actually have the phrase “data engineering” in the title. Searching hbr.org for “data science,” on the other hand, yields hundreds of articles, about half of which contain this search term in their titles. What gives?

We see two issues here, both of which we’ve aimed to redress in this article. First, a surprisingly large number of people—including management in data-driven organizations—don’t understand that data science and data engineering are complementary rather than equivalent disciplines.

The second problem is, even in organizations that do hire for both roles, the data scientist too often ends up in the business unit while the data engineer is relegated to IT. Unless such an organization is unusually agile, you can be sure data siloing will occur if the professionals in both roles aren’t interacting daily.

Drilling down: What ignoring data engineering means for the mining industry

In our experience, mine sites usually have only one person gathering data and creating reports for head office. This is a time-consuming process and that employee may have other tasks to attend to; using Excel to organize their findings after hastily extracting incomplete data from multiple systems might be the best they can do.

We surveyed our Reddit connections about two related issues, asking: “a) Does this describe the experience at your site? and b) Does the effort involved mean you avoid preparing similar reports for the local team? How do local teams (superintendent, foreman, etc.) access this information? Do they even access it?”

The results are sadly not surprising, with the answer to a) being mostly yes. And in response to both a) and b), these issues were flagged:

  • even when more sophisticated options are available, like Tableau dashboards that pull data from Excel sheets updated daily, people tend not to use them;
  • automating data collection, organization, and sharing would result in job loss(es); and
  • some data is available to dispatchers in real time, so management feels no pressing need to have additional analytical staff on payroll.

Exceptions seem to occur only in very large mining operations, where it would be impossible to manage data manually. Fair enough. Perhaps you’re wondering if smaller mining operations can get the same relative value from their data using self-serve tools as large mines get from having strong data science and engineering teams? Simply put, no.

Without a data engineer, small and medium mines may be able to get mostly accurate reports from staff who may or may not know what queries to run. But such organizations will be very lucky if they don’t end up with no reports, multiple reports run by untrained staff members using different search criteria, or reports that contradict one another. You see the problem. Only dedicated experts working together can provide a single source of accurate, usable truth.

Don’t wait to build your co-led science and engineering data team

Whatever your industry and regardless of how large your organization, you can benefit from investing in both data science and data engineering, then ensuring they work together by default and as effectively as possible.

Business reliance on data is not only going to continue; it’s going to continue growing exponentially as we see more and more ways to harness it—and as the tools we build, like machine learning and AI, become better able to learn from both what we do and what we have them do. It’s really almost now or never for building the data team that will secure your organization’s future. What are you waiting for?

This article originally appeared on the 3AG blog as Data science and data engineering (should) go together like peas and carrots
