When the Data Scientist role “was relatively new” in 2012, the authors observed that “as more companies attempted to make sense of big data, they realized they needed people who could combine programming, analytics, and experimentation skills.”
Davenport and Patil were correct in their predictions about data science’s importance and ubiquity: As of “2019, postings for data scientists on Indeed had risen by 256%,” a surge in demand that few professions can boast.
And while the assumption “in 2012 was that data scientists could do all required tasks in a data science application…there has been a proliferation of related jobs to handle many of those tasks, including machine learning engineer, data engineer, AI specialist, analytics and AI translators, and data oriented product managers.”
Diversifying data management should have begun in earnest in data science’s earliest days. In 2012, “many data scientists noted that they spend much of their time cleaning and wrangling data”—which is unsurprising. But the authors found in 2022 that “that is still the case despite a few advances in using AI itself for data management improvements” (our emphasis).
What’s (still) the issue?
Sexy and no one knows it?
Data Engineer continues to lag behind Data Scientist as a key role in most companies, across sectors—a fact reflected in there being only one data engineer at the Summer 2022 DMNA conference! And data engineering has never been described as sexy.
To extend the relationship metaphor, data engineering is the stable nice guy too often overlooked in favor of more exciting, even potentially dangerous, alternatives. Or, just as disappointing, as the HBR writers have noted—and which we explored in January 2020—data engineering is still being mixed up with data science.
In our 2020 article on this topic, we discussed how these disciplines can and should complement each other. Both clearly remain less fully understood than is ideal; but we think data engineering may still suffer so much ongoing incomprehension because
- People (think they) know what science (of all kinds) covers but are fuzzy on engineering (of all kinds); Bill Nye the Science Guy and Neil deGrasse Tyson may have a lot to answer for here.
- Industry over-emphasis on data science, from articles like the HBR one to companies hiring for data management, simply amplifies the buzz around data science, which still obscures the importance and even existence of data engineering.
- Change can be scary and exhausting, and life has been entirely too full of change since early 2020.
Data science vs. data engineering (in brief)
Data science pulls insights from your corporate data, both structured and unstructured; it comes in three basic forms:
- descriptive (what has occurred)
- predictive (what may or will happen)
- prescriptive (how to deal with what happens and/or is discovered)
We stand by our original argument that “Data mining and predictive analytics are data scientists’ most important skills” but that “they also need to be able to identify what questions their organizations should be asking, as well as how to source the data required to answer them.”
Data engineering, on the other hand, involves
- gathering, cleaning, sorting, and organizing data
- ensuring data accuracy and quality
- detecting and sharing data trends with colleagues
- helping define company data use
- tracking data movement through company infrastructure and maintaining infrastructure integrity
- setting Data Scientists up for success with data ready for running algorithms and queries, as well as for machine learning, predictive analytics, and data mining
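To make the cleaning and quality items above more concrete, here is a minimal sketch of the kind of preparation step a Data Engineer might run before handing data to a Data Scientist. Everything in it is an assumption for illustration: the `Reading` record, the `prepare` function, the 30-day freshness window, and the rule that two records with the same sensor and date count as duplicates are all hypothetical, not drawn from any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Reading:
    """One hypothetical record, e.g. a sensor value from a plant floor."""
    sensor_id: str
    value: Optional[float]
    recorded_on: date


def prepare(readings: List[Reading], today: date, max_age_days: int = 30) -> List[Reading]:
    """Drop incomplete, stale, and duplicate records, then sort what is left."""
    cutoff = today - timedelta(days=max_age_days)
    seen = set()
    cleaned = []
    for r in readings:
        if r.value is None:          # incomplete: no measurement recorded
            continue
        if r.recorded_on < cutoff:   # old: outside the assumed freshness window
            continue
        key = (r.sensor_id, r.recorded_on)
        if key in seen:              # duplicate: same sensor and date already kept
            continue
        seen.add(key)
        cleaned.append(r)
    return sorted(cleaned, key=lambda r: (r.recorded_on, r.sensor_id))


raw = [
    Reading("pump-1", 7.2, date(2022, 6, 1)),
    Reading("pump-1", 7.2, date(2022, 6, 1)),    # duplicate of the first record
    Reading("pump-2", None, date(2022, 6, 2)),   # missing value
    Reading("pump-3", 5.0, date(2021, 1, 15)),   # older than the 30-day window
    Reading("pump-2", 6.4, date(2022, 6, 3)),
]
ready = prepare(raw, today=date(2022, 6, 10))
for r in ready:
    print(r.sensor_id, r.value, r.recorded_on)
```

In practice the rules would come from the business rather than from hard-coded defaults, but the shape is the same: the checks run before any model sees the data.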
Deep, sustainable business success relies on a “both” rather than a “one or the other” scenario. Data Engineers and Data Scientists should be interacting daily so they don’t end up in silos. The worst result of such silos may be Data Scientists building models with imperfect or out-of-date data, then passing these errors on to the rest of the company.
A long engagement—but the big day is coming
It’s time, first, to recognize that data science requires data engineering to be fully effective, then to bring them usefully together. The good news is, the world of data (which is almost the whole world) is really starting to get it.
Courses, certificates, and other learning paths around data engineering are proliferating. In a 2022 article about how to become a Data Engineer, Coursera notes that Data Engineers can “make a tangible difference in a world where we’ll be producing 463 exabytes per day by 2025.” For perspective: One exabyte is a billion gigabytes, so that’s 463 billion gigabytes every 24 hours; only 1-2GB are required to download a movie.
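The scale of the 463-exabyte figure can be checked with a quick unit conversion. The sketch below assumes decimal (SI) units, where one exabyte is 10^18 bytes and one gigabyte is 10^9 bytes, and a 2 GB movie, which is our illustrative pick from the top of the 1-2GB range; the function name is hypothetical.

```python
# Decimal (SI) units: 1 exabyte = 10**18 bytes, 1 gigabyte = 10**9 bytes.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_GIGABYTE = 10**9


def exabytes_to_gigabytes(exabytes: int) -> int:
    """Convert a whole number of exabytes to gigabytes."""
    return exabytes * BYTES_PER_EXABYTE // BYTES_PER_GIGABYTE


daily_gb = exabytes_to_gigabytes(463)

# Assumption for illustration: one movie download is 2 GB.
movie_gb = 2
movies_per_day = daily_gb // movie_gb

print(f"{daily_gb:,} GB per day")
print(f"roughly {movies_per_day:,} two-gigabyte movies per day")
```

Under these assumptions, one exabyte works out to a billion gigabytes, so the daily total is 463 billion gigabytes.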
Further, Analytics India Magazine (AIM) points to “the increase in digital transformation after the pandemic and the explosion of data following it,” and cites a 2020 Dice Tech Job Report that found “data engineering to be the fastest-growing job in technology with a predicted 50% year-over-year growth in the number of open positions.”
Reflecting this trend, AIM noted that in 2021, LinkedIn published over 29,000 data engineering job ads as organizations contended “with not enough data engineering talent in the market.”
As data engineering becomes both more understood and more ubiquitous, its functions will evolve in lockstep with our collective, ever-expanding relationship with data.
Some predict Data Engineers will increasingly manage “tasks farther up the value chain, such as data modeling, quality, security, management, architecture, and orchestration,” continue “adopting software engineering best practices,” and become conversant in “agile development, code testing, and version control practices – to name a few”—all of which will lead to “new roles within data engineering.”
Setting the date for a blissful union
Evidence, both quantitative and anecdotal, suggests data engineering may still be under-appreciated and under-represented by companies trying to harness their data. However, the evidence also suggests that as organizations, and the markets they serve, continue to better understand data and its importance, a proper balance between data science and engineering will follow.
It’s becoming clear just how crucial it is that both data engineering and data science be represented, respected, and supported in businesses of all kinds. Such cooperation between these disciplines is particularly crucial in manufacturing and mining—neither of which can afford to have incomplete, dirty, or old data going to their decision-makers.