28 March, 2020 #data science

Data science is not a science and that's OK

While my focus has largely shifted to web development, I used to be more involved in data-science-y stuff, both at work and while studying in my own time. I still occasionally and casually consume relevant material, be it a blog article, a YouTube video, or something else.

Recently, I came across an episode of the Practical AI podcast, produced by Changelog, titled "What exactly is 'Data Science' these days?" While I mostly enjoyed listening to the episode, it reminded me once again that I am no fan of the term "data science".

So, the following are my personal thoughts on "data science":

Data science is not a science

In my view, there is something fundamentally misleading about the term "data science". Maybe it's just an aspirational name or a cynical marketing pitch to appear more authoritative. Either way, despite "science" in its name, the goal of "data science" is not scientific.

In practical terms, "data science" is about exploiting modern computational tools and advanced algorithms to efficiently extract information from larger and more diverse datasets (and skillfully present that information). In doing so, "data science" adopts certain elements of scientific research, like hypothesis testing.

Nonetheless, it puts little emphasis on the collective labor of formulating and advancing humankind's understanding of reality, which, in my mind, is key to any scientific enterprise. And when "data science" relates to science and scientific knowledge, it does so as a consumer rather than a contributor.

This is not to say that "data science" practitioners cannot participate in scientific work or contribute to scientific literature. In fact, many accomplished "data scientists" were trained at top universities as scientific researchers. I'm sure some of them are active participants in the scientific process.

In the end, however, their presence and influence in the "data science" world may have exacerbated the misconceptions evoked by the name "data science," which in itself has little to do with contributing to the growing body of scientific knowledge.

What data science does

In practice, "data science" refers to a broad range of arts and crafts for handling and analyzing data, as well as communicating the results of that analysis. Its goal is almost always tied to better decision making, mostly in a business context.

Some people seem to attach the name "data science" to specific software tools or statistical models, e.g. one must write Python code to fit fancy machine learning models to be considered a proper "data scientist." My impression is that this kind of argument is most often pushed by online courses sold to students, but such a requirement is totally unnecessary.

That is why it is possible to argue that one can do "data science" using Microsoft Excel or by writing SQL queries. Dashboarding tools like Tableau and Microsoft Power BI also claim their places in "data science." Some may not like it, but it seems that you are doing "data science" as long as you can dig insights out of data.

Of course, in more advanced settings, the work of "data science" requires modern and powerful computational solutions in order to work with a truly large volume and variety of data efficiently. To make it work, you need distributed computing, complex machine learning models, robust data pipeline infrastructure, and all that jazz. Still, these cases are rare and many "data scientists" continue to work with maybe a few tens of thousands of records at a time.

What data science is

I ultimately consider "data science" a rebranding of business intelligence/analytics.

It is a rebranding because it does not bring anything fundamentally new to the table despite its marketing pitch. What has changed is, rather, the environment in which more data (both in terms of quantity and format) and more powerful tools (both software and mathematical/computational constructs) are now made available. We may see this as the evolution of an existing field, but not something radically new and certainly not a new branch of science.

It is business intelligence because the goal is better decision making. In this sense, if a fitted machine learning model is a product in itself, or constitutes the core part of a product, it is no longer "data science." However, collecting the data generated by the use of that product and analyzing that data to improve the product or service do fall within the realm of "data science."

And that's OK

None of this is meant to dismiss "data science" and what it has to offer. "Data science" does not need to pretend to be a "science" to be great. Modern science relies on mathematical language and concepts, but it is not mathematics, either. Does this mean science is somehow inferior to mathematics? No. They are just different.

And what's wrong with better business intelligence? Decision making is an important part of life, and if we can employ modern technologies in a disciplined manner to improve that, it can do wonders.

The name "data science" is everywhere already. And maybe the "science" in its name helped it become widely popular. But now that it is no longer the latest hype in the market, I'd like to see some more introspection and maybe another rebranding--this time, for the sake of sanity and self-respect.