Data science or results science: are we shifting away from statistics?
When looking for the answer to a problem, is it necessary to understand the process that leads to that answer? The school of data science seems to take no notice of the question, while the school of statistical science makes it its goal. This stumbling block is at the centre of the debate about the relationship between the two disciplines. What links, differences and complementarities exist between them? These are the questions that an international team of researchers from various specialities is trying to answer.
Data is ever more present in our daily lives, be it in health, social networks, industry, logistics, video games or elsewhere. No sector can escape it. The times are adapting to these changes, and science is doing the same. Data science is gradually supplanting statistics as the means of processing this flow of information. It has reached such a point that the two disciplines now compete with each other, both in academia and in industry. Their relationship is a burning issue.
Nicolas Vandeput, an independent consultant who specialises in inventory management and sales forecasting, and an academic at CentraleSupélec, is trying to identify the main differences between these two disciplines and any real complementarities, with the support of an international team of statistics researchers and data scientists. Using a conceptual analysis, the team has concluded that statistics and data science are growing ever closer. The team has also highlighted their main limitations and outlined the way forward for data analysis.
A lack of consensus
Percentages, sampling, hypothesis testing and more: statistics studies the possible correlations and causal links between several variables. It gives meaning to a set of primary data. Data science has the same goal but, unlike statistics, it can take the analysis of data further. Doing so requires different tools, such as machine learning. This is an approach in which a mathematical model is trained on data so that a computer learns and improves autonomously. The aim is to make predictions. Analysing a lot of data by hand would require many people, and machine learning lightens the load. The difference between the two disciplines could end there, but several grey areas remain. Does this mean that data science and statistics never come together? Are their tools so far removed from one another? Despite their different approaches, do these two disciplines tend to supply the same information?
By reviewing the different arguments presented in scientific research, Nicolas Vandeput and his colleagues have identified two distinct positions. For some, the two disciplines amount to the same thing and statistics is a (key) part of data science. The use of many statistical tools in data science is proof of this. For others, the two disciplines are entirely separate and cannot be compared, and work in data science requires neither the use nor knowledge of statistical models.
Even the scientific nature of data science is sometimes called into question: although data and a specific methodology are scientific components, they do not necessarily make up a new discipline. Compared to statistics, data science is a new field of research. “Think of it this way: statistical science remained quite stable for many decades, and then, all of a sudden, data science turned up,” says Nicolas Vandeput. Since then, the role and importance of these two disciplines have been the subject of debate, and no consensus has been reached.
Data science: a fledgling science
Every era has its own needs. The previous era was one in which data was rare and the challenge was to extract as much information from it as possible. Our current one, by contrast, is marked by an abundance of data. Data science appeared to address new analysis needs for which statistics alone was technically unsuitable. This is reflected in the ‘four V rule’ which defines Big Data: Volume, which calls for collecting a large amount of data; Value, which determines the added value of the result; Veracity, which questions the reliability of the data; and Velocity, which is the ability to process a large amount of information in a short amount of time. Since data science has so many advantages, why is there still competition with statistics?
There are several limitations to data science. For example, it struggles to work on small samples or to conduct significance testing. The ‘Black Swan’ is also a major obstacle. This concept refers to events that are impossible to predict using a data science model. ‘Over-learning’ and ‘under-learning’ also skew the model: either it is fitted too closely to the training data, which hurts its performance on new data, or it is not trained enough to pick up the trends in the data in question. Lastly, the interpretability of the model is key to being able to explain and understand the decisions that have been made. Yet data science is technically unable to do this: “The neural network cannot explain itself. It cannot provide explanations and it is not a set of simple logical rules,” explains Nicolas Vandeput. The major limitation here is not the effectiveness of the models, but rather the sense of trust that they generate.
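The ‘over-learning’ and ‘under-learning’ described above can be illustrated with a short Python sketch. It does not come from the researchers’ work: the noisy sine-curve data, the nearest-neighbour model and all function names are assumptions chosen purely for illustration. The model predicts a value by averaging the k closest training points, so a very small k follows the training data too closely and a very large k ignores the trend altogether.

```python
import math
import random


def make_data(n, rng):
    """Draw n noisy samples of a sine curve (the underlying trend)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return sum(train_y[i] for i in order[:k]) / k


def mean_squared_error(train_x, train_y, xs, ys, k):
    """Average squared gap between the predictions and the observed values."""
    total = sum((knn_predict(train_x, train_y, x, k) - y) ** 2
                for x, y in zip(xs, ys))
    return total / len(xs)


rng = random.Random(0)  # fixed seed so that the run is repeatable
train_x, train_y = make_data(60, rng)   # data the model learns from
test_x, test_y = make_data(200, rng)    # new data it has never seen

results = {}
for k in (1, 7, 60):  # too flexible, intermediate, too rigid
    train_err = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_err = mean_squared_error(train_x, train_y, test_x, test_y, k)
    results[k] = (train_err, test_err)
    print(f"k={k:2d}  training error={train_err:.3f}  new-data error={test_err:.3f}")
```

With k=1 the model reproduces its training data perfectly but does worse on new data, which is over-learning. With k=60 it predicts the same overall average everywhere and misses the trend on both sets, which is under-learning. An intermediate value such as k=7 would typically give the lowest error on new data, although the exact figures depend on the simulated sample.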
The team of researchers concluded that these two disciplines do, in fact, tend to work together. Beyond the differences in their origins, methodologies and tools, statistics and data science share the same momentum and the same aim: furthering data analysis and making predictions. Their complementarity is still developing and depends on how the two disciplines evolve. Data science is still limited but is progressing quickly. “In the field of Machine Learning, models from before 2015 are already out of date,” says Nicolas Vandeput. Statistics is part of this progress too, because it is used in the analysis: to structure data, to examine them in depth and to evaluate their validity. Interpretability, the ability to explain the process which leads to a result, is as yet still out of reach in data science. Will we one day be able to envisage a real union between these two disciplines? A marriage between Big Data’s techniques and interpretability would make lovely children.
- Hassani, Hossein et al. 2021. “The Science of Statistics versus Data Science: What Is the Future?” Technological Forecasting and Social Change, 173: 121111.