ISE Magazine August 2018 Volume: 50 Number: 09
By Michel Baudin
From statistics to data science
Among the many terms currently applied to the art of analyzing data, data science is most descriptive of the field as a whole. You also hear of data mining, machine learning, deep learning or big data, all of which are often conflated but describe subsets or applications of data science. Strictly speaking:
- Data mining is the analysis of data collected for a different purpose, as opposed to design of experiments (DOE), where the data is collected specifically for the purpose of supporting or refuting a hypothesis.
- Machine learning is what is done by algorithms that become better at a task as new data accrues. For example, a neural network may be designed to recognize a handwritten “8” and to improve its performance with experience.
- Deep learning doesn’t mean what it says – acquiring deep knowledge about a topic. It designates multiple layers of neural networks where each layer uses the output of the layer below.
- Big data refers to the multiterabyte data sets generated daily in e-commerce, from click-throughs to buying and selling transactions. Manufacturing data sets don’t qualify as big data. True big data is so large that it requires special tools, like Apache’s Hadoop and Google’s MapReduce, and I have never heard either mentioned in a manufacturing setting.
Data science is a broader umbrella term that is, if anything, too broad. Taken literally, it could encompass all of information technology. As used in most publications, data science does not cover data acquisition technology. It kicks in once the data has been collected, and it produces human-readable output to support decisions by humans. Data science does not include the use of data to control a physical process, as in 3-D printing, self-driving cars or CNC (computer numerical control) machines. The exceptions include Li-Ping Chu’s book Data Science for Modern Manufacturing, which is all about how manufacturing should be and perhaps will be. Until then, it is the way it is, and data science, as understood here, is helping to make it better.
When a machine is no longer a machine
The query tools of relational databases are the workhorses of data wrangling, but they are not sufficient, as data do not always come in tables but sometimes in lists of name-value pairs in a variety of formats like JSON or XML that first must be parsed and cross-tabulated. You also need more powerful tools to split “smart” part numbers into their components, identify the meaning of each component and translate values into plain English. And you need even more sophisticated text mining tools to convert free-text comments into formal descriptions of events by category and key parameters.
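The two simpler wrangling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the JSON field names, the sample records, the "family-material-size" part number layout and the lookup tables are all hypothetical, not taken from any real system.

```python
import json
from collections import defaultdict

# Hypothetical extract: measurements arrive as name-value pairs, one per record.
RAW = """
[
  {"unit": "U001", "name": "torque", "value": 12.1},
  {"unit": "U001", "name": "temp", "value": 71.5},
  {"unit": "U002", "name": "torque", "value": 11.8}
]
"""

def cross_tabulate(records):
    """Pivot name-value pairs into one row (a dict of columns) per unit."""
    table = defaultdict(dict)
    for rec in records:
        table[rec["unit"]][rec["name"]] = rec["value"]
    return dict(table)

# Hypothetical "smart" part number layout: family-material-size, e.g. "BRK-AL-0250".
FAMILIES = {"BRK": "bracket", "HSG": "housing"}
MATERIALS = {"AL": "aluminum", "ST": "steel"}

def decode_part_number(part_number):
    """Split a smart part number and translate each component into plain English."""
    family, material, size = part_number.split("-")
    return {
        "family": FAMILIES.get(family, "unknown"),
        "material": MATERIALS.get(material, "unknown"),
        "size_mm": int(size),
    }

rows = cross_tabulate(json.loads(RAW))
part = decode_part_number("BRK-AL-0250")
```

Real part numbering schemes are rarely this regular, which is why a fraction of the records usually resists decoding. Text mining of free-text comments is a much larger job and is not attempted here.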
It doesn’t work perfectly. You may be able to recover only 90 percent or 95 percent of your data, but you then have not only a clean data set but also a set of wrangling tools that can be applied incrementally to new data to enrich this data set. This raises the question of where to keep it. A common approach is to use a special kind of database called a data warehouse, into which you load daily extracts from all the legacy systems after they have been cleaned and properly formatted. They can then be conveniently retrieved for analysis.
The part of the data warehouse that is actually used for analysis may be a small fraction of its content, but you don’t know ahead of time which fraction. As a result, most of the data that is prepared and stored in the warehouse is never used. This has motivated companies with very large data sets, as in e-commerce, to come up with another approach called the data lake, into which you throw data objects from multiple systems in their original formats and prepare them for analysis if and when you have established that they are needed.
Whether a data warehouse or data lake is preferable in an organization is a question of size. With small data sets, the penalty for preparing all data is small when weighed against the convenience of having it ready to use.
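The difference between the two approaches comes down to when the cleaning happens. The toy sketch below makes that concrete. The class names, the record format and the clean function are hypothetical stand-ins, and real warehouses and lakes are of course full storage systems rather than Python lists.

```python
import json

def clean(raw_line):
    """Hypothetical cleaning step: parse one raw record and normalize its fields."""
    rec = json.loads(raw_line)
    return {"unit": rec["unit"].strip().upper(), "value": float(rec["value"])}

class Warehouse:
    """Cleans every record at load time, whether or not it is ever analyzed."""
    def __init__(self):
        self.rows = []

    def load(self, raw_line):
        self.rows.append(clean(raw_line))

    def query(self):
        return list(self.rows)

class Lake:
    """Stores records in their original format and cleans them only when queried."""
    def __init__(self):
        self.objects = []

    def load(self, raw_line):
        self.objects.append(raw_line)

    def query(self):
        return [clean(obj) for obj in self.objects]

raw_extract = ['{"unit": " u001 ", "value": "12.1"}', '{"unit": "u002", "value": "11.8"}']
warehouse, lake = Warehouse(), Lake()
for line in raw_extract:
    warehouse.load(line)
    lake.load(line)
```

Both return the same rows when queried. The warehouse pays the preparation cost up front for everything, while the lake defers it to the records that are actually requested.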
Analyzing the data
With clean data, you are finally at the statistician’s starting point. The first step is always to explore the data with simple summaries and plots of one or two variables at a time, and this is often sufficient to answer many questions. Being a good data scientist is about making the data talk, not about using a particular set of tools.
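A first-pass summary of a single variable needs nothing beyond Python's standard library. The sketch below is one possible way to do it. The function name and the sample values are made up for illustration.

```python
import statistics

def summarize(values):
    """Return a basic one-variable summary: count, extremes, quartiles and mean."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }

# Hypothetical measurements on nine units.
summary = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

A mean that sits far from the median, or a maximum far beyond the third quartile, is often the first hint that a plot is worth making.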
Data science training leaves you with a box full of tools that you don’t necessarily know what to do with, bearing names that are not self-explanatory like k-means clustering, bagging, the kernel trick, random forests and many others. They were developed to solve problems but, to you, they are cures in search of a disease and answers to questions you don’t have. The topical literature fails to answer the three questions British consultant John Seddon recommended asking about any tool:
- Who invented it?
- What problem was he or she trying to solve?
- Do I have this problem?
In data science, when a tool was invented is also essential because its use requires information technology. The tools of the 1920s rely on assumptions about probability distributions to simplify calculations; the ones from the 1990s and later require fewer assumptions and involve multiple simulations.
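The contrast between the two generations of tools can be illustrated with interval estimates. The sketch below sets a classical interval for a mean, which leans on a normality assumption to keep the arithmetic simple, next to a percentile bootstrap interval for a median, which assumes no distribution and instead recomputes the statistic on thousands of resamples. The bootstrap is used here only as one example of a simulation-based method. The function names and the cycle time data are hypothetical.

```python
import random
import statistics

def normal_theory_ci(sample, z=1.96):
    """Classical interval for the mean; assumes the sample mean is roughly normal."""
    mean = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / len(sample) ** 0.5
    return mean - half_width, mean + half_width

def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample with replacement, recompute, take percentiles."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = sorted(
        stat(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    lower = estimates[int(n_resamples * alpha / 2)]
    upper = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical cycle times in seconds, with one outlier.
cycle_times = [41.2, 39.8, 40.5, 43.1, 38.9, 40.0, 55.7, 41.8, 39.5, 40.9, 42.3, 40.2]
mean_interval = normal_theory_ci(cycle_times)
median_interval = bootstrap_ci(cycle_times)
```

The first function is a handful of arithmetic operations. The second evaluates the statistic 2,000 times, which is trivial on a current computer but was not practical by hand.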
You find out, for example, that logistic regression has nothing to do with moving goods and was invented in 1958 by David Cox to predict a categorical outcome from a linear combination of predictors that can be numbers or categories. In manufacturing, it will tell you how relevant the variables and attributes you collect in process are to a finished unit’s ability to pass its final test. If they are not relevant, you may stop collecting them and can look for better ones; if they are relevant, you can modify the final test process to leverage the information these variables provide. Logistic regression can also be used to improve binning operations.
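As a rough illustration of what such a model does, the sketch below fits a logistic regression by plain gradient ascent on the log-likelihood, using one numeric predictor and one category coded as 0 or 1. It is a minimal teaching version, not what a statistics package does internally, and the data, the variable names and the learning settings are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Estimated probability of passing final test; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit the weights by batch gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            error = y - predict(weights, x)
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        weights = [w + learning_rate * g / len(xs) for w, g in zip(weights, gradient)]
    return weights

# Hypothetical in-process data: [standardized torque deviation, night shift (1) or day (0)].
units = [[-2.0, 0], [-1.5, 1], [-1.0, 0], [-0.5, 1], [-0.5, 0], [0.0, 1],
         [0.0, 0], [0.5, 1], [0.5, 0], [1.0, 1], [1.5, 0], [2.0, 1]]
# 1 = unit passed final test, 0 = unit failed.
passed = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

weights = fit_logistic(units, passed)
p_high = predict(weights, [2.0, 0])
p_low = predict(weights, [-2.0, 0])
```

A fitted weight far from zero suggests that the corresponding variable carries information about the test outcome. Judging relevance properly also requires standard errors, which a statistics package reports and this sketch does not compute.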
That it’s from 1958 tells you that using it on a data set with 20,000 points and 15 predictors is unlikely to overtax a 2018 laptop or tablet. In this particular case, the name of David Cox does not add much information because he was a theoretician, as opposed to others who worked on specific applications, like W. Edwards Deming in manufacturing quality or Brad Efron in epidemiology.
You may ask what your problems have in common with epidemiology. Not only are you likely to find that you have no use for many of the tools in the published data science toolboxes but also that you have problems none of them address. Whether it is about demand, bookings and billings or technical product characteristics, manufacturing data come in the form of time series. There are many tools for visualizing, analyzing, modeling and controlling time series, but they are simply not on the data science lists.
Once you have established that a tool may be useful to you, you need to learn how to use it. You don’t need to plough through the underlying math any more than a car driver needs to understand the theory of engines. It can remain a black box to you, but you still need to know how to feed it data, what the various settings do and how to interpret the output. By itself, this is not a trivial investment in time and effort and needs to be done selectively.
The presentation of results to stakeholders who are not data scientists is past the statistician’s end point. The results are moot unless they can be communicated to decision-makers in a clear and compelling fashion.
The art of generating reports, slide sets, infographics and performance boards is not taught in statistics courses and not covered in statistics textbooks. It is often entrusted either to engineers who are poor communicators or to graphic artists who do not understand the technical content and produce charts that decorate rather than inform or persuade.
In business, the report, with a narrative in complete sentences and annotated charts, is a dying art, replaced by the slide set with bullet points that are not sentences and graphics that are limited to 3-D pie charts and stacked-bar charts. When reports are produced, they are expected to fit on a single A3 or 11-by-17-inch page.
This works for many activities, but data science isn’t one of them. With slides and A3s alone, you can gloss over gaps in logic that writing a report would expose and prompt you to fill. Slides are useful as visual aids for oral presentations, and A3s as summaries, but only as supplements to a fully baked, objective and rigorous statement of analysis and results, expressed in layman’s terms and with all appropriate nuances and caveats.
That executives are “too busy” to read reports is only true for reports that haven’t been designed to be read by busy executives. An executive always has the time to read a one-page summary – possibly an A3 – and spot-check the research behind the conclusions at three locations within the report. Reading it cover to cover is not usually necessary, particularly if the report has been designed with this use in mind.
The communication of data science is heavily graphic. Rather than limit themselves to a small set of standard charts that have been used in manufacturing for a century, engineers should expand their horizon, use more types of charts, embed them in infographics and leverage the insights of a researcher like U.S. statistician Edward Tufte. In addition, when a report is produced in electronic form, illustrations are not limited to still images. Swedish statistician Hans Rosling’s Trendalyzer, for example, has an animation that shows a scatterplot changing over time. A histogram can also come with a slider bar to allow the reader to instantly see the effect of changing bin sizes.
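What a bin-size slider does behind the scenes is simply recount the same data with a different bin width. The sketch below shows that recount under simple assumptions: bins are anchored at the smallest value, the largest value is placed in the last bin, and the function name and data are hypothetical.

```python
import math

def histogram(values, bin_width):
    """Count values per bin of the given width, with bins starting at the minimum."""
    start = min(values)
    n_bins = max(1, math.ceil((max(values) - start) / bin_width))
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls in the last bin.
        index = min(int((v - start) / bin_width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements, recounted at two slider positions.
measurements = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
narrow_bins = histogram(measurements, bin_width=2)
wide_bins = histogram(measurements, bin_width=4)
```

With narrow bins the isolated high value stands apart from the rest. With wide bins it merges into a neighboring bin, which is exactly the kind of effect a reader can explore with the slider.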
The reports that are vanishing in business live on in academic papers, with abstracts in place of executive summaries. In many fields, these papers are, in fact, data science reports, and they are not without challenges. First, academia’s review process does not always work. “Growth in a Time of Debt,” for example, an influential 2010 paper by Harvard economists, was exposed in 2013 by students as containing calculation errors.
Second, when an academic paper is cited, the conclusions are often amplified beyond recognition. This is how a lighting study conducted on just five women assembling relays at Western Electric’s Hawthorne plant in the late 1920s spawned the belief in a “Hawthorne effect” that makes all the workers of the world more productive when management pays attention to them.
Data scientists cannot prevent journalists, politicians or even work colleagues from oversimplifying and distorting their work, but it behooves them to speak up when it happens. They are responsible for the quality of the work, including not only sound analytics but effective communication as well.
Better tools, better data, a better future
The software toolkit of most engineers and managers in manufacturing is limited to Excel and PowerPoint, with the addition of Minitab for Six Sigma black belts.
These tools don’t cut it for data science, and there are plenty of alternatives for all stages, from data wrangling to analysis and presentation. Some tools are free, powerful and reliable but require a high level of skills from users. Others are “for everyone” and available for fees. Regardless of what data tools you choose, the main investment is in learning to apply them. In that respect, data science and its tools are analogous to the manufacturing sector’s production machinery.