The attitude of a data scientist
By John Tukey
More than half a century ago, in “The Future of Data Analysis”, John Tukey envisioned a yet-unrecognized field of science concerned with providing answers to realistic problems by means of experimentation, analysis of data, and judgements that may be guided by theory. In 1962 he used the term ‘data analysis’ to describe this field, but today we can recognize it as data science. As part of his seminal paper, John lists what he thinks ought to be the necessary attitudes for a data scientist. Since this is an important part of data science, I thought I would share this part of John’s paper with all data scientists out there. Also, because we are now in the “future” with respect to Tukey, I have taken the liberty of changing all occurrences of ‘data analysis’ to ‘data science’ in his original text. To recover the original document, pipe this text through: sed 's/data science/data analysis/gI'
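For readers who want to run that substitution themselves, here is a minimal sketch. It assumes this post has been saved to a plain-text file (attitude.txt is a hypothetical name), and it relies on the trailing I flag of GNU sed for case-insensitive matching, which is not available in BSD/macOS sed:

# Restore Tukey's original wording; attitude.txt and original.txt are
# hypothetical file names used only for illustration.
sed 's/data science/data analysis/gI' attitude.txt > original.txt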
What are the necessary attitudes? Almost all the most vital attitudes
can be described in a type form: willingness to face up to X. Granted that facing
up can be uncomfortable, history suggests it is possible.
We need to face up to more realistic problems. The fact that normal theory,
for instance, may offer the only framework in which some problem can be tackled
simply or algebraically may be a very good reason for starting with the normal
case, but never can be a good reason for STOPPING there. We must expect to
tackle more realistic problems than our teachers did, and expect our successors to
tackle problems which are more realistic than those we ourselves dared to take on.
We need to face up to the necessarily approximate nature of useful results in
data science. Our formal hypotheses and assumptions will never be broad
enough to encompass actual situations. Even results that pretend to be precise
in derivation will be approximate in application. Consequently we are likely
to find that results which are approximate in derivation or calculation will
prove no more approximate in application than those that pretend to be precise,
and even that some admittedly approximate results will prove to be closer
to fact in application than some supposedly exact results.
We need to face up to the need for collecting the results of actual experience
with specific data-analytic techniques. Mathematical or empirical-sampling studies
of the behavior of techniques in idealized situations have very great value, but
they cannot replace experience with the behavior of techniques in real situations.
We need to face up to the need for iterative procedures in data science. It is
nice to plan to make but a single analysis, to avoid finding that the results of
one analysis have led to a requirement for making a different one. It is also
nice to be able to carry out an individual analysis in a single straightforward
step, to avoid iteration and repeated computation. But it is not realistic to believe
that good data science is consistent with either of these niceties. As we
learn how to do better data science, computation will get more extensive,
rather than simpler, and reanalysis will become much more nearly the custom.
We need to face up to the need for both indication and conclusion in the same
analysis. Appearances which are not established as of definite sign, for example,
are not all of a muchness. Some are so weak as to be better forgotten, others
approach the borders of establishment so closely as to warrant immediate and active following up. And the gap between what is required for an interesting indication and for a conclusion widens as the structure of the data becomes more complex.
We need to face up to the need for a free use of ad hoc and informal procedures
in seeking indications. At those times when our purpose is to ask the data what
it suggests or indicates it would be foolish to be bound by formalities, or by any
rules or principles beyond those shown by empirical experience to be helpful in
such situations.
We need to face up to the fact that, as we enter into new fields or study new
kinds of procedures, it is natural for indication procedures to grow up before the
corresponding conclusion procedures do so. In breaking new ground (new from
the point of view of data science), then, we must plan to learn to ask first of
the data what it suggests, leaving for later consideration the question of what it
establishes. This means that almost all considerations which explicitly involve
probability will enter at the later stage.
We must face up to the need for a double standard in dealing with error rates,
whether significance levels or lacks of confidence. As students and developers of
data science, we may find it worth while to be concerned about small differences
among error rates, perhaps with the fact that a nominal 5 % is really 4 % or 6 %,
or even with so trivial a difference as from 5 % to 4.5 % or 5.5 %. But as practitioners of data science we must take a much coarser attitude toward error
rates, one which may sometimes have difficulty distinguishing 1 % from 5 %,
one which is hardly ever able to distinguish more than one intermediate value
between these conventional levels. To be useful, a conclusion procedure need
not be precise. As working data scientists we need to recognize that this is so.
We must face up to the fact that, in any experimental science, our certainty
about what will happen in a particular situation does not usually come from directly
applicable experiments or theory, but rather comes mainly through analogy between
situations which are not known to behave similarly. Data science has, of
necessity, to be an experimental science, and needs therefore to adopt the attitudes of experimental science. As a consequence our choices of analytical approach will usually be guided by what is known about simpler or similar situations,
rather than by what is known about the situation at hand.
Finally, we need to give up the vain hope that data science can be founded
upon a logico-deductive system like Euclidean plane geometry (or some form
of the propositional calculus) and to face up to the fact that data science is intrinsically an empirical science. Some may feel let down by this, may feel that
if data science cannot be a logico-deductive system, it inevitably falls to the
state of a crass technology. With them I cannot agree. It will still be true that
there will be aspects of data science well called technology, but there will also
be the hallmarks of stimulating science: intellectual adventure, demanding
calls upon insight, and a need to find out "how things really are" by investigation and the confrontation of insights with experience.
-- John Tukey (The Future of Data Analysis) with minor modification by Hatef Monajemi