On Exactitude and Messiness in (Big) Science (Data)

The stories of Jorge Luis Borges, who tellingly spent most of his life as a librarian, read like thought experiments (Stephen Ramsay would approve: “an impossible fantasy constructed for the purpose of divining the possibilities of the real” [23]), and seem to crop up again and again in discussions of information, representation, and hypertextuality. As I’m reading Viktor Mayer-Schönberger and Kenneth Cukier’s Big Data, the Borges story that comes to mind is “On Exactitude in Science,” a paragraph-long fable you can quickly read if you haven’t already. Jean Baudrillard uses the story to open Simulacra and Simulation, an appropriately fake copy of which appears in the Wachowskis’ The Matrix (where Neo keeps a cracked disc he sells on the black market), which also borrows from Baudrillard’s summary the phrase “Desert of the Real,” subsequently used by Slavoj Žižek in Welcome to the Desert of the Real. In his widely cited article “Database as a Genre of New Media,” Lev Manovich reworks the story for databases:

“Jorge Luis Borges’s story about a map which was equal in size to the territory it represented became re-written as the story about indexes and the data they index. But now the map has become larger than the territory. Sometimes, much larger. Pornographic Web sites exposed the logic of the Web to its extreme by constantly re-using the same photographs from other pornographic Web sites. Only rare sites featured the original content. On any given date, the same few dozen images would appear on thousands of sites. Thus, the same data would give rise to more indexes than the number of data elements themselves.” (181)

My opinions on Manovich have shifted somewhat since discussing the shortfalls of his “database and narrative are natural enemies” (181) formula in light of responses surrounding Ed Folsom’s “Database as Genre,” but Mayer-Schönberger and Cukier’s Big Data compels me to wonder how Borges’s story would be “re-written” for Big Data.

Mayer-Schönberger and Cukier argue Big Data is defined by three features: there’s “More” data, it’s “Messy,” and it privileges “Correlation” (what) over causation (why). I bring up Borges’s “On Exactitude in Science” in response to the second feature: “a willingness to embrace data’s real-world messiness rather than privilege exactitude” (19). Ironically, the authors are consistently inexact about this exactitude-messiness distinction. To illustrate it, they compare three different scenarios, two non-Big Data and one Big Data, “to measure the temperature in a vineyard” (34).

Scenario 1 (Non-BD): “If we have only one temperature sensor for the whole plot of land, we must make sure it’s accurate and working at all times: no messiness allowed” (34).

This scenario is misleading. No system is 100% efficient, no matter how accurate or consistently operational. If you have one temperature sensor for a “whole plot of land,” your reading will never be 100% accurate relative to the whole, and its accuracy will vary with the size of the plot, whose temperature is not going to be consistent throughout. This is the same mistake people make when they listen to a “20% chance of showers” forecast. It actually means there is a 20% average probability of rain in all the subregions analyzed. So in one neighbourhood of your city there could be an 8% chance of rain, and 80% in another, but there is no way of knowing from the shorthand recited to you by newscasters. Moreover, the kind of sensor used is going to matter; Wikipedia lists six different kinds of “temperature sensors”; the authors don’t even specify if we’re dealing with a mechanical instrument or an electrical one. This additional information matters; the authors shouldn’t be using such a hypothetical scenario, especially when they’re trying to inform us about differences in information yield across methodologies.
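To make the forecast arithmetic concrete, here is a minimal sketch in Python. It follows the reading of the forecast given above (a simple, equally weighted average over subregions), and the neighbourhood names and probabilities are invented for illustration, chosen only so that they average to 20%.

```python
import statistics

# Hypothetical per-neighbourhood chances of rain (invented figures, not real forecast data).
neighbourhood_pop = {
    "north": 0.08,
    "harbour": 0.80,
    "east": 0.05,
    "west": 0.05,
    "south": 0.02,
}

# Assumption: every neighbourhood counts equally toward the citywide figure.
citywide = statistics.fmean(neighbourhood_pop.values())
spread = max(neighbourhood_pop.values()) - min(neighbourhood_pop.values())

print(f"citywide chance of rain: {citywide:.0%}")
print(f"spread hidden by the average: {spread:.0%}")
```

The single citywide figure is arithmetically correct and still says nothing about whether you happen to live in the 80% neighbourhood or the 2% one.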

Scenario 2 (non-BD): “In contrast, if we have a sensor for every one of the hundreds of vines, we can use cheaper, less sophisticated sensors (as long as they do not introduce a systematic bias). Chances are that at some points a few sensors may report incorrect data, creating a less exact, or ‘messier,’ dataset than the one from a single precise sensor. Any particular reading may be incorrect, but the aggregate of many readings will provide a more comprehensive picture. Because this dataset consists of more data points, it offers far greater value that likely offsets its messiness” (34).

First, the authors imply that because the sensors are less sophisticated, more of them will report incorrect data. The only reason we did not see this phenomenon in Scenario 1, then, is that we had only one sensor, which, relative to the other elements in its set of one, was 100% accurate. However, as discussed, in reality no sensor is 100% accurate. This bit about incorrect data therefore amounts to only a general statistical observation, that a large set of objects is likely to include variations. The “single precise sensor,” as one sensor among many other possible choices of sensor, is no less subject to such variation; the authors just didn’t think this was the case because they were blind to the blind spots of that sensor’s precision, which they meaninglessly decided to assess self-relatively as 100%. By adding (relative to Scenario 1) two variables at once, more sensors and less sophisticated sensors, the authors needlessly confuse a relative loss in sensory quality with the constant loss of sensory quality (noise, blind spots) inherent to any information system. This produces the illusion that the resulting data is “less exact” or “messier” only because of a loss of quality of the sensors, rather than also (and nonetheless) because more information (from more sensors) now needs to be made sense of.

As a result, the “incorrect[ness]” of “[a]ny particular reading” is overstated as an absolute value instead of as a value relative to all other readings in the dataset, and the “aggregate of many readings” is overstated as “more comprehensive” without defining “comprehensive” (they seem to use it as a synonym for “precise” and “accurate”), and therefore without specifying the increased level of abstraction (again, the constant of noise) that results from increased aggregation. We are no longer dealing with one data point, absolutely stated, but with a measure of central tendency abstracted from multiple and relative data points: a measure, therefore, no less subject to blind spots, however much more “comprehensive.” This results in meaningless or messy statements like the biased “as long as they do not introduce a systematic bias” (their lesser degree of sophistication would itself already constitute a systematic bias!), and “greater value…likely offsets its messiness.”
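The difference between a single reading and a measure of central tendency can be sketched with a toy simulation. Everything in it is an assumption made for illustration: a plot whose true temperature rises linearly from 18 to 22 °C, a single sensor with zero instrument error placed at the first vine, cheap sensors with independent Gaussian noise, and a hypothetical shared offset standing in for a systematic bias. It does not settle the argument above; it only shows which errors aggregation can and cannot absorb under those assumptions.

```python
import random
import statistics


def simulate(n_vines=300, noise_sd=1.0, shared_offset=1.5, seed=0):
    """Compare one error-free sensor with many cheap ones on a hypothetical plot."""
    rng = random.Random(seed)

    # Assumed ground truth: temperature rises linearly across the plot.
    true_temps = [18.0 + 4.0 * i / (n_vines - 1) for i in range(n_vines)]
    plot_mean = statistics.fmean(true_temps)

    # Scenario 1: a single sensor with zero instrument error at the first vine.
    single_reading = true_temps[0]

    # Scenario 2: one cheap sensor per vine, each with independent random noise.
    cheap = [t + rng.gauss(0.0, noise_sd) for t in true_temps]

    # The same cheap sensors, all reading high by a shared offset (systematic bias).
    biased = [r + shared_offset for r in cheap]

    return {
        "single_sensor_error": single_reading - plot_mean,
        "cheap_mean_error": statistics.fmean(cheap) - plot_mean,
        "biased_mean_error": statistics.fmean(biased) - plot_mean,
        "worst_cheap_reading_error": max(
            abs(r - t) for r, t in zip(cheap, true_temps)
        ),
    }


results = simulate()
for name, value in results.items():
    print(f"{name}: {value:+.2f}")
```

Under these assumptions the error-free single sensor is about two degrees off the plot-wide mean purely because of where it sits, the average of the noisy sensors lands close to that mean even though individual readings can be far off, and the shared offset passes through the average untouched. The average is also exactly that, an abstraction: it reports one number for a plot that has no single temperature.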

Scenario 3 (BD) is Scenario 2, plus an increased frequency of sensory readings. Again, loss of quality arising from degree of sensory sophistication (here, they also add in issues of information transmission, which amounts to the same thing, just at a different point in the process) is conflated with issues arising from perspectival limitations, to the suppression of the latter. The authors state that, because of noise, “[t]he information will be a bit less accurate, but its great volume makes it worthwhile to forgo strict exactitude” (34). It is true that a better ratio of information to noise results in a clearer message, but this has nothing to do with forgoing the chimera of “strict exactitude”; it is only the same situation we’ve been in all along, with each scenario. The “strict exactitude” of the first scenario only arises because both “volume” and “exactitude” are taken to equal one. The authors are so noisy in communicating their ideas that they miss their own information-theoretic joke: “[t]he information will be a bit less accurate” is more than one bit inaccurate.

They then conclude about Scenarios 2 and 3 as follows: “In the first example, we sacrificed the accuracy of each data point for breadth, and in return we received detail that we otherwise could not have seen. In the second case, we gave up exactitude for frequency, and in return we saw change that we otherwise would have missed” (35). By “breadth,” they mean sample size: more sensors meant more data; but by “frequency,” don’t they also mean sample size? More sensors measuring more frequently means a larger sample (or higher sample rate). It thus seems to me that, contrary to their three special features of Big Data, the only difference between these three scenarios, as they’ve articulated it, is sample size. (In this, I agree with Ted Underwood that Big Data is about scale and scale is relative. So, just as Digital Humanities will one day be called humanities, Big Data will probably just be called Data, once Earth becomes but one data point among many.) And even then, they misunderstand sample size, because in Scenario 1, they think that a sample size of 1 will lead to the most “precise,” “accurate,” or “exact” results (terms which, so interchanged, cease to describe themselves).
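A back-of-the-envelope sketch of that point: if the number of observations is simply sensors multiplied by readings per sensor, then “breadth” and “frequency” are two factors of the same quantity. The counts below are hypothetical; the book says only “hundreds of vines” and gives no figure for how often readings are taken.

```python
def sample_size(sensors, readings_per_sensor):
    """Total number of observations collected over one measurement period."""
    return sensors * readings_per_sensor


# Hypothetical counts, chosen only to illustrate the three scenarios.
scenarios = {
    "Scenario 1 (one sensor, one reading)": sample_size(1, 1),
    "Scenario 2 (more sensors)": sample_size(300, 1),
    "Scenario 3 (more sensors, more often)": sample_size(300, 60),
}

for label, n in scenarios.items():
    print(f"{label}: n = {n}")
```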

Even when, as they elsewhere claim, Big Data measures a population, not a sample (a purview not unique to Big Data), it can only do so with the same biases inherent to any information system. Even if you’re measuring a population, the problem of induction means that you will never know when you’ve actually obtained a “complete” population (you never will, except arbitrarily), and that population will only measure a sample of parameters, from a sample of endless possibilities. While the authors at least acknowledge that “messiness is not inherent to big data” but “a function of the imperfection of the tools we use to measure, record, and analyze information” (41), they also make claims like “Moving into a world of big data will require us to change our thinking about the merits of exactitude” (40) (maybe admitting the inexactitude of human methods has shock value for the popular and business audience they’re writing for…), “Big data transforms figures into something more probabilistic than precise” (35) (statistics in general is a probabilistic science), “Processing big data entails an inevitable loss of information” (45) (information loss is inevitable, period), and “In many areas of technology and society, we are leaning in favor of more and messy over fewer and exact” (42) (fewer and exact are relative terms, and just as messy depending on how they’re used).

But lastly, and most problematically, this kind of rhetoric, together with a lack of understanding of the inherent biases of perspective and information systems, leads the authors to grandiose and paradoxical claims about Big Data’s access to “the totality of information,” though through “slackness instead of exactness” (47); that “Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy” (48); and that as it “become[s] a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of n=all of the mind” (48).

John Law, in an article on messiness called “Making a Mess With Method,” would characterize such viewpoints as typical of the common-sense realist, who assumes that there is a world “out there” that is definite, singular, and anterior to and independent of any viewpoint on it (5-6). But in reality, things are messy, and some “things are at least sometimes vague and can only be known vaguely” (5).

This all makes me think that the Big Data version of Borges’s “On Exactitude in Science” may simply be the same story: n=all.

Non-Digital Works Cited

Mayer-Schönberger, Viktor and Kenneth Cukier. Big Data: A Revolution That Will Change How We Live, Work and Think. New York: Mariner, 2014. Print.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Chicago: University of Illinois Press, 2011. Print.
