On Exactitude and Messiness in (Big) Science (Data)

The stories of Jorge Luis Borges, who tellingly spent most of his life as a librarian, read like thought experiments (Stephen Ramsay would approve: “an impossible fantasy constructed for the purpose of divining the possibilities of the real” [23]), and they crop up again and again in discussions of information, representation, and hypertextuality. As I’m reading Viktor Mayer-Schönberger and Kenneth Cukier’s Big Data, the Borges story that comes to mind is “On Exactitude in Science,” a paragraph-long fable you can quickly read if you haven’t. Jean Baudrillard uses the story to open Simulacra and Simulation, an appropriately fake copy of which appears in the Wachowskis’ The Matrix (where Neo keeps a cracked disc he sells on the black market), which also borrows from Baudrillard’s summary the phrase “Desert of the Real,” subsequently used by Slavoj Žižek in Welcome to the Desert of the Real. In his widely cited article “Database as a Genre of New Media,” Lev Manovich reworks the story for databases:

“Jorge Luis Borges’s story about a map which was equal in size to the territory it represented became re-written as the story about indexes and the data they index. But now the map has become larger than the territory. Sometimes, much larger. Pornographic Web sites exposed the logic of the Web to its extreme by constantly re-using the same photographs from other pornographic Web sites. Only rare sites featured the original content. On any given date, the same few dozen images would appear on thousands of sites. Thus, the same data would give rise to more indexes than the number of data elements themselves.” (181)

My opinions on Manovich have shifted somewhat since discussing the shortfalls of his “database and narrative are natural enemies” (181) formula in light of responses surrounding Ed Folsom’s “Database as Genre,” but Mayer-Schönberger and Cukier’s Big Data compels me to wonder how Borges’s story would be “re-written” for Big Data.

Mayer-Schönberger and Cukier argue that Big Data is defined by three features: there’s “More” data, it’s “Messy,” and it privileges “Correlation” (what) over causation (why). I bring up Borges’s “On Exactitude in Science” in response to the second feature: “a willingness to embrace data’s real-world messiness rather than privilege exactitude” (19). Ironically, the authors are consistently inexact about this exactitude-messiness distinction. To illustrate, they compare three scenarios (two non-Big Data, one Big Data) for how “to measure the temperature in a vineyard” (34).

Scenario 1 (Non-BD): “If we have only one temperature sensor for the whole plot of land, we must make sure it’s accurate and working at all times: no messiness allowed” (34).

This scenario is misleading. No measuring system is 100% accurate, no matter how well calibrated or consistently operational. If you have one temperature sensor for a “whole plot of land,” your reading can never be 100% accurate relative to the whole: its error will vary with the size of the plot, whose temperature is not consistent throughout. This is the same mistake people make when they hear a “20% chance of showers” forecast. It actually means something like a 20% average probability of rain across all the subregions analyzed. So in one neighbourhood of your city there could be an 8% chance of rain, and in another 80%, but there is no way of knowing from the shorthand recited by newscasters. Moreover, the kind of sensor used matters; Wikipedia lists six different kinds of “temperature sensors,” and the authors don’t even specify whether we’re dealing with a mechanical instrument or an electrical one. Such information matters; the authors shouldn’t lean on so underspecified a hypothetical, especially when they’re trying to inform us about differences in information yield across methodologies.
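To make that averaging concrete, here is a toy sketch (the neighbourhood figures are invented, and real probability-of-precipitation calculations are more involved than a simple mean):

```python
# Invented neighbourhood-level rain probabilities for one city:
subregions = {"north": 0.08, "harbour": 0.80, "south": 0.04,
              "east": 0.04, "west": 0.04}

# The single figure recited on air is, on this simplified account,
# just the average over the subregions:
forecast = sum(subregions.values()) / len(subregions)
print(f"{forecast:.0%} chance of showers")  # -> 20% chance of showers
```

The one number is perfectly exact as an average and nearly useless as a local prediction: the 8%-versus-80% spread beneath it has been averaged away.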

Scenario 2 (Non-BD): “In contrast, if we have a sensor for every one of the hundreds of vines, we can use cheaper, less sophisticated sensors (as long as they do not introduce a systematic bias). Chances are that at some points a few sensors may report incorrect data, creating a less exact, or ‘messier,’ dataset than the one from a single precise sensor. Any particular reading may be incorrect, but the aggregate of many readings will provide a more comprehensive picture. Because this dataset consists of more data points, it offers far greater value that likely offsets its messiness” (34).

First, the authors imply that because the sensors are less sophisticated, more of them will report incorrect data. The only reason we did not see this phenomenon in Scenario 1 is that we had only one sensor, which, relative to the other elements in its set of one, was 100% accurate. But as discussed, no real sensor is 100% accurate. The point about incorrect data therefore amounts to a general statistical observation, that a large set of objects is likely to include variations, and the “single precise sensor,” as one sensor among many possible choices of sensor, is no less subject to it; the authors just didn’t notice, because they were blind to the blindspots of a precision they had, meaninglessly, assessed self-relatively as 100%.

By changing two variables at once relative to Scenario 1 (more sensors AND less sophisticated sensors), the authors confuse a relative loss in sensor quality with the constant loss of quality (noise, blindspots) inherent to any information system. This produces the illusion that the resulting data is “less exact” or “messier” only because the sensors are worse, rather than also (and regardless) because more information, from more sensors, now has to be made sense of. As a result, the “incorrect[ness]” of “[a]ny particular reading” is overstated as an absolute value rather than a value relative to all the other readings in the dataset, and the “aggregate of many readings” is declared “more comprehensive” without defining “comprehensive” (they seem to use it interchangeably with “precise” and “accurate”) and without acknowledging the increased level of abstraction (again, the constant of noise) that aggregation entails: we are no longer dealing with one data point, absolutely stated, but with a measure of central tendency abstracted from multiple, relative data points, a measure no less subject to blindspots however more “comprehensive.” Hence meaningless or messy statements like the biased “as long as they do not introduce a systematic bias” (their lesser degree of sophistication would itself already constitute a systematic bias!) and “greater value…likely offsets its messiness.”
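The point can be simulated. Below is a minimal sketch (all numbers are invented; “precise” and “cheap” are modelled crudely as small versus large Gaussian noise) comparing one precise sensor at a single spot against a cheap, noisy sensor at every spot, with both judged against the field-wide mean temperature:

```python
import random

random.seed(1)

# Hypothetical vineyard: the true temperature varies across 500 spots.
true_temps = [20 + random.gauss(0, 2) for _ in range(500)]
field_mean = sum(true_temps) / len(true_temps)

# Scenario 1: one "precise" sensor (tiny noise) at a single spot.
# Exact relative to itself, but not relative to the field.
single_reading = true_temps[0] + random.gauss(0, 0.1)

# Scenario 2: a cheap, noisy sensor (large noise) at every spot.
noisy_readings = [t + random.gauss(0, 1.0) for t in true_temps]
aggregate = sum(noisy_readings) / len(noisy_readings)

print(f"field mean:            {field_mean:.2f}")
print(f"single-sensor error:   {abs(single_reading - field_mean):.2f}")
print(f"noisy-aggregate error: {abs(aggregate - field_mean):.2f}")
```

The single “exact” sensor typically misses the field mean not because it is imprecise but because its one location is a blindspot; and the aggregate that wins is itself an abstraction, a mean, a temperature measured nowhere in particular.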

Scenario 3 (BD) is Scenario 2 plus an increased frequency of sensor readings. Again, the loss of quality arising from the sensors’ degree of sophistication (here the authors also add issues of information transmission, which amount to the same thing, just at a different point in the process) is conflated with issues arising from perspectival limitations, to the suppression of the latter. The authors state that, because of noise, “[t]he information will be a bit less accurate, but its great volume makes it worthwhile to forgo strict exactitude” (34). It is true that a better ratio of information to noise results in a clearer message, but this has nothing to do with forgoing the chimera of “strict exactitude”; it is only the same situation we’ve been in all along, in each scenario. The “strict exactitude” of the first scenario arises only because both “volume” and “exactitude” are taken to equal one. The authors are so noisy in communicating their ideas that they miss their own information-theoretic joke: “[t]he information will be a bit less accurate” is more than one bit inaccurate.

They then conclude about Scenarios 2 and 3 as follows: “In the first example, we sacrificed the accuracy of each data point for breadth, and in return we received detail that we otherwise could not have seen. In the second case, we gave up exactitude for frequency, and in return we saw change that we otherwise would have missed” (35). By “breadth,” they mean sample size: more sensors meant more data; but by “frequency,” don’t they also mean sample size? More sensors measuring more frequently means a larger sample (or a higher sample rate). It thus seems to me that, contrary to their three special features of Big Data, the only difference between these three scenarios, as they’ve articulated them, is sample size. (In this, I agree with Ted Underwood that Big Data is about scale, and scale is relative. So, just as Digital Humanities will one day be called the humanities, Big Data will probably just be called Data, once Earth becomes but one data point among many.) And even then, they misunderstand sample size, because in Scenario 1 they think that a sample size of one will lead to the most “precise,” “accurate,” or “exact” results (terms which, so interchanged, cease to describe themselves).
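Here is a quick sketch of that “it’s all sample size” claim, assuming for simplicity a static true temperature (numbers again invented): the error of a mean shrinks with the total number of readings n (on the order of σ/√n), regardless of whether n comes from more sensors (“breadth”) or more frequent readings (“frequency”):

```python
import random
import statistics

random.seed(2)

def estimate(n_sensors, n_readings, noise=1.0, true_temp=20.0):
    """Mean of n_sensors * n_readings noisy readings of one true value."""
    readings = [true_temp + random.gauss(0, noise)
                for _ in range(n_sensors * n_readings)]
    return statistics.mean(readings)

# "Breadth": 1000 sensors, one reading each.
# "Frequency": 100 sensors, ten readings each.
breadth   = [abs(estimate(1000, 1) - 20.0) for _ in range(200)]
frequency = [abs(estimate(100, 10) - 20.0) for _ in range(200)]
print(f"breadth   mean error: {statistics.mean(breadth):.4f}")
print(f"frequency mean error: {statistics.mean(frequency):.4f}")
```

Both arrangements are samples of size 1,000, and their average errors come out statistically indistinguishable: the distinction the authors draw dissolves into n.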

Even when, as they elsewhere claim, Big Data measures a population rather than a sample (a purview not unique to Big Data), it can only do so with the same biases inherent to any information system. Even if you’re measuring a population, the problem of induction dictates that you will never know when you’ve actually obtained a “complete” population (you never will, except arbitrarily); and even then, that population only measures a sample of parameters from a sample of endless possibilities. While the authors at least acknowledge that “messiness is not inherent to big data” but is “a function of the imperfection of the tools we use to measure, record, and analyze information” (41), they also make claims like “Moving into a world of big data will require us to change our thinking about the merits of exactitude” (40) (maybe admitting the inexactitude of human methods has shock value for the popular and business audience they’re writing for…), “Big data transforms figures into something more probabilistic than precise” (35) (statistics in general is a probabilistic science), “Processing big data entails an inevitable loss of information” (45) (information loss is inevitable, period), and “In many areas of technology and society, we are leaning in favor of more and messy over fewer and exact” (42) (“fewer” and “exact” are relative terms, and just as messy depending on how they’re used).

Lastly, and most problematically, this rhetoric and this lack of understanding of the inherent biases of perspective and information systems lead the authors to grandiose and paradoxical claims about Big Data’s access to “the totality of information,” albeit through “slackness instead of exactness” (47); that “Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy” (48); and that as it “become[s] a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of n=all of the mind” (48).

John Law, in an article on messiness called “Making a Mess with Method,” would characterize such viewpoints as typical of the common-sense realist, who assumes that there is a world “out there” that is definite, singular, and anterior to and independent of any viewpoint on it (5-6). But in reality, things are messy, and some “things are at least sometimes vague and can only be known vaguely” (5).

This all makes me think that the Big Data version of Borges’s “On Exactitude in Science” may simply be the same story: n=all.

Non-Digital Works Cited

Mayer-Schönberger, Viktor and Kenneth Cukier. Big Data: A Revolution That Will Change How We Live, Work and Think. New York: Mariner, 2014. Print.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Urbana: University of Illinois Press, 2011. Print.


What is Reading, if not Proximity?: From Hermeneutic Codes to Sensory Hallucinations

In Graphs, Maps, Trees, Franco Moretti’s definition of distant reading in opposition to close reading is more provocative or playful than actual, or even polemical; contrary to the misconceptions of many defensive close readers, as Ted Underwood puts it, the binary is not a real choice or debate; Andrew Piper’s medial term “scalar reading” is useful and could be cited more (Piper 382). However, since we’re dealing with a binary opposition, I think that clarifying either term requires questioning an assumption common to both: What is reading? This axiomatic concept is so close to literary scholars, and to subjects of a graphic culture in general, that its assumptions are hard to see; below, I compress a history of relatively gigantic leaps (bearing out any rabbit holes, Alice, with a shrinking potion à la Moretti: “distance is…not an obstacle, but a specific form of knowledge: fewer elements, hence a sharper sense of their overall interconnection” [95, italics his]) in order to question Western visual/sensory assumptions about reading in light of distant reading’s turn towards information visualization.

The hermeneutic practice of close reading, or interpreting a text for self-enclosed meaning (“the capacity of a critical language to substitute itself for another language, to say x actually means y” [Piper 380]), has long been the dominant mode of literary criticism. Far beyond the New Critics of one hundred years ago, the scholarly practice of close reading an exclusively small canon of literary texts can be traced back at least some 2,000 years to the practice of transcribing and interpreting what was then the only literary text considered important: the Bible. More recently, close reading in an English context emerged in the sixteenth century with the Protestant Reformation, which, in conjunction with the new technology of printing, made it possible to distribute God’s Word in the common tongue for interpretation beyond the priestly. Nonetheless, until (but even still) very recently, criticism of the literary saints has been the exclusive right of academic priests gatekeeping the path of conversion from signifier x to transcendent signified y.

Lost in this history, however, is an alternative philosophy of reading that emerged from the Jesuit Counter-Reformation. “[I]n reply to the new Protestant medium of the letterpress,” writes Friedrich Kittler, the Jesuits employed sensually descriptive poetry adapted from the scriptures, “a theater of illusions for all five senses (although the sense of vision took absolute priority in all of the spiritual exercises),” in order to engage a “reading practice for readers who did not stick to the letter but rather experienced its meaning immediately as a sensual hallucination” (78). Nor did the Jesuits use “icons or panels on a church wall” or “miniatures” representing a Biblical story; rather, they sought to create “psychedelic visions” through which to experience the (often painful) story themselves, for example the Stations of the Cross or the flames of Hell (78). Thus “[i]t was a new kind of image worship, which, like the hallucinatory readings, was not directed at the image, but rather at its meaning” (79). This is not reading x for y, transposing one language into another, but experiencing a lived reality of x through/as x itself (the signifier as meaning). By the same stroke, the Jesuit “elite” engaged in a lived writing: they “worked over weeks and months with all possible mortifications of the flesh to actually achieve hallucinations” (79).

This emphasis on experience is different from the one lamented by those who fear distant reading as an encroachment on close reading. Stephen Marche, whose misuses of the words “data” and “information” I responded to in my previous post, argues that “literature is not data. Literature is the opposite of data,” using such phrases as “The experience of the mystery of language is the original literary sensation,” “Meaning is mushy,” and “The very first work of surviving literature [The Epic of Gilgamesh] is on the subject of what can’t be processed as information, what transcends data [i.e. ‘the ineffable’].” These ideas of literary experience are more Protestant (or pre-Reformation Catholic) than Jesuit: while Marche advocates reading for subjective experience and subjective meaning, that experience still involves reading for meaning: the sacred “mystery” and “original” of a “transcend[ent]” signified. Likewise, the ineffable, what exceeds (linguistic) data’s ability to represent (which Marche wrongly calls “information”), is different from something (i.e. some data) that can’t be processed as (sensory) information at all. Since language always fails as perfect representation (because data never translates into 100% information), the experience of language as “ineffable” doesn’t need to be elevated to something “mysterious” if “mysterious” can be reduced to “uncertain,” i.e. a bit (see previous post). If we don’t read with the expectation of being brought from x to y in the first place, then the ineffable is nothing more than the banal experience of language’s everyday inadequacy. Hence the difference from Jesuit reading: the Jesuits processed language data not as sensory information coding/representing another set of linguistic data, but as sensory information, full stop. There was no “ineffable” insofar as, to their minds, graphs (whether graphemes or graphics) could conjure the flames of Hell, 1:1.
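For what “a bit” means here in Shannon’s sense (my gloss, not Marche’s): uncertainty is not mystery but a measurable quantity, and a maximally uncertain yes/no question carries exactly one bit of it. A minimal sketch:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # maximally uncertain coin flip: 1.0 bit
print(entropy_bits([0.9, 0.1]))  # a more predictable one: ~0.47 bits
```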

Now that I’ve mentioned graphs, maybe you sense where I’m going with this. In today’s culture of information visualization, is writing likewise undergoing a (counter-)reformation? Perhaps the important shift marked by “distant reading” is not so much in the “distant” part as in the reading. In Jean Baudrillard’s words, “ ‘virtual’ text (the Internet, word-processing)” is “work[ed] on…like a computer-generated image, which no longer bears any relation to the transcendence of the gaze or of writing….[A]s soon as you are in front of the screen, you no longer see the text as text, but as an image” (76). This very “text” you are reading is only the output of an underlying code designed to draw it. Languages like HTML and CSS are instructions that tell the graphic user interfaces of Internet browsers how to draw text, images, layouts, and so on. So, while the purpose of these languages is, to the computer, hermeneutic (translating from x to y), it is so limited a hermeneutic relation that the term becomes a misnomer, since the computer doesn’t engage in the ambiguity of multiple critical readings; moreover, the purpose of these languages for the human user is not hermeneutic: they are languages designed to draw images, not to be read (except by a different kind of reader: a programmer coding or decoding them to figure out how the computer is unambiguously interpreting them); see the sketch after this paragraph. In this sense, today’s texts, and any hermeneutic engagement with them, already occur at some level of what Kittler calls Jesuit “hallucination.” Moretti’s graphs, likewise, are graphics calculated and drawn by computer-code graphemes not themselves present in the printed book. Perhaps these graphics, then, reflect not only a shift in the scale of hermeneutics (to the macroscopic), but a shift in writing/reading practice: from a rhetoric of representation to a rhetoric made more convincing through increasingly direct sensory engagement. (Cf. my discussion of an augmented-reality library database in Localities.) Piper, although he contextualizes his own reading practice against the fifth-century Augustinian religious-conversion model of reading (382, 384), describes his topological visualizations of Goethe’s corpus in a way reminiscent of Jesuit reading practice: while “reading is always simultaneously a practice of visual interpretation” as well as “decoding,” “topology undoes the binary distinction between text and illustration and rethinks text as illustrative” (388).
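As a crude stand-in for what HTML and CSS do (my illustration, using Python’s tkinter rather than a browser): a few lines whose purpose, for the human user, is not to be read but to make text appear on screen as a drawn image:

```python
import tkinter as tk

# Code that nobody "reads" hermeneutically; it only draws text.
root = tk.Tk()
tk.Label(root, text="This very text is only drawn output.",
         font=("Times", 24, "italic"), fg="darkred").pack(padx=40, pady=40)
root.mainloop()
```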

In a broader sense, this “shift,” however, may only reflect the broader biases of a “writing” culture. Jonathan Sterne deconstructs the written-culture/orality-culture binary by showing how “orality is not a very good description of non-Western, non-industrial cultures”: “There were technologies prior to writing that served some of its functions. Painting and sculpture externalized memory and solidified institutional forms over time. Musical instruments and musical technique were disciplines of the body that subordinated collective communication to abstract codes, even if they were not semantico-referential codes like those of writing” (220, 221). It was only colonial rhetoric that promoted “writing” as a superior cultural marker, and largely because of Biblical logocentrism that we came to view writing as a self-enclosed media form.

Writing has always been but one medium interacting among any number of other forms of collective cultural hallucination. Close reading’s faulty assumption is not only its hyper-closeness to particular literary texts, but its hyper-closeness to writing as an exclusive and exclusively representational medium. Distant readings are no less prey to Western assumptions about reading, but their use of graphics beyond the grapheme gestures towards writing not just as a representational rhetoric but as a directly sensory one, a rhetoric with precedents not only in sixteenth-century Jesuit Counter-Reformational practice but in myriad multimedia cultural forms both pre- and post-“literate,” Western and non-Western.

Works Cited

Baudrillard, Jean. The Intelligence of Evil or the Lucidity Pact. [2004] Trans. Chris Turner. New York: Berg, 2005. Print.

Moretti, Franco. Graphs, Maps, Trees: Abstract Models for a Literary History. London and New York: Verso, 2005. Print.

Piper, Andrew. “Reading’s Refrain: From Bibliography to Topology.” English Literary History 80 (2013): 373-399. Web.

Sterne, Jonathan. “The Theology of Sound: A Critique of Orality.” Canadian Journal of Communication 36 (2011): 207-225. Web.
