On Exactitude and Messiness in (Big) Science (Data)

The stories of Jorge Luis Borges, who tellingly spent most of his life as a librarian, read like thought experiments (Stephen Ramsay would approve: “an impossible fantasy constructed for the purpose of divining the possibilities of the real” [23]), and seem to crop up again and again when discussing matters of information, representation, and hypertextuality. As I’m reading Viktor Mayer-Schönberger and Kenneth Cukier’s Big Data, the Borges story that comes to mind is “On Exactitude in Science,” a paragraph-long fable you can quickly read if you haven’t. Jean Baudrillard uses the story to open Simulacra and Simulation, an appropriately fake copy of which appears in the Wachowskis’ The Matrix (where Neo keeps a cracked disc he sells on the black market), a film that also borrows from Baudrillard’s summary the phrase “Desert of the Real,” subsequently used by Slavoj Žižek in Welcome to the Desert of the Real. In his widely cited article “Database as a Genre of New Media,” Lev Manovich reworks the story for databases:

“Jorge Luis Borges’s story about a map which was equal in size to the territory it represented became re-written as the story about indexes and the data they index. But now the map has become larger than the territory. Sometimes, much larger. Pornographic Web sites exposed the logic of the Web to its extreme by constantly re-using the same photographs from other pornographic Web sites. Only rare sites featured the original content. On any given date, the same few dozen images would appear on thousands of sites. Thus, the same data would give rise to more indexes than the number of data elements themselves.” (181)

My opinions on Manovich have shifted somewhat since discussing the shortfalls of his “database and narrative are natural enemies” (181) formula in light of responses surrounding Ed Folsom’s “Database as Genre,” but Mayer-Schönberger and Cukier’s Big Data compels me to wonder how Borges’s story would be “re-written” for Big Data.

Mayer-Schönberger and Cukier argue that Big Data is defined by three features: there’s “More” data, it’s “Messy,” and it privileges “Correlation” (what) over causation (why). I bring up Borges’s “On Exactitude in Science” in response to the second feature: “a willingness to embrace data’s real-world messiness rather than privilege exactitude” (19). Ironically, the authors are consistently inexact about this exactitude–messiness distinction. To illustrate, they compare three different scenarios, two non-Big Data and one Big Data, for how “to measure the temperature in a vineyard” (34).

Scenario 1 (Non-BD): “If we have only one temperature sensor for the whole plot of land, we must make sure it’s accurate and working at all times: no messiness allowed” (34).

This scenario is misleading. No measurement system is perfectly accurate, no matter how well calibrated or consistently operational. If you have one temperature sensor for a “whole plot of land,” your reading is never going to be 100% accurate relative to the whole, and its accuracy will vary depending on the size of the plot, whose temperature is not going to be consistent throughout. This is the same mistake people make when they hear “a 20% chance of showers” in a forecast. It actually means there is a 20% average probability of rain across all the subregions analyzed. So in one neighbourhood of your city there could be an 8% chance of rain, and in another an 80% chance, but there is no way of knowing from the shorthand recited to you by newscasters. Moreover, the kind of sensor used is going to matter; Wikipedia lists six different kinds of “temperature sensors,” and the authors don’t even specify whether we’re dealing with a mechanical instrument or an electrical one. This additional information matters; the authors shouldn’t be building their case on so underspecified a hypothetical, especially when they’re trying to inform us about differences in information yield across methodologies.
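For the curious, the averaging is trivial to spell out. This is only a toy sketch of that reading of the forecast: the subregion numbers are invented, and the equal weighting is an assumption.

```python
# Invented subregion probabilities whose simple (equally weighted) average is 20%.
subregions = {
    "neighbourhood A": 0.08,
    "neighbourhood B": 0.80,
    "neighbourhood C": 0.05,
    "neighbourhood D": 0.04,
    "neighbourhood E": 0.03,
}

city_wide = sum(subregions.values()) / len(subregions)
print(f"headline forecast: {city_wide:.0%} chance of showers")  # 20%
for name, p in subregions.items():
    print(f"  {name}: {p:.0%}")
```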

Scenario 2 (non-BD): “In contrast, if we have a sensor for every one of the hundreds of vines, we can use cheaper, less sophisticated sensors (as long as they do not introduce a systematic bias). Chances are that at some points a few sensors may report incorrect data, creating a less exact, or ‘messier,’ dataset than the one from a single precise sensor. Any particular reading may be incorrect, but the aggregate of many readings will provide a more comprehensive picture. Because this dataset consists of more data points, it offers far greater value that likely offsets its messiness” (34).

First, the authors imply that because the sensors are less sophisticated, more of them will report incorrect data. The only reason we did not see this phenomenon in Scenario 1, then, is that we had only one sensor, which, relative to the other elements in its set of one, was 100% accurate. However, as discussed, in reality no sensor is 100% accurate. The point about incorrect data therefore amounts to a general statistical observation: a large set of objects is likely to include variations. The “single precise sensor,” as one sensor among a group of many other possible choices of sensor, is no less subject to this; the authors simply missed it because they were blind to the blindspots of that sensor’s precision, which they meaninglessly assessed, self-relatively, as 100%. By adding two variables at once (relative to Scenario 1), more sensors AND less sophisticated sensors, the authors needlessly confuse a relative loss in sensory quality with the constant loss of sensory quality (noise, blindspots) inherent to any information system. This produces the illusion that the resulting data is “less exact” or “messier” only because the sensors are of lower quality, rather than also (and nonetheless) because more information (from more sensors) now needs to be made sense of. As a result, the “incorrect[ness]” of “[a]ny particular reading” is overstated as an absolute value instead of as a value relative to all other readings in the dataset. Likewise, the “aggregate of many readings” is overstated as “more comprehensive” without defining “comprehensive” (they seem to use it interchangeably with “precise” and “accurate”), and therefore without specifying the increased level of abstraction (again, the constant of noise) that results from increased aggregation: we are no longer dealing with one data point, absolutely stated, but with a measure of central tendency abstracted from multiple, relative data points, a measure no less subject to blindspots however more “comprehensive.” This results in meaningless or messy statements like the biased “as long as they do not introduce a systematic bias” (their lesser degree of sophistication would itself already constitute a systematic bias!) and “greater value…likely offsets its messiness.”
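A minimal simulation (not from the book; every number below is invented) makes the two different issues easier to see: the single “precise” sensor is blind to everything beyond its one spot, while the aggregate of many cheap sensors is a measure of central tendency, an abstraction with its own blindspots, even when it lands near the plot-wide mean.

```python
# Toy simulation of the vineyard scenarios (all numbers invented).
import random
import statistics

random.seed(0)

# The "true" temperature varies across the plot: there is no single number to be exact about.
n_vines = 300
true_temps = [18.0 + 6.0 * i / (n_vines - 1) for i in range(n_vines)]  # 18 °C .. 24 °C gradient
plot_mean = statistics.mean(true_temps)

# Scenario 1: one precise sensor (tiny noise) at a single vine.
single_reading = true_temps[0] + random.gauss(0, 0.1)

# Scenario 2: a cheap, noisier sensor at every vine; the readings are aggregated.
cheap_readings = [t + random.gauss(0, 1.5) for t in true_temps]
aggregate = statistics.mean(cheap_readings)

print(f"mean temperature of the whole plot: {plot_mean:.2f} °C")
print(f"one precise sensor, one spot:       {single_reading:.2f} °C")  # exact about very little
print(f"mean of {n_vines} cheap sensors:          {aggregate:.2f} °C")  # an abstraction, not a reading
```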

Scenario 3 (BD) is Scenario 2, plus an increased frequency of sensor readings. Again, loss of quality arising from the degree of the sensors’ sophistication (here, they also add in issues of information transmission, which amounts to the same thing at a different point in the process) is conflated with issues arising from perspectival limitations, to the suppression of the latter. The authors state that, because of noise, “[t]he information will be a bit less accurate, but its great volume makes it worthwhile to forgo strict exactitude” (34). It is true that a better ratio of information to noise results in a clearer message, but this has nothing to do with forgoing the chimera of “strict exactitude”; it is only the same situation we’ve been in all along, in each scenario. The “strict exactitude” of the first scenario only arises because both “volume” and “exactitude” are taken to equal one. The authors are so noisy in communicating their ideas that they miss their own information-theoretic joke: “[t]he information will be a bit less accurate” is more than one bit inaccurate.

They then conclude about Scenarios 2 and 3 as follows: “In the first example, we sacrificed the accuracy of each data point for breadth, and in return we received detail that we otherwise could not have seen. In the second case, we gave up exactitude for frequency, and in return we saw change that we otherwise would have missed” (35). By “breadth,” they mean sample size: more sensors meant more data; but by “frequency,” don’t they also mean sample size? More sensors measuring more frequently means a larger sample (or higher sample rate). It thus seems to me that, contrary to their three special features of Big Data, the only difference between these three scenarios, as they’ve articulated it, is sample size. (In this, I agree with Ted Underwood that Big Data is about scale and scale is relative. So, just as Digital Humanities will one day be called humanities, Big Data will probably just be called Data, once Earth becomes but one data point among many.) And even then, they misunderstand sample size, because in Scenario 1, they think that a sample size of 1 will lead to the most “precise,” “accurate,” or “exact” results (terms which, so interchanged, cease to describe themselves).
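The point can be put in a few lines of code, under the idealizing assumption (which quietly does a lot of work) that the cheap sensors’ errors are independent and unbiased and that the quantity being measured is static: whether the readings come from more sensors (“breadth”) or from more frequent measurements (“frequency”), the uncertainty of the aggregate depends only on the total number of readings, that is, on sample size.

```python
# Sketch: "breadth" and "frequency" both reduce to sample size, given independent,
# unbiased noise on a static quantity (a big, quietly load-bearing assumption).
import math

sigma = 1.5  # per-reading noise level, invented for illustration


def standard_error(n_sensors: int, readings_per_sensor: int) -> float:
    n = n_sensors * readings_per_sensor  # total readings = sample size
    return sigma / math.sqrt(n)


print(standard_error(n_sensors=300, readings_per_sensor=1))    # breadth
print(standard_error(n_sensors=1, readings_per_sensor=300))    # frequency: same result
print(standard_error(n_sensors=300, readings_per_sensor=60))   # both: smaller still
```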

Even when, as they elsewhere claim, Big Data measures a population rather than a sample (a purview not unique to Big Data), it can only do so with the same biases inherent to any information system. Even if you’re measuring a population, the problem of induction dictates that you will never know when you’ve actually obtained a “complete” population (you never will, except arbitrarily), and that population will only measure a sample of parameters out of a sample of endless possibilities. While the authors at least acknowledge that “messiness is not inherent to big data” but is “a function of the imperfection of the tools we use to measure, record, and analyze information” (41), they also make claims like “Moving into a world of big data will require us to change our thinking about the merits of exactitude” (40) (maybe admitting the inexactitude of human methods has shock value for the popular and business audience they’re writing for…), “Big data transforms figures into something more probabilistic than precise” (35) (statistics in general is a probabilistic science), “Processing big data entails an inevitable loss of information” (45) (information loss is inevitable, period), and “In many areas of technology and society, we are leaning in favor of more and messy over fewer and exact” (42) (“fewer” and “exact” are relative terms, and just as messy depending on how they’re used).

But lastly, and most problematically, this kind of rhetoric and this lack of understanding of the inherent biases of perspective and information systems lead the authors to grandiose and paradoxical claims: that Big Data has access to “the totality of information,” albeit through “slackness instead of exactness” (47); that “Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality than did our dependence on small data and accuracy” (48); and that as it “become[s] a regular part of everyday life, we as a society may begin to strive to understand the world from a far larger, more comprehensive perspective than before, a sort of n=all of the mind” (48).

John Law, in an article on messiness called “Making a Mess With Method,” would characterize such viewpoints as typical of the common-sense realist, who assumes that there is a world “out there” that is definite, singular, and anterior to, and independent of, any viewpoint on it (5-6). But in reality, things are messy, and some “things are at least sometimes vague and can only be known vaguely” (5).

This all makes me think that the Big Data version of Borges’s “On Exactitude in Science” may simply be the same story: n=all.

Non-Digital Works Cited

Mayer-Schönberger, Viktor and Kenneth Cukier. Big Data: A Revolution That Will Change How We Live, Work and Think. New York: Mariner, 2014. Print.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Chicago: University of Illinois Press, 2011. Print.


Reading Machines Reading Machines: 100 Trillion Poems and the Scaling of Possibility

In Reading Machines: Toward an Algorithmic Criticism, Stephen Ramsay alludes to Raymond Queneau’s Cent Mille Milliards de Poèmes (100,000,000,000,000 Poems), which consists of 10 sonnets (14 lines each) in which each line is interchangeable with the same-numbered line from any of the other sonnets. Ramsay explains:

“If there are ten possibilities for the first line, and any one of those lines may be followed by one of ten additional lines, it follows that the number of possible combinations is 10 × 10 (or 10² = 100). Since that number will increase by a factor of ten with each additional line, a fourteen-line poem becomes, in its potential state, the Cartesian product of its sets: i.e., 10¹⁴ (100 trillion) possibilities. Queneau determined that a person reading the book twenty-four hours a day would need 190,258,751 years to finish it…” (26)
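The arithmetic is easy to check; the one-sonnet-per-minute pace below is an assumption on my part, but it is the pace that reproduces Queneau’s figure.

```python
# Checking the combinatorics and Queneau's reading-time estimate.
lines_per_sonnet = 14
choices_per_line = 10

total_poems = choices_per_line ** lines_per_sonnet      # 10**14, i.e. 100 trillion
minutes_per_poem = 1                                    # assumed pace, around the clock
years = total_poems * minutes_per_poem / (60 * 24 * 365)

print(f"{total_poems:,} possible sonnets")              # 100,000,000,000,000
print(f"~{years:,.0f} years of non-stop reading")       # ~190,258,752 (Queneau: 190,258,751)
```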

With its “generative qualities” (Ramsay 28), Queneau’s book seems the perfect monument for Ramsay’s proposal for a textual criticism that blurs boundaries with artistic reinvention and opens texts up to imaginative possibilities and potentialities. He sees algorithms as a way of doing this, and argues that all textual criticism, as a process of rewriting and transforming, has been doing this all along, rhetorically and methodologically; it just hasn’t been transparent about it, and as a result has restricted the possibilities for reading.

But for me Queneau’s book equally stands in for Ramsay’s haziness in defining “possibilities” and “potentialities” beyond idealistic hurrahs. I’m on board with Ramsay’s algorithmic criticism, but I’m not convinced by its unbridled veneration of the possible or potential.

Compare the description of Queneau’s book given in Christian Bök’s dissertation:

“a single reader, reading one [sonnet] a second, must survive for more than a thousand millennia in order to read every poem. Such a book remains inscrutable not because of its illegibility but because of its potentiality. Such a book…deals with the as if of what is possible in virtuality but impossible in actuality” (78)

Both Ramsay and Bök praise Queneau’s book for exceeding the reading capacities of a human life, as a measure of its awesome potential. And yet at the same time, both writers measure the human’s failure to match that potential by a standard of completeness: the failure to “read every poem” (Bök) or to “finish it” (Ramsay). This is the human impossibility by which they measure a text’s adequate degree of critical possibility. But this measure of possibility collapses in the face of Ramsay’s own proposed reading method, not to mention Bök’s comparison of Queneau’s book to a “literary computer” (78). For, while the human cannot read the book’s one hundred trillion poems, couldn’t a computer?! “A single [human] reader…must survive for more than a thousand millennia in order to read every poem” but a computer could read the book’s one hundred trillion poems in a matter of minutes or seconds. In fact, it could do even better: it could read the book’s one hundred trillion poems one hundred trillion times. Or more.
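To make the computer’s “reading” concrete: every one of the 10^14 sonnets can be addressed by a fourteen-digit index, each digit selecting which of the ten source sonnets supplies that line. A sketch of the idea, with placeholder strings standing in for Queneau’s actual lines:

```python
# Addressing Queneau's 10**14 sonnets by index; placeholder strings stand in for the real lines.
PLACEHOLDER_LINES = [[f"sonnet {s + 1}, line {p + 1}" for p in range(14)] for s in range(10)]


def poem(index: int) -> list[str]:
    """Return the 14 lines of the sonnet addressed by index (0 .. 10**14 - 1)."""
    digits = f"{index:014d}"  # one digit per line position
    return [PLACEHOLDER_LINES[int(d)][pos] for pos, d in enumerate(digits)]


print("\n".join(poem(0)))            # every line drawn from source sonnet 1
print("\n".join(poem(10**14 - 1)))   # every line drawn from source sonnet 10
# A machine "reading" would iterate or sample over range(10**14) rather than dwell on any one poem.
```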

This fact quintessentially illustrates reading’s scalability: there can be no distinction between “close” and “distant” in reading Queneau’s book. Even a close reading requires the methods of distant reading. For what hermeneutic claims about “meaning” can we make for a book of which we have time to read maybe 10 of its 100,000,000,000,000 poems, and only once? We need more readings of more poems, for which we need a computer. Of course, from a poststructuralist perspective, every book is more than the sum of its visible pages – we are always already reading only a small percentage of “possibilities.” But Queneau’s book explicitly makes this a problem – a problem of quantification. It suggests that maybe we must turn to statistics, and take a sample to represent a larger population. The poststructuralist in me says that this would make the mistake (“scientistic,” Ramsay would call it) of treating the work as a potential whole, when in fact every text is inexhaustible. But perhaps that should be revised to something like: every text is inexhaustible, on an infinite timeline. Reading is always going to be incomplete, but to read at all requires some limited sample space of assumed stability.

Therefore, as the computer’s speed demonstrates, Queneau’s book is only relatively massive. Our computer’s reading of it would be no different from reading “one” sonnet, for example Shakespeare’s Sonnet 1, one hundred trillion times, or from conducting an aggregate study of a hundred trillion academic articles on Sonnet 1 (if such a collection existed). But it would also be only marginally different from reading Sonnet 1 only once. For who is to define the upper limit on Sonnet 1s, if the limit is always going to be relative to our allotted sample space? The New Critics would define one reading of Sonnet 1 as “complete” – a ratio of 1 reading to 1 sonnet (just as, by Bök and Ramsay’s definition of “completeness,” one sonnet of Queneau’s one hundred trillion would be a complete reading of that one sonnet). Poststructuralists would say no reading is ever complete: 1:n, where n approaches infinity (n possible versions of Sonnet 1). But what’s more difficult are the scales in between, and the serious questions Queneau’s book raises about how we read and what we read for.

To illustrate, consider a simple deformance of Queneau’s text: what if we read each line as interchangeable with every other line in the book, instead of only with the lines occupying the same position (e.g., line 1 of each sonnet)? This would result in even more possible sonnets – Bök and Ramsay could praise generative, life-exceeding, hyper-actual, hypothetical possibilities to exponentially greater degrees! But the marginal shift in rhetoric such a deformance enacts suggests that these extra possibilities wouldn’t tell us much more about anything. And the computer could still read them.
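Counting that deformance takes two lines; the figure assumes, unlike Queneau’s design, that rhyme and metre are simply ignored.

```python
# The deformance: any of the 140 lines in any of the 14 positions (rhyme and metre ignored).
original = 10 ** 14
deformed = 140 ** 14
print(f"{original:.1e} -> {deformed:.1e}")  # 1.0e+14 -> ~1.1e+30
```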

Queneau’s book is, as Ramsay suggests, generative. In this sense, the content is less important than the mechanism: 10^14 times less important, perhaps. But then the possibilities, potentialities, and virtualities with which Ramsay and Bök characterize it maybe aren’t that important either. Not as important as the particularities and actualities of one or more readings through it. Reading it poem by poem, as Ramsay and Bök grandiosely metaphorize a hypothetical human reader as doing, is just one algorithm for reading it. Isn’t this precisely the kind of limited, unconscious algorithm Ramsay wants to shift us away from?! Essentially, Queneau’s book is a database, and a database can provide different answers based on different queries. So the question is, given a computer that can read it as much as we want and in many ways, what queries might we ask of this text?
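As a first pass at what such queries could run against, here is a sketch of the table and one possible query. The placeholder strings again stand in for Queneau’s lines, and sqlite3 is used only so the sketch is self-contained; the schema and the query would carry over to MySQL more or less unchanged.

```python
# Sketch of Queneau's book as a database, with one possible query: assemble a random sonnet.
import random
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE poem_lines (
    source_sonnet INTEGER,   -- 1..10: which of Queneau's ten sonnets
    position      INTEGER,   -- 1..14: line position within the sonnet
    text          TEXT
)""")
db.executemany(
    "INSERT INTO poem_lines VALUES (?, ?, ?)",
    [(s, p, f"[sonnet {s}, line {p}]") for s in range(1, 11) for p in range(1, 15)],
)

# One query among many: pull a random source line for each of the 14 positions.
sonnet = [
    db.execute(
        "SELECT text FROM poem_lines WHERE position = ? AND source_sonnet = ?",
        (p, random.randint(1, 10)),
    ).fetchone()[0]
    for p in range(1, 15)
]
print("\n".join(sonnet))
```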

I’m interested in feedback as I’m thinking about elaborating this post into my research paper, which would involve placing Queneau’s book into a MySQL database in order to experiment with this kind of “brute force” reading, as I’ve called it earlier, and to inquire further into how it challenges our assumptions, methods, and goals as readers.

Works Cited

Bök, Christian. ’Pataphysics: The Poetics of an Imaginary Science. Evanston, IL: Northwestern University Press, 2002. Print.

Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Chicago: University of Illinois Press, 2011. Print.
