Similar advances have taken place with 3-D models. On the archive for 3-D models generated in the software SketchUp, you can find insanely detailed three-dimensional virtual models of most major building structures of the world. Need a street in New York? Here’s a filmable virtual set. Need a virtual Golden Gate Bridge? Here it is in fanatical detail, every rivet in place. With powerful search and specification tools, high-resolution clips of any bridge in the world can be circulated into the common visual dictionary for reuse. Out of these ready-made “phrases” a film can be assembled, mashed up from readily available clips or virtual sets. Media theorist Lev Manovich calls this “database cinema.” The databases of component images form a whole new grammar for moving images.

After all, this is how authors work. We dip into a finite database of established words, called a dictionary, and reassemble these found words into articles, novels, and poems that no one has ever seen before. The joy is recombining them. Indeed, it is a rare author who is forced to invent new words. Even the greatest writers do their magic primarily by remixing formerly used, commonly shared ones. What we do now with words, we’ll soon do with images.

For directors who speak this new cinematographic language, even the most photorealistic scenes are tweaked, remade, and written over frame by frame. Filmmaking is thus liberated from the stranglehold of photography. Gone is the frustrating method of trying to capture reality with one or two takes of expensive film and then creating your fantasy from whatever you get. Here reality, or fantasy, is built up one pixel at a time as an author would build a novel one word at a time. Photography exalts the world as it is, whereas this new screen mode, like writing and painting, is engineered to explore the world as it might be.

But merely producing movies with ease is not enough, just as producing books with ease on Gutenberg’s press did not fully unleash text. Real literacy also required a long list of innovations and techniques that permitted ordinary readers and writers to manipulate text in ways that made it useful. For instance, quotation symbols make it simple to indicate where one has borrowed text from another writer. We don’t have a parallel notation in film yet, but we need one. Once you have a large text document, you need a table of contents to find your way through it. That requires page numbers. Somebody invented them in the 13th century. What is the equivalent in video? Longer texts require an alphabetic index, devised by the Greeks and later developed for libraries of books. Someday soon with AI we’ll have a way to index the full content of a film. Footnotes, invented in about the 12th century, allow tangential information to be displayed outside the linear argument of the main text. That would be useful in video as well. And bibliographic citations (invented in the 13th century) enable scholars and skeptics to systematically consult sources that influence or clarify the content. Imagine a video with citations. These days, of course, we have hyperlinks, which connect one piece of text to another, and tags, which categorize using a selected word or phrase for later sorting.

All these inventions (and more) permit any literate person to cut and paste ideas, annotate them with her own thoughts, link them to related ideas, search through vast libraries of work, browse subjects quickly, resequence texts, refind material, remix ideas, quote experts, and sample bits of beloved artists. These tools, more than just reading, are the foundations of literacy.

If text literacy meant being able to parse and manipulate texts, then the new media fluency means being able to parse and manipulate moving images with the same ease. But so far, these “reader” tools of visuality have not made their way to the masses. For example, if I wanted to visually compare recent bank failures with similar historical events by referring you to the bank run in the classic movie It’s a Wonderful Life, there is no easy way to point to that scene with precision. (Which of several sequences did I mean, and which part of them?) I can do what I just did and mention the movie title. I might be able to point to the minute mark for that scene (a new YouTube feature). But I cannot link from this sentence to only those exact “passages” inside an online movie. We don’t have the equivalent of a hyperlink for film yet. With true screen fluency, I’d be able to cite specific frames of a film or specific items in a frame. Perhaps I am a historian interested in oriental dress, and I want to refer to a fez worn by someone in the movie Casablanca. I should be able to refer to the fez itself (and not the head it is on) by linking to its image as the hat “moves” across many frames, just as I can easily link to a printed reference of the fez in text. Or even better, I’d like to annotate the fez in the film with other film clips of fezzes as references.
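
None of that machinery exists yet, but it is easy to imagine its shape. As a purely hypothetical sketch (nothing like this is a standard today), a frame-level citation might name the film, the span of frames, and the region of the frame occupied by the object being cited; every field and number below is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class FrameCitation:
    """Hypothetical 'hyperlink for film': cites one object across a span of frames."""
    film: str            # the work being cited, e.g. "Casablanca (1942)"
    start_frame: int     # first frame in which the object appears
    end_frame: int       # last frame of the cited span
    region: tuple        # (x, y, width, height) bounding box around the object
    note: str = ""       # free-text annotation, like a footnote

# A citation for the fez itself, not the head it sits on (frame numbers made up):
fez = FrameCitation(
    film="Casablanca (1942)",
    start_frame=11_520,
    end_frame=11_760,
    region=(412, 96, 80, 70),
    note="Compare with fezzes in other films of the period.",
)
print(fez)
```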

With full-blown visuality, I should be able to annotate any object, frame, or scene in a motion picture with any other object, frame, or motion picture clip. I should be able to search the visual index of a film, or peruse a visual table of contents, or scan a visual abstract of its full length. But how do you do all these things? How can we browse a film the way we browse a book?

It took several hundred years for the consumer tools of text literacy to crystallize after the invention of printing, but the first visual literacy tools are already emerging in research labs and on the margins of digital culture. Take, for example, the problem of browsing a feature-length movie. One way to scan a movie would be to super-fast-forward through the two hours in a few minutes. Another way would be to digest it into an abbreviated version in the way a theatrical movie trailer might. Both these methods can compress the time from hours to minutes. But is there a way to reduce the contents of a movie into imagery that could be grasped quickly, as we might see in a table of contents for a book?

Academic research has produced a few interesting prototypes of video summaries, but nothing that works for entire movies. Some popular websites with huge selections of movies (like porn sites) have devised a way for users to scan the content of a full movie in a few seconds. When a user clicks the title frame of a movie, the window skips from one key frame to the next, making a rapid slide show, like a flip book of the movie. The abbreviated slide show visually summarizes a multi-hour film in a few seconds. Expert software can be used to identify the key frames in a film in order to maximize the effectiveness of the summary.
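
As a rough sketch of how such a key-frame picker might work (this is not any particular site's method; the library, threshold, and file name below are all assumptions), one naive approach is to keep only frames that differ sharply from the last frame kept:

```python
import cv2          # OpenCV, assumed installed (pip install opencv-python)
import numpy as np

def key_frames(path, threshold=30.0, max_frames=12):
    """Naive key-frame picker: keep a frame whenever it differs enough
    from the previously kept frame. The threshold and cap are arbitrary."""
    cap = cv2.VideoCapture(path)
    kept, last = [], None
    while len(kept) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last is None or np.mean(cv2.absdiff(gray, last)) > threshold:
            kept.append(frame)       # one still for the flip-book summary
            last = gray
    cap.release()
    return kept

# stills = key_frames("some_movie.mp4")   # hypothetical file name
```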

The holy grail of visuality is findability—the ability to search the library of all movies the same way Google can search the web, and find a particular focus deep within. You want to be able to type key terms, or simply say, “bicycle plus dog,” and then retrieve scenes in any film featuring a dog and a bicycle. In an instant you could locate the moment in The Wizard of Oz when the witchy Miss Gulch rides off with Toto. Even better, you want to be able to ask Google to find all the other scenes in all movies similar to that scene. That ability is almost here.
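
A minimal sketch of what that kind of findability presupposes: assume, hypothetically, that every scene has already been labeled with the objects visible in it; a query like "bicycle plus dog" then reduces to a set intersection over an inverted index. The titles are real films, but the labels and timestamps below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical precomputed labels: (film, scene start in seconds) -> objects seen.
scene_labels = {
    ("The Wizard of Oz", 312): {"bicycle", "dog", "basket"},
    ("The Wizard of Oz", 5400): {"dog", "lion"},
    ("Casablanca", 1530): {"fez", "bar"},
}

# Inverted index: object label -> the scenes that contain it.
index = defaultdict(set)
for scene, labels in scene_labels.items():
    for label in labels:
        index[label].add(scene)

def find(*terms):
    """Return the scenes containing every queried object."""
    hits = [index[t] for t in terms]
    return set.intersection(*hits) if hits else set()

print(find("bicycle", "dog"))   # -> {("The Wizard of Oz", 312)}
```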

Google’s cloud AI is gaining visual intelligence rapidly. Its ability to recognize and remember every object in the billions of personal snapshots that people like me have uploaded is simply uncanny. Give it a picture of a boy riding a motorbike on a dirt road and the AI will label it “boy riding a motorbike on a dirt road.” It captioned one photo “two pizzas on a stove,” which was exactly what the photo showed. Both Google’s and Facebook’s AI can look at a photo and tell you the names of the people in it.