Proofreading, E-book Typos, and Project Gutenberg
February 27, 2014 by libroshombre
“Self-editing is the path to the dark side,” warns writer Eric Benoit. “Self-editing leads to self-delusion, self-delusion leads to missed mistakes, missed mistakes lead to bad reviews.” I entered the dark side when I deluded myself into sending an early draft of last week’s column to the News Miner instead of the smoother, more cogent final version, which is obtainable by emailing me at hillofbooks@gmail.com.
Proofreading’s among my biggest complaints about e-books. Up there with “Why can’t I bequeath someone else the e-books I’ve purchased?” is “Why are even newly released e-books riddled with typographical errors?” “C” appears as “e,” “j” as “i,” words are jumbled, etc. The older the work, the worse it is. I bought several 99-cent copies of Montaigne’s “Essays,” published in the 1500s, for my Nook before finding a version that was readable. The others had entire chapters scrambled.
The problem’s the same even with newly published e-books, albeit on a lesser scale. In “Why Are E-books Riddled with Typos?,” a Forbes.com article, Tim Worstall identified the surge of self-published books as a major culprit, since “almost no self-publishers are passing their work under the nose of an editor.” The authors often believe they can proof their own work, but the human mind’s wired to breeze through its own writings, mentally superimposing the correct words onto existing errors. You can try techniques like reading your work backwards, sentence by sentence, but it’s always better to bring in another brain and set of eyes.
What really irks Worstall and me is when brand new e-books are rife with typos, which happens frequently when print is digitized for computers. This involves Optical Character Recognition, or OCR, software that’s used to scan printed book pages, recognize the images as words, and then convert the scanned words into text files. A study in D-Lib Magazine, an online publication “dedicated to digital library research and development,” said “most OCR software claims 99% accuracy,” while the average standard for OCR accuracy is 90-98%.
Doesn’t sound like much? There are between 2,500 and 4,000 characters on an 8-by-11 page, meaning that at 95% accuracy a full page could carry 200 OCR errors. Even a 99% accuracy rate allows 40 mistakes. On every page. As Worstall complained, the publishers “aren’t paying enough attention to the production process.” The errors in new e-books boil down to publishers not employing enough humans to spot the problems.
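For anyone who wants to check the page arithmetic, here is a minimal Python sketch. It assumes accuracy is measured per character and that errors are spread evenly across the page; the function name `ocr_errors_per_page` is hypothetical and used only for illustration.

```python
def ocr_errors_per_page(chars_per_page: int, accuracy_percent: int) -> int:
    """Rough count of misread characters on one page.

    Assumes accuracy is a whole-number percentage of characters read
    correctly, so the error rate is (100 - accuracy_percent) percent.
    """
    return chars_per_page * (100 - accuracy_percent) // 100


if __name__ == "__main__":
    # Page sizes and accuracy rates taken from the figures quoted above.
    for chars in (2500, 4000):
        for accuracy in (90, 95, 98, 99):
            errors = ocr_errors_per_page(chars, accuracy)
            print(f"{chars} characters at {accuracy}% accuracy: {errors} errors")
```

On a full 4,000-character page this gives 40 errors at 99% accuracy and 200 at 95%; at the low end of the quoted range, 90%, the same page would carry 400.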
Not so with Project Gutenberg, which has a system for producing much cleaner e-books, thanks to help from a non-profit outfit called Distributed Proofreaders. Project Gutenberg’s an archive of 44,890 out-of-copyright titles that are readable and free at Gutenberg.org. It’s the brainchild of Michael Hart, the inventor of e-books, whose dream was to preserve the great written works of humankind and make them freely available through the Internet. Hart spoke at Noel Wien Library a few years before he died in 2011, and it was one of the more free-wheeling lectures ever given there, which is saying something.
Hart’s brother’s best friend worked on the new Xerox Sigma V mainframe computer at the University of Illinois, one of the nodes of the spanking new Internet, and the friend finagled for Michael an unlimited account on the Xerox, which has since been estimated to have been worth from $100,000 to $100 million. Hart conceived of Project Gutenberg as a way to “give back” to society by creating a database of “the 10,000 most consulted books.” He did that and more with an army of volunteers from Distributed Proofreaders.
Distributed Proofreaders sends each volunteer proofer a scanned page from a book along with the OCR version of the same page. The volunteer makes any necessary changes to the OCR version and sends it on to another volunteer proofer, who repeats the task. Finally, it’s sent to a “post processor” who reviews the work and compiles the pages into a book. This allows for closer scrutiny than reading the undiluted OCR version alone, and it’s easier to spot errors when reading pages out of context. It also permits scores of people to work on a single book simultaneously.
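The page-by-page relay described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not Distributed Proofreaders’ actual software: the `Page` class, the `fix_first_mismatch` volunteer, and the `post_process` function are all hypothetical, and the scanned image is stood in for by a plain string of what the printed page really says.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    number: int
    reference: str  # stand-in for the scanned image (what the print page says)
    ocr_text: str   # raw OCR output for the same page


# A proofer sees the scan and the current text, and returns a corrected text.
Proofer = Callable[[str, str], str]


def fix_first_mismatch(reference: str, text: str) -> str:
    """Hypothetical volunteer who catches only the first wrong word."""
    ref_words, words = reference.split(), text.split()
    for i, (ref_word, word) in enumerate(zip(ref_words, words)):
        if ref_word != word:
            words[i] = ref_word
            break
    return " ".join(words)


def proof_page(page: Page, proofers: List[Proofer]) -> str:
    """Pass one page through each proofing round in turn."""
    text = page.ocr_text
    for proofer in proofers:
        text = proofer(page.reference, text)
    return text


def post_process(pages: List[Page], proofers: List[Proofer]) -> str:
    """Proof every page independently, then compile them in page order."""
    ordered = sorted(pages, key=lambda p: p.number)
    return "\n".join(proof_page(p, proofers) for p in ordered)


# Pages can arrive in any order because each is proofed on its own.
pages = [
    Page(2, "on the jar", "on tbe iar"),
    Page(1, "The cat sat", "The eat sat"),
]
book = post_process(pages, [fix_first_mismatch, fix_first_mismatch])
print(book)
```

With a single round, the second page would still read “on the iar”; the second volunteer catches what the first one missed, which is the point of passing each page along. Because no page depends on any other, the rounds could just as well be handled by many people at once.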
As for future columns, I’ll double-check them with the proofreader I live with before submission, following Shakespeare’s advice: “Be sure of it; give me the ocular proof.”