Monday, April 1, 2013

From scanned PDF to OCR to Word to ebook: not so fast, mate

I have been trying to explain to a friend why she can't just take a scanned PDF of a book, get it converted into a Word document with Optical Character Recognition (OCR), and send it straight to Smashwords.

Now, Smashwords is fantastic, and every author should check it out. Starting with your Word document, the Smashwords software (which bears the splendid name of Meatgrinder) converts a single source manuscript into every popular form of ebook, including epub, iBook, Kindle and of course the familiar PDF.

And OCR is fantastic. Who would have imagined that software could virtually read photographs of the pages of a novel and translate the squiggles into pages of words?

However, OCR has its limitations.

Every few months, when I have a spare minute, I puddle away at preparing my backlist for distribution as ebooks. My plan is to put the old novels up for free. No hurry, though.

Yesterday The Limits of Green had my attention. That was my first novel, published by Penguin in 1985. Below is a sample of how it emerged from the OCR treatment. Characters were indeed recognised, but not always correctly. OCR is more like a willing puppy than an expert.



I see at least 13 errors in 16 lines, not counting the formatting. Yes, a human hand is needed—and I am happy about that. I'm thoroughly enjoying my copyediting chore because this involves re-reading a novel I wrote in my youth. Such exuberance! Such reckless confidence! Such mobile-phone-less ingenuity! I look forward to releasing this baby into the wild.
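For anyone facing the same chore, here is a rough sketch of how a short script might point the human hand at the likeliest trouble spots. It is only an illustration, not something I used on the novel: the function name `flag_suspect_words` and the three patterns are my own assumptions about typical OCR slips (a digit inside a word, a capital in the middle of a word, a stray symbol), and they will miss plenty.

```python
import re

# Hypothetical heuristics for typical OCR slips; adjust them for your own text.
SUSPECT_PATTERNS = [
    re.compile(r"[A-Za-z][0-9]|[0-9][A-Za-z]"),  # digit next to a letter, e.g. "c1osed"
    re.compile(r"[a-z][A-Z]"),                   # capital inside a word, e.g. "tHe"
    re.compile(r"[|\\~^_{}<>]"),                 # symbols rarely found in prose
]


def flag_suspect_words(text):
    """Return (line_number, word) pairs that deserve a human look."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            if any(pattern.search(word) for pattern in SUSPECT_PATTERNS):
                flagged.append((line_number, word))
    return flagged


if __name__ == "__main__":
    sample = "He c1osed the door.\nShe sm|led at tHe dog."
    for line_number, word in flag_suspect_words(sample):
        print(f"line {line_number}: {word}")
```

Expect false alarms (names like "McDonald", ordinals like "1st") and expect misses: a wrong word that is still a real word sails straight through. The script only shortens the hunt; the re-reading still has to be done by a person.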

