The Deliverator – Wannabee

So open minded, my thoughts fell out…

The perpetually sucky state of non-destructive book scanning

Posted by Deliverator on February 7th, 2012

Every few years I find myself in the unenviable position of unavoidably needing to non-destructively scan a book. Every few years I pray that someone has come up with an affordable, reasonably quick way of doing this that produces good quality results. Every few years I burn an evening researching the state of the field. Every few years I come away disappointed. Here are my observations from this go-around:

Hardware:

Sheet Feeders – If you can afford to destroy the book, you can cut off the binding with a fine-toothed band saw or other power tool of your choice and feed the pages through a sheet-feeder style scanner. Sheet feeders like the popular Fujitsu ScanSnap series can scan both sides of each page at something like 20 pages per minute at 600ish DPI. This is mighty impressive, as it cuts the actual scanning time for a book down to something like a half hour. Unfortunately, when I find myself needing to scan a book, it is usually some rare tome it took me 2 years and $300 to find on AbeBooks. For my purposes, solutions requiring a band saw need not apply. Also, many of the better scanners cost $400+, which is pushing what I would consider affordable.

Commercial Copy Stands – These range from simple single overhead camera rigs to more complex dual camera rigs with adjustable cradles that support the book without damaging it, re-positionable lighting, non-reflective glass to hold the pages flat, automatic page flippers, etc. Commercial, off-the-shelf solutions from companies like Atiz can run $14k+ without even factoring in the cost of cameras (typically high end Canon DSLRs). Great if you are a university library spending grant money, sucky if you are a book nerd on a budget.

DIY Copy Stands – A substantial percentage of the functionality, speed and quality of these rigs can be replicated for under $1000 by building your own dual camera copy stand following one of the several increasingly standardized designs from the DIY Book Scanner project. This is still more time and money than I want to spend and probably more space than I want to waste for a device I would only very seldom use. When a full, well documented/supported, single evening kit is available for under $300 plus the cost of cameras, I will probably be interested. The BookLiberator looked to commercially produce kits that would meet all my requirements, but efforts to produce the units fell apart after Ion Audio announced its similar sub-$200 BookSaver product at CES 2011. Ion has since VERY quietly pulled the plug on the BookSaver without ever selling any, but their initial product announcement was enough to send most small, independent efforts to produce a similar device scurrying for someplace small and dark.

Flatbed Scanners – Flatbeds are an inexpensive, mature, widely used technology that sucks at scanning books in a wide variety of ways. First, most flatbeds are optimized for high quality scans of things like photos, not for speed. Second, most have a significant bezel around the scanning platen, meaning the only way to scan a book is to significantly bend/distort the spine in order to get the pages to lie flat against the glass. Even with the book mashed against the glass, you usually get significant page distortion near the binding, resulting in curving text and uneven illumination.

Several years ago I purchased a Plustek OpticBook 3600 plus, a flatbed scanner specially optimized for scanning books. The OpticBook has a very thin bezel along one edge of the platen which lets you open a book to a 90 degree angle and have one page flat against the glass while the other hangs freely over the side. This lets you produce an undistorted scan of a page without significantly bending the spine. The “DigiBook” software included with the scanner has an automatic page rotation feature, so that every other page gets rotated 180 degrees. This lets you scan a page, flip the book over to scan the opposite page and have everything automatically rotated the right way. There are giant over-sized buttons on the scanner that let you trigger a scan in B&W, greyscale or color. The actual scan takes 5-8 seconds, as the scanner is optimized for speed, rather than highest possible DPI.
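The alternating rotation trick is easy to reproduce if your scanning software doesn't do it for you. Here is a minimal sketch, assuming the scans are listed in capture order and the first page was scanned right side up (swap the 0 and 180 if you started the other way). The function names are made up for illustration, and `apply_rotation` assumes the Pillow library is installed; this is not how DigiBook does it internally, just the same idea.

```python
def rotation_plan(scan_files):
    """Pair each scan (in capture order) with the rotation it needs, in degrees.

    Assumes pages at even positions (0, 2, 4, ...) are upright and every
    other page was scanned upside down after flipping the book over.
    """
    return [(name, 0 if i % 2 == 0 else 180) for i, name in enumerate(scan_files)]


def apply_rotation(src, dst, degrees):
    """Write a rotated copy of one scan. Requires Pillow (an assumption)."""
    from PIL import Image  # imported lazily so the planner works without Pillow

    with Image.open(src) as img:
        img.rotate(degrees).save(dst)


if __name__ == "__main__":
    for name, degrees in rotation_plan(["scan_001.tif", "scan_002.tif", "scan_003.tif"]):
        print(name, degrees)
```

Keeping the plan separate from the image work means you can eyeball the list before touching any files.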

The OpticBook concept is very nice in theory, but the implementation leaves something to be desired. Even with the scanner bezel as thin as it is, the scanning element doesn’t get close enough to the binding to scan most paperbacks. It works fine for hard covers like textbooks, where the content doesn’t start as close to the binding. The software is also very crash prone and the workflow somewhat less than ideal, with the operator having to hit a “transfer” button in the software after each page to write the image out to disk, despite the over-sized buttons on the scanner itself. Anything that adds 5-10 extra seconds to the workflow gets multiplied tremendously over a 500+ page book. These scanners are also very poorly sealed, with significant dust accumulating on the interior of the glass plate and no easy way to clean it short of disassembly. There doesn’t appear to be a way to adjust the lamp brightness, so you tend to get a bit of bleed-through from text on the other side of the page you are scanning. Many users also complain of short bulb life, although my unit is still functional. From reviews I’ve read, I am not convinced that Plustek has learned much from their mistakes in successive models in this series.
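To put a number on how that per-page overhead adds up, here is a back-of-the-envelope calculation. It assumes exactly 500 pages and one extra "transfer" click per page; the function name is just for illustration.

```python
def added_time_minutes(pages, extra_seconds_per_page):
    """Total extra time, in minutes, from a fixed per-page delay."""
    return pages * extra_seconds_per_page / 60


if __name__ == "__main__":
    low = added_time_minutes(500, 5)    # 5 extra seconds per page
    high = added_time_minutes(500, 10)  # 10 extra seconds per page
    print(f"{low:.0f} to {high:.0f} extra minutes")  # prints "42 to 83 extra minutes"
```

So the extra click alone costs somewhere between three quarters of an hour and nearly an hour and a half on a 500 page book.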

Handheld Scanners – I’ve never been impressed with the quality of the results from hand-held “wand” scanners. I haven’t personally checked out any of these devices in years, as I’ve largely consigned the whole category into Sharper Image / SkyMall crap-gadget territory. If someone wants to tell me that X device can quickly and accurately scan a paperback, I may look into these in the future.

Software:

Post Processing – While hardware has seen little improvement since my last review, there have been some improvements on the software side of things. The DIY Book Scanner project has yielded a plethora of scripts, tools, etc. for packaging up scans into various digital book formats. Of these, I have found a tool called Scan Tailor to be the most polished and the easiest to install and use. Scan Tailor will take a directory filled with scanned images and will straighten and deskew them, remove background and bleed-through (to give you black text on a pure white background), set a constant page size/margin, etc. Scan Tailor will work almost completely auto-magically through each step of the process, and if it does make a mistake, it is easy for the user to intervene and apply a manual correction. Scan Tailor cut my workflow from previous years of 6-8 programs and scripts (each with fussy dependencies on libraries and frameworks) down to 3. I still do some post processing of scans in IrfanView, and Scan Tailor doesn’t do OCR or the final bundling of images into PDF, DjVu, etc., but other than that it is pretty much a one stop shop for post scan image processing.

Binding – Once you have a directory full of post processed images, what are you going to do with them? I am still using Presto! PageManager 7.10 for assembling my post processed TIFF images into PDFs. It isn’t ideal in many ways, but it has the virtue of not costing me anything more and working consistently, if in a somewhat herky-jerky, liable-to-temporarily-freeze-Explorer kinda way. I played around with a half dozen PDF/DjVu binder scripts/programs recommended by the book scanning forums and basically concluded that the free options all royally sucked in one way or another, not the least of which was requiring me to install 5 different programming frameworks just to try them out. Scan Tailor is a lovely, consistent, unified application that is easy to install and use. The DIY Book Scanner community could really use something as well done for the binding stage of the process. As it is, one is left to fend with a gobbledygook of unmanageable Python and Ruby scripts feeding various Unix command line utilities and throwing an undocumented fit any time they find something not to their liking. The situation is marginally improved if you want to output DjVu files rather than PDFs, but only marginally.
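For what it's worth, the core of the binding step doesn't have to be complicated. Below is a rough sketch, not one of the forum-recommended tools, that sorts page images in natural order (so page_2 comes before page_10) and writes them out as a single multi-page PDF. It assumes the Pillow library is installed and uses its multi-page PDF writer; the function names and the 300 DPI default are my own placeholders, and it does nothing clever about compression, bookmarks or OCR layers.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders page_2 before page_10 by comparing digit runs as numbers."""
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


def bind_to_pdf(image_paths, out_pdf, dpi=300.0):
    """Bundle page images into one PDF. Requires Pillow (an assumption)."""
    from PIL import Image  # imported lazily so natural_key works without Pillow

    pages = []
    for path in sorted(image_paths, key=natural_key):
        img = Image.open(path)
        # Convert anything other than bilevel, greyscale or RGB (e.g. RGBA) first.
        if img.mode not in ("1", "L", "RGB"):
            img = img.convert("RGB")
        pages.append(img)
    if not pages:
        raise ValueError("no pages to bind")
    pages[0].save(out_pdf, "PDF", save_all=True,
                  append_images=pages[1:], resolution=dpi)
    for img in pages:
        img.close()


if __name__ == "__main__":
    # Hypothetical directory name for illustration.
    scans = list(Path("scantailor_out").glob("*.tif"))
    if scans:
        bind_to_pdf(scans, "book.pdf")
```

This keeps every page open until the PDF is written, which is fine for a sketch but may need batching for very large books.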

OCR – So, you want to turn those post processed scans into a re-flowable format like .epub for easy reading on your ebook reader device? You are kidding me, right? OCR is one of those things that has been around since the dawn of scanning and, despite a lot of protestations, seems to have changed little. If you had asked me about the state of OCR 5 or 10 years ago, I would have told you there is OmniPage & ABBYY FineReader & everything else. Today, that still seems to be the state of the industry. I tried a half dozen of the everything else variety, including OpenOCR (Cuneiform), VietOCR and TiffDjvuOCR. Most of the free solutions seem to use Tesseract, an open source OCR engine originally developed at HP and now sponsored by Google. Across 3 books with straightforward, single column formatting and commonly used fonts, I found the free OCR packages basically good enough to create a rough keyword index for searching books, but nowhere near accurate enough to create a readable, reflowable ebook without significant time spent correcting errors. I concluded I might actually be able to retype a book faster and more accurately than I could correct all the strange and easily missed errors committed by OCR. I would be curious to try the commercial packages at some point, as a lot of book scanners seem to swear by recent versions of ABBYY FineReader, but I’m not really in the mood to spend $150+ to fart around with either of the commercial offerings.
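To illustrate what "good enough for a rough keyword index" means in practice, here is a small sketch that builds a word-to-pages lookup from per-page OCR text. It assumes you already have the OCR output as plain text keyed by page number; the function names and the minimum word length are arbitrary choices of mine, not part of any OCR package.

```python
import re
from collections import defaultdict


def build_index(pages, min_length=4):
    """Map each word of at least min_length letters to the pages it appears on.

    pages is a dict of {page_number: ocr_text}.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length:
                index[word].add(page_number)
    return index


def search(index, keyword):
    """Return the sorted page numbers where keyword was recognized."""
    return sorted(index.get(keyword.lower(), ()))


if __name__ == "__main__":
    ocr_pages = {
        1: "The scanner glass was dusty.",
        2: "Dust on the scanner, and a d1rty lens.",  # "d1rty" is an OCR error
    }
    index = build_index(ocr_pages)
    print(search(index, "scanner"))  # [1, 2]
    print(search(index, "dirty"))    # [] because the OCR error hides the word
```

A misread word simply drops out of the index, which is tolerable for search since most keywords show up more than once in a book, but the same error in a reflowed ebook is right there on the page for the reader to trip over.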