I’m doing research for my anthology project (internal link). It requires scanning old newspapers and books and converting the resulting text images into manipulable files. Optical character recognition hasn’t progressed much over the years, at least at the consumer level.
Take a look at this image. It would make any OCR program throw up:
Here’s how an inexpensive program called Elucidate converted it:
“h «Ila! daalaaauaa nauaa a «Imago lav aa Ia “can?”St oIlI»aluminadnllarInnauonbaabomlrdmmum . M. at has In”. NH Mun-an hon-ta an rhamng aa mm It Iasy5Labortaam!map“hatup:lhcpm“naIamaa- !Ittll)‘ lush mleml charm-a. Mn anybody I talk Io‘ Fer ham. form Japan. \‘od amply cannot pay Irma pram.Than M an the Japan-u. mm «(mm ta In Ian M Fm panama: a on hotel Garment. Toll ma: “TunaJanina—I’d met arc-n bum-um to but at out an lud- all mm. mm ha uaa W. In all: “I’ll buy an small? ” “A; Java” beam-\me an m lama: an the Soon: Padnt. TM an a mum: 9am to: [our hundred ma my“ tam. mm”. to but) alumnus that lea lake any but a! mammal.
This wasn’t really a fair test. Even Adobe Acrobat’s $500 software would give up. But there’s a way around it, at least to a degree: read the clipping aloud using Word’s built-in dictation feature. I’m a beginner at using this tool, but the results below are what I got on a first try. Although there is a great deal to clean up, at least the problem is approachable:
Value weighs seven making a difference for us in Mexico.
Not the slipping dollar. Inflation has boosted prices just like it has here. New Mexican hotels are charging as much as you. S. Labour is still cheap what UPS the price is fantastically high interest charges
Front everybody I talk to: Forget friends. Forget Japan. You simply cannot pay those prices.
The new rich are the Japanese. Friend of mine is in town from Fiji twisting a new hotel development Area told me: quote this Japanese– I’ve never seen before–asked to look at our new 18 hole golf course. When he was through he said I’ll bite how much
Paragraph 50 Japanese businessmen are now touring all the South Pacific. They are a scouting party for 400 coming later I Just jacked up: Goodbye anything that looks like any kind of investment.
Typing out the newspaper clipping might be just as fast as rescuing Word’s results, but how much typing can you do in one day? Do you really want to type out all the articles you collect?
I’m also trying Newspapers.com (external link) for a month. Their OCR software does a pretty good job. They may have found a cleaner copy of the newspaper than what I have in the image above. Here’s what their OCR output looks like:
“Is dollar devaluation making a difference for us in Mexico?” Not the slipping dollar I n f l a t i o n h a s boosted prices just l i k e it has here. New Mexican hotels are charging as much as U. S. Labor is still cheap. What ups the prices is fantastically high interest charges. From EVERYBODY 1 talk to: Forget France. Forget Japan. You simply cannot pay those prices. The NEW rich are the .Japanese. Friend of mine is in town from Fiji pushing a new hotel development. Told me: “This Japanese — I’d n e v e r seen before — asked to look at our new 18-hole golf course. When he was through, he said: ‘I’ll buy it. How much?’ ” F i f t y J a p a n e s e businessmen are now touring ALL the South Pacific. They are a scouting party for FOUR HUNDRED more c o m i n g l a t e r . Objective: To buy ANYTHING t h a t looks like ANY kind of investment.
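Most of the remaining cleanup in that sample is the letter-spaced words ("l i k e", "n e v e r"), which probably come from justified newspaper columns. If you are comfortable running a small script, here is a rough Python sketch of one way to close those gaps. The function name is my own invention, not part of any OCR tool, and the rule it uses (three or more single letters in a row, each separated by one space) is an assumption about what the damage looks like.

```python
import re

# Assumed pattern: a run of 3+ single letters, each separated by one space,
# starting after whitespace (or at the start of the text) and not followed
# directly by another letter.
SPACED_RUN = re.compile(r"(?<!\S)(?:[A-Za-z] ){2,}[A-Za-z](?![A-Za-z])")

def collapse_spaced_letters(text: str) -> str:
    """Join letter-spaced runs such as 'l i k e' into 'like'."""
    # Remove the spaces inside each matched run; everything else is untouched.
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)

sample = "boosted prices just l i k e it has here"
print(collapse_spaced_letters(sample))
```

One limitation to expect: when two spaced-out words sit side by side, as in "I n f l a t i o n h a s", the script cannot tell where one word ends and the next begins, so it joins them into "Inflationhas". Those still need a manual split, but they are easy to spot in a spell check.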
On the positive side of things, I am getting good results by scanning and converting my documents with Scanner Pro 7 (external link), an app for the iPhone. Although it still can’t process mush like the above example, it does a much better job than my flatbed scanner. And it’s portable, of course, so I can use it easily at a library. I can even take images off my desktop computer.
Anyone out there have any OCR tricks? Let me know.