Fedora, while including a comprehensive set of
tesseract rpms, doesn’t have an equivalent of
tesseract-python, so I needed something that was easy to build and import.
However, quick Google searches offer several solutions that now look inappropriate:
pip install tesseract appears to contain a complete, old tesseract build (it’s 40MB)
pip install pytesseract is GPL3, which is inappropriate for my use-case
using tesseract directly via command-line embedding
However, the tesseract project’s wiki pages on GitHub indicate that there are several
other choices available, and I (somewhat arbitrarily) chose
tesserocr, which is MIT licensed and
has a fairly comprehensive API into the ‘raw’
tesseract C/C++ code.
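To give a feel for that API, here is a minimal sketch of reading one image file with tesserocr. It assumes tesserocr has been installed (for example via pip, which in turn needs the tesseract and leptonica development headers and the English language data). The helper name ocr_file is hypothetical; PyTessBaseAPI, SetImageFile, GetUTF8Text and AllWordConfidences are taken from the tesserocr README.

```python
def ocr_file(path, lang="eng"):
    """Return (text, per-word confidences) for a single image file.

    ocr_file is an illustrative helper, not part of tesserocr itself.
    """
    # Imported lazily, so this module can be loaded on a machine
    # where tesserocr has not been built yet
    from tesserocr import PyTessBaseAPI

    # The context manager releases the underlying C++ API object on exit
    with PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(path)
        text = api.GetUTF8Text()
        confidences = api.AllWordConfidences()
    return text, confidences
```

Calling ocr_file("sample.png") (a placeholder filename) should then give back the recognised text together with one confidence value per word.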
The following also includes hints for interoperability with other image-processing code, which
can be helpful in cleaning up the image prior to text extraction. It’s useful to pre-clean, even though
tesseract itself does some cleaning, because there’s often application-specific knowledge that can be used more effectively than
tesseract’s generic methods.
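As an illustration of what pre-cleaning with application-specific knowledge might look like, here is a rough sketch that assumes (purely for the example) dark text on a lighter background, surrounded by a border of known width. It uses plain NumPy and Pillow only because they are easy to show; the function names preclean and ocr_array and their parameters are hypothetical, while tesserocr.image_to_text is the helper documented in the tesserocr README.

```python
import numpy as np


def preclean(gray, border=0, threshold=None):
    """Crop a known border and binarise a greyscale image (2-D array).

    border and threshold stand in for application-specific knowledge;
    both are hypothetical parameters for this sketch.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    if border > 0:
        # Drop `border` pixels from every edge
        gray = gray[border:-border, border:-border]
    if threshold is None:
        # Fall back to the midpoint between the darkest and lightest pixels
        threshold = (int(gray.min()) + int(gray.max())) / 2.0
    # Dark ink becomes 0, lighter background becomes 255
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


def ocr_array(clean, lang="eng"):
    """Hand a cleaned array to tesserocr via a PIL image."""
    # Lazy imports: Pillow and tesserocr are only needed for this step
    from PIL import Image
    import tesserocr

    return tesserocr.image_to_text(Image.fromarray(clean), lang=lang)
```

The point of splitting the two steps is that preclean can encode whatever is known about the particular images, while ocr_array stays a thin wrapper that passes the result on to tesseract.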