Package Details: ocrmypdf 11.3.1-1

Git Clone URL: https://aur.archlinux.org/ocrmypdf.git (read-only)
Package Base: ocrmypdf
Description: A tool to add an OCR text layer to scanned PDF files, allowing them to be searched
Upstream URL: https://github.com/jbarlow83/OCRmyPDF
Licenses: MPL2
Submitter: dreuter
Maintainer: fbrennan (pigmonkey)
Last Packager: pigmonkey
Votes: 49
Popularity: 3.25
First Submitted: 2014-01-27 11:36
Last Updated: 2020-10-28 22:19

Latest Comments

ginkel commented on 2020-10-26 10:56

ocrmypdf currently fails to work with the recently updated python-pdfminer package. Downgrading the package to python-pdfminer-20200726-1 works around the issue for now.

pkg_resources.DistributionNotFound: The 'pdfminer.six!=20200720,<=20200726,>=20191110' distribution was not found and is required by ocrmypdf
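
To see why the pin in that error rejects a newer pdfminer.six, here is a minimal sketch that evaluates the same requirement string. It assumes the date-style versions can be compared as plain integers, so it is not a general PEP 440 implementation; the helper name `satisfies` is made up for illustration, and 20201018 is used only as an example of a later version.

```python
import operator

# Comparison operators that can appear in the requirement string.
# Two-character symbols come first so "<=" is matched before "<".
_OPS = {
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def satisfies(version: str, spec: str) -> bool:
    """Return True if a date-style version meets every clause in spec.

    Assumption: versions are plain integers such as 20200726.
    """
    value = int(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPS.items():
            if clause.startswith(symbol):
                if not compare(value, int(clause[len(symbol):])):
                    return False
                break
        else:
            # No known operator matched this clause.
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Requirement string copied from the error message above.
SPEC = "!=20200720,<=20200726,>=20191110"

print(satisfies("20200726", SPEC))  # True: inside the allowed range
print(satisfies("20200720", SPEC))  # False: explicitly excluded
print(satisfies("20201018", SPEC))  # False: newer than the upper bound
```

Any version dated after 20200726 fails the `<=` clause, which matches the report that downgrading to 20200726 works around the problem.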

pigmonkey commented on 2020-10-19 12:42

I still use the package, so I'm happy to continue updating or to step back. No preference.

fbrennan commented on 2020-10-18 23:02

Hello all.

I'm back to using Arch if pigmonkey no longer wants to maintain this package. :-)

But I think they've done a good job, so I can also just give them the package. I can also do nothing, but now that I'm back, it may be confusing who is responsible for pushing updates.

Which would you prefer?

pigmonkey commented on 2020-10-14 22:36

tesseract-data-osd is included with the standard tesseract Arch package.

Looking at the "Required By" section of the tesseract-data-eng package, it does not appear that it is common for other Arch packages to list it as a dependency.

If this is confusing for users, I think it would be acceptable to add it as an optional dependency, so that there is an indication at the end of the install that another package might be needed. But it may be weird for non-English speakers if the package has an optional dependency on the English language pack, but not whatever data pack is needed for the user's native language. I don't really want a 106 item optdepends array for every possible language pack.

jbarlow commented on 2020-10-14 07:07

OCRmyPDF assumes English unless a language is specified with -l fra for example. So strictly speaking it works, but you have to issue the option every time. The test suite also assumes English is installed. I believe most package managers have added an explicit dependency on tesseract-data-eng or whatever it's called in the system, but some have not.

I did poll users whether to default to the system language based on locale, but surprisingly non-English users didn't like the idea.

OCRmyPDF does assume tesseract-data-osd is installed so that should be a dependency if Arch breaks that out as a separate package.
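
For anyone scripting around the English default, a small sketch of passing the language explicitly on each call might look like this. The helper name `build_ocrmypdf_args` is hypothetical; the only ocrmypdf option it relies on is the `-l` flag mentioned above.

```python
from typing import List, Optional


def build_ocrmypdf_args(
    input_pdf: str, output_pdf: str, language: Optional[str] = None
) -> List[str]:
    """Build an ocrmypdf argument list, adding -l only when a language is given."""
    args = ["ocrmypdf"]
    if language:
        # Without -l, OCRmyPDF falls back to English.
        args += ["-l", language]
    args += [input_pdf, output_pdf]
    return args


print(build_ocrmypdf_args("test.pdf", "test2.pdf", "fra"))
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a system where ocrmypdf and the matching Tesseract data package are installed.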

pigmonkey commented on 2020-10-13 16:51

Tesseract does require a data package to be installed, but it does not have to be English. If a language is not specified, Tesseract does assume English, hence the error.

I don't think it's appropriate to include tesseract-data-eng as a dependency since that might not be the user's language.

ioan commented on 2020-10-13 13:45

$ ocrmypdf test.pdf test2.pdf
Tesseract failed to report available languages. Output from Tesseract:

Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
List of available languages (1):
osd

looks like it needs eng data by default
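
A quick way to check what is installed before running ocrmypdf is to look for `*.traineddata` files in the tessdata directory. The sketch below assumes the `/usr/share/tessdata` path from the error message; both function names are made up for illustration.

```python
from pathlib import Path
from typing import List


def installed_languages(tessdata_dir: str = "/usr/share/tessdata") -> List[str]:
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    # eng.traineddata -> "eng", osd.traineddata -> "osd"
    return sorted(path.stem for path in directory.glob("*.traineddata"))


def can_run_without_language_flag(languages: List[str]) -> bool:
    """With no -l option English is assumed, so "eng" must be present."""
    return "eng" in languages


if __name__ == "__main__":
    found = installed_languages()
    print("Installed:", found)
    print("Works without -l:", can_run_without_language_flag(found))
```

On a system that only has `osd`, as in the output above, the second check reports False, which is the situation that produces the error.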

jorges commented on 2020-08-05 19:49

Thanks for the explanation! I just got rid of python-pdfminer.six from AUR and downgraded python-pdfminer to 20200517-1. OCRmyPDF works and all is well!

pigmonkey commented on 2020-07-29 17:39

It's a little convoluted, but here is what I think is happening:

The confusingly named python-pdfminer from community that we use is in fact python-pdfminer.six. You can verify that by looking at its PKGBUILD. The AUR python-pdfminer.six is basically the same package, except it pulls from PyPI instead of GitHub and is on an outdated version (20200124 instead of community's 20200720).

OCRmyPDF claims to support 20200720, but that version of python-pdfminer{,.six} dropped PDFTextExtractionNotAllowed. This apparently was unintentional and has been reversed in 20200726. But as of now 20200726 has not been officially tagged.

So, we need to wait for upstream python-pdfminer.six to make 20200726 official, and then wait for the community maintainer to update the python-pdfminer package to 20200726. And then we need to wait for upstream OCRmyPDF to release a new version that notes support for 20200726. Then I can update this package and everything will be copacetic.

In the meantime, you can downgrade the community python-pdfminer package to the previous version, or run the much older version provided by the AUR python-pdfminer.six package.
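
To tell whether the installed pdfminer still provides the name OCRmyPDF imports, something like the following sketch can be used. The helper `module_exports` is hypothetical; it only wraps the standard `importlib.import_module` and `hasattr`, and reports False when the module is missing altogether.

```python
import importlib


def module_exports(module_name: str, attribute: str) -> bool:
    """Return True if module_name can be imported and defines attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its parents) is not installed.
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    ok = module_exports("pdfminer.pdfdocument", "PDFTextExtractionNotAllowed")
    print("PDFTextExtractionNotAllowed available:", ok)
```

If this prints False while pdfminer is installed, the installed version is presumably one that dropped the name, and downgrading as described above should apply.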

jorges commented on 2020-07-29 11:18

I was getting the traceback shown below with python-pdfminer. I was able to solve the problem by removing that package and installing python-pdfminer.six. If other people can confirm this, maybe the package dependencies have to be changed?

$ ocrmypdf 
Traceback (most recent call last):
  File "/usr/bin/ocrmypdf", line 33, in <module>
    sys.exit(load_entry_point('ocrmypdf==10.3.1', 'console_scripts', 'ocrmypdf')())
  File "/usr/bin/ocrmypdf", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/lib/python3.8/site-packages/ocrmypdf/__init__.py", line 21, in <module>
    from ocrmypdf import helpers, hocrtransform, leptonica, pdfa, pdfinfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/__init__.py", line 19, in <module>
    from ocrmypdf.pdfinfo.info import Colorspace, Encoding, PdfInfo
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/info.py", line 37, in <module>
    from ocrmypdf.pdfinfo.layout import get_page_analysis, get_text_boxes
  File "/usr/lib/python3.8/site-packages/ocrmypdf/pdfinfo/layout.py", line 29, in <module>
    from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfdocument' (/usr/lib/python3.8/site-packages/pdfminer/pdfdocument.py)
(ins)[jscandal@lhasa .aur_bb]$ python
Python 3.8.5 (default, Jul 27 2020, 08:42:51) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(ins)>>> from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PDFTextExtractionNotAllowed' from 'pdfminer.pdfdocument' (/usr/lib/python3.8/site-packages/pdfminer/pdfdocument.py)