OBS-URL: https://build.opensuse.org/package/show/devel:languages:python/python-ocrmypdf?expand=0&rev=2
-------------------------------------------------------------------
Wed Nov 6 14:57:33 UTC 2024 - Matej Cepl <mcepl@cepl.eu>

- Update to 16.6.0:
  - Fixed an issue where damaged PDFs would fail with --redo-ocr.
    :issue:`1403`
  - Fixed an error that prevented JBIG2 optimization on Windows
    if the image was optimized in an earlier step. :issue:`1396`
  - Fixed an error detecting the version of unpaper 7.0.0.
    :issue:`1409`
  - Fixed a performance regression when scanning pages.
    :issue:`1378`. Thanks @aliemjay.
  - Fixed Alpine Docker image by enforcing Alpine 3.19. Alpine
    3.20 includes a defective version of Tesseract OCR and so is
    not usable.
  - Upgraded Ubuntu Docker image to use Ubuntu 24.04.
  - Build and test scripts/actions switched to uv.
  - When running in a container, we now remind the user that
    temporary folders are inside the container and may not be
    accessible.
  - Fixed Linux test coverage matrix, which was missing some key
    versions.
- Update to 16.5.0:
  - Fixed issue with interpreting PDFs that have images with
    array masks. :issue:`1377`
  - Enabled testing on Python 3.13.
  - Fixed a test that did not work correctly but still passed.
    :issue:`1382`
  - Improved "PDF/A conversion failed" warning message to better
    describe implications.
  - Updated documentation to better explain OCR_JSON_SETTINGS in
    batch processing.
  - Build backend changed from setuptools to hatchling.
- Update to 16.4.3:
  - Work around pdfminer.six issue where a token on the buffer
    boundary is incorrectly parsed as two tokens. :issue:`1361`
  - New rules are applied to stencil masks and explicit masks
    when calculating the optimal page DPI for rendering.
    :issue:`1362`
  - Fixed attempts to use an incompatible jbig2.EXE provided by
    TeX Live. :issue:`1363`
- Update to 16.4.2:
  - Fixed order of filenames passed to Ghostscript for PDF/A
    generation. :issue:`1359`
  - Suppressed missing jbig2dec warning message. :issue:`1358`
  - Fixed calculation of image size when soft mask dimensions
    don't match image dimension. :issue:`1351`
  - Several fixes to documentation. Thanks to users Iris and
    JoKalliauer who contributed these changes.
  - Fixed error on processing PDFs that are missing certain image
    metadata. :issue:`1315`
- Update to 16.4.1:
  - Fixed calculation of image printed area (used in finding
    weighted DPI for OCR). :issue:`1334`
  - Fixed "NotImplementedError: not sure how to get colorspace"
    error messages in logs, which simply record a failure
    to optimize images with print production colorspaces.
    :issue:`1315`
- Update to 16.4.0:
  - Selecting the osd and equ pseudo-languages with -l/--language
    now exits with an error when using Tesseract OCR, because
    these are not regular Tesseract languages but internal
    implementation details. Using them can cause Tesseract to crash.
  - The hOCR renderer is more tolerant of extra whitespace in
    input files.
  - watcher.py now changes the output file extension to .pdf when
    the input is not .pdf.
  - Improved handling of PDFs that contain circularly referenced
    Form XObjects. :issue:`1321`
  - Fixed Alpine Docker image for ARM64, which was not building
    correctly.
  - Docker images now use pikepdf 9.0.0.
  - Prevent use of Tesseract OCR 5.4.0, a version with known
    regressions.
  - Disabled progress bar for "Linearizing" when --no-progress-bar
    is set.
  - Fixed some tests that warn about missing JBIG2 decoding via
    pikepdf, by installing the necessary libraries during tests.
- Update to 16.3.1:
  - Fixed a test suite failure with Ghostscript 10.03.0+.
    :issue:`1316`
  - Fixed an issue with the presentation of the "OCR" progress
    bar. :issue:`1313`
- Update to 16.3.0:
  - Fixed progress bar not displaying for Ghostscript PDF/A
    conversion. :issue:`1313`
  - Added progress bar for linearization. :issue:`1313`
  - If --rotate-pages-threshold is issued without --rotate-pages,
    we now exit with an error since the user likely intended to
    use --rotate-pages. :issue:`1309`
  - If Tesseract hOCR gives an invalid line box, print an error
    message instead of exiting with an error. :issue:`1312`
- Update to 16.2.0:
  - Fixed issue 'NoneType' object has no attribute 'get' when
    optimizing certain PDFs. :issue:`1293,1271`
  - Switched formatting from black to ruff.
  - Added support for sending sidecar output to io.BytesIO.
  - Added support for converting HEIF/HEIC images (the native
    image format of iPhones and some other devices) to PDFs, when
    the appropriate pi-heif library is installed. This library is
    marked as a dependency, but maintainers may opt out if
    needed.
  - We now default to downsampling large images that would
    exceed Tesseract's internal limits, but only if they would
    cause processing to fail. Previously, this behavior only
    occurred if specifically requested on command line. It can
    still be configured and disabled. See the --tesseract command
    line options.
  - Added MacPorts install instructions. Thanks @akierig.
  - Improved logging output when an unexpected error occurs while
    trying to obtain the version of a third party program.
- Update to 16.1.2:
  - Fixed test suite failure when using Ghostscript 10.3.
  - Other minor corrections.
- Update to 16.1.1:
  - Fixed PyPy 3.10 support.
- Update to 16.1.0:
  - Improved hOCR renderer is now default for left-to-right
    languages.
  - Improved handling of rotated pages. Previously, OCR text
    might be missing for pages that were rotated with a /Rotate
    tag on the page entry.
  - Improved handling of cropped pages. Previously, in some
    cases a page with a crop box would not have its OCR applied
    correctly and misalignment between OCR text and visible text
    could occur.
  - Documentation improvements, especially installation
    instructions for less common platforms.

-------------------------------------------------------------------
Mon Jan 8 15:26:44 UTC 2024 - ecsos <ecsos@opensuse.org>

- Update to 16.0.4
  - Fixed some issues for left-to-right text with the new hOCR renderer.
    It is not the default yet but will be made so soon.
    Right-to-left text is still in progress.
  - Added an error to prevent use of several versions of Ghostscript
    that seem to corrupt existing text in input PDFs.
    Newly generated OCR is not affected.
    For best results, use Ghostscript 10.02.1 or newer,
    which contains the fix for the issue.

-------------------------------------------------------------------
Thu Jan 4 10:05:05 UTC 2024 - ecsos <ecsos@opensuse.org>

- Update to 16.0.3
  - Changed minimum required Ghostscript to 9.54, to support users of RHEL 9 and its derivatives,
    since that is the latest version available there.
  - Removed warning message about CVE-2023-43115, on the assumption that most distributions have backported the patch by now.
- Changes from 16.0.2
  - Temporarily changed PDF text renderer back to sandwich by default to address regressions in macOS Preview.
- Changes from 16.0.1
  - Fixed text rendering issue with new hOCR text renderer - extraneous byte order marks.
  - Tightened dependencies.
- Changes from 16.0.0
  - Added OCR text renderer, combining the best ideas of Tesseract's PDF generator and the older hOCR transformer renderer.
    The result is a hopefully permanent fix for wordssmushedtogetherwithoutspaces issues in extracted text, better
    registration/position of text on skewed baselines :issue:`1009`, fixes to character output when the German Fraktur script
    is used :issue:`1191`, and proper rendering of right-to-left languages (Arabic, Hebrew, Persian) :issue:`1157`.
    Asian languages may still have excessive word breaks compared to expectations. The new renderer is the default;
    the old sandwich renderer is still available using --pdf-renderer sandwich; the old hOCR renderer is no more.
  - The ocrmypdf.hocrtransform API has changed substantially.
  - Support for Python 3.9 has been dropped. Python 3.10+ is now required.
  - pikepdf >= 8.8.0 is now required.

-------------------------------------------------------------------
Fri Dec 15 08:32:05 UTC 2023 - ecsos <ecsos@opensuse.org>

- Initial version 15.4.4