-------------------------------------------------------------------
Tue Feb 15 08:42:30 UTC 2022 - Dirk Müller

- update to 2.0.12:
  * ASCII misdetection in rare cases (PR #170)
  * Explicit support for Python 3.11 (PR #164)
  * The logging behavior has been completely reviewed, now using
    only TRACE and DEBUG levels

-------------------------------------------------------------------
Mon Jan 10 23:01:54 UTC 2022 - Dirk Müller

- update to 2.0.10:
  * Fallback match entries might lead to UnicodeDecodeError for
    large byte sequences
  * Skipping the language detection (CD) on ASCII

-------------------------------------------------------------------
Mon Dec 6 20:08:41 UTC 2021 - Dirk Müller

- update to 2.0.9:
  * Moderating the logging impact (since 2.0.8) for specific
    environments
  * Wrong logging level applied when setting kwarg `explain` to
    True

-------------------------------------------------------------------
Mon Nov 29 11:14:37 UTC 2021 - Dirk Müller

- update to 2.0.8:
  * Improved Vietnamese detection
  * MD improvement on trailing data and long foreign (non-pure
    Latin) data
  * Efficiency improvements in cd/alphabet_languages
  * Call sum() without an intermediary list, following PEP 289
    recommendations
  * Code style as refactored by Sourcery-AI
  * Minor adjustment of the MD around European words
  * Remove and replace SRTs from assets / tests
  * Initialize the library logger with a `NullHandler` by default
  * Setting kwarg `explain` to True will provisionally add a
    dedicated logging handler
  * Fix large (misleading) sequences giving UnicodeDecodeError
  * Avoid using chunks that are too insignificant
  * Add and expose function `set_logging_handler` to configure a
    specific StreamHandler

-------------------------------------------------------------------
Fri Nov 26 11:35:25 UTC 2021 - Dirk Müller

- require lower-case name instead of breaking the build

-------------------------------------------------------------------
Thu Nov 25 22:26:52 UTC 2021 - Matej Cepl

- Use lower-case name of the prettytable package

-------------------------------------------------------------------
Sun Oct 17 14:01:59 UTC 2021 - Martin Hauke

- Update to version 2.0.7
  * Addition: Add support for Kazakh (Cyrillic) language detection
  * Improvement: Further improve inferring the language from a
    given code page (single-byte).
  * Removed: Remove redundant logging entry about detected
    language(s).
  * Improvement: Refactoring for potential performance improvements
    in loops.
  * Improvement: Various detection improvements (MD+CD).
  * Bugfix: Fix a minor inconsistency between Python 3.5 and other
    versions regarding language detection.
- Update to version 2.0.6
  * Bugfix: Unforeseen regression with the loss of backward
    compatibility with some older minor versions of Python 3.5.x.
  * Bugfix: Fix CLI crash when using --minimal output in certain
    cases.
  * Improvement: Minor improvement to the detection efficiency
    (less than 1%).
- Update to version 2.0.5
  * Improvement: The BC support with v1.x was improved; the old
    staticmethods are restored.
  * Remove: The project no longer raises a warning on tiny content
    given for detection; it is simply logged as a warning instead.
  * Improvement: The Unicode detection is slightly improved, see #93
  * Bugfix: In some rare cases, the chunks extractor could cut in
    the middle of a multi-byte character and could mislead the mess
    detection.
  * Bugfix: Some rare 'space' characters could trip up the
    UnprintablePlugin/Mess detection.
  * Improvement: Add syntax sugar __bool__ for the CharsetMatches
    results list-container.
- Update to version 2.0.4
  * Improvement: Adjust the MD to lower the sensitivity, thus
    improving the global detection reliability.
  * Improvement: Allow fallback on a specified encoding if any.
  * Bugfix: The CLI no longer raises an unexpected exception when
    no encoding has been found.
  * Bugfix: Fix accessing the 'alphabets' property when the payload
    contains surrogate characters.
  * Bugfix: The logger could mislead (explain=True) on detected
    languages and the impact of one MBCS match (in #72)
  * Bugfix: Submatch factoring could be wrong in rare edge cases
    (in #72)
  * Bugfix: Multiple files given to the CLI were ignored when
    publishing results to STDOUT (after the first path) (in #72)
  * Internal: Fix line endings from CRLF to LF for certain files.
- Update to version 2.0.3
  * Improvement: Part of the detection mechanism has been improved
    to be less sensitive, resulting in more accurate detection
    results, especially for ASCII. #63 Fix #62
  * Improvement: According to the community's wishes, the detection
    will fall back on ASCII or UTF-8 as a last resort.
- Update to version 2.0.2
  * Bugfix: Empty/too small JSON payload misdetection fixed.
  * Improvement: Don't inject unicodedata2 into sys.modules
- Update to version 2.0.1
  * Bugfix: Make it work where there isn't a filesystem available,
    dropping the assets frequencies.json.
  * Improvement: You may now use aliases in the cp_isolation and
    cp_exclusion arguments.
  * Bugfix: Using explain=False permanently disabled the verbose
    output in the current runtime #47
  * Bugfix: One log entry (language target preemptive) was not
    shown in logs when using explain=True #47
  * Bugfix: Fix undesired exception (ValueError) on getitem of a
    CharsetMatches instance #52
  * Improvement: Public function normalize default argument values
    were not aligned with from_bytes #53
- Update to version 2.0.0
  * Performance: 4x to 5x faster than the previous 1.4.0 release.
  * Performance: At least 2x faster than Chardet.
  * Performance: Emphasis has been put on UTF-8 detection; it
    should perform almost instantaneously.
  * Improvement: The backward compatibility with Chardet has been
    greatly improved. The legacy detect function returns an
    identical charset name whenever possible.
  * Improvement: The detection mechanism has been slightly
    improved; Turkish content is now detected correctly (most of
    the time).
  * Code: The program has been rewritten to improve readability and
    maintainability. (+Using static typing)
  * Tests: New workflows are now in place to verify the following
    aspects: performance, backward compatibility with Chardet, and
    detection coverage, in addition to the current tests. (+CodeQL)
  * Dependency: This package no longer requires anything when used
    with Python 3.5 (dropped cached_property)
  * Docs: Performance claims, the contributing guide, and the issue
    template have been updated.
  * Improvement: Add --version argument to CLI
  * Bugfix: The CLI output used the relative path of the file(s);
    it should be absolute.
  * Deprecation: Methods coherence_non_latin, w_counter,
    chaos_secondary_pass of the class CharsetMatch are now
    deprecated and scheduled for removal in v3.0
  * Improvement: If no language was detected in the content, try to
    infer it using the encoding name/alphabets used.
  * Removal: Removed support for these languages: Catalan,
    Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk,
    Macedonian, and Serbo-Croatian.
  * Improvement: utf_7 detection has been reinstated.
  * Removal: The exception hook on UnicodeDecodeError has been
    removed.
- Update to version 1.4.1
  * Improvement: Logger configuration/usage no longer conflicts
    with others #44
- Update to version 1.4.0
  * Dependency: Using standard logging instead of the loguru
    package.
  * Dependency: Dropping the nose test framework in favor of the
    maintained pytest.
  * Dependency: Chose not to use the dragonmapper package to help
    with gibberish Chinese/CJK text.
  * Dependency: Require cached_property only for Python 3.5 due to
    a constraint; dropping it for every other interpreter version.
  * Bugfix: The BOM marker in a CharsetNormalizerMatch instance
    could be False in rare cases even if obviously present, due to
    the sub-match factoring process.
  * Improvement: Return ASCII if the given sequences fit.
  * Performance: Huge improvement on the largest payloads.
  * Change: Stop support for UTF-7 that does not contain a SIG.
    (Contributions are welcome to improve that point)
  * Feature: CLI now produces JSON-consumable output.
  * Dependency: Dropping PrettyTable, replaced with pure JSON
    output.
  * Bugfix: Not searching properly for the BOM when trying the
    utf32/16 parent codec.
  * Other: Improving the final package size by compressing
    frequencies.json.

-------------------------------------------------------------------
Thu May 20 09:46:56 UTC 2021 - pgajdos@suse.com

- version update to 1.3.9
  * Bugfix: In some very rare cases, you may end up getting
    encode/decode errors due to a bad bytes payload #40
  * Bugfix: An empty payload given for detection may cause an
    exception when trying to access the alphabets property. #39
  * Bugfix: The legacy detect function should return UTF-8-SIG if a
    SIG is present in the payload. #38

-------------------------------------------------------------------
Tue Feb 9 00:47:34 UTC 2021 - John Vandenberg

- Switch to PyPI source
- Add Suggests: python-unicodedata2
- Remove executable bit from
  charset_normalizer/assets/frequencies.json
- Update to v1.3.6
  * Allow prettytable 2.0
- from v1.3.5
  * Dependencies refactor and add support for py 3.9 and 3.10
  * Fix version parsing

-------------------------------------------------------------------
Mon May 25 10:59:12 UTC 2020 - Petr Gajdos

- %python3_only -> %python_alternative

-------------------------------------------------------------------
Mon Jan 27 09:09:27 UTC 2020 - Marketa Calabkova

- Update to 1.3.4
  * Improvement/Bugfix: False positive when searching for
    successive upper, lower chars
    (ProbeChaos)
  * Improvement: Noticeably better detection for Japanese
  * Bugfix: Passing zero-length bytes to from_bytes
  * Improvement: Expose version in package
  * Bugfix: Division by zero
  * Improvement: Prefers Unicode (UTF-8) when detected
  * Python 2 support was apparently dropped silently

-------------------------------------------------------------------
Fri Oct 4 08:52:51 UTC 2019 - Marketa Calabkova

- Update to 1.3.0
  * Backport unicodedata v12 implementation into Python if
    available
  * Add aliases to the CharsetNormalizerMatches class
  * Add feature preemptive behaviour, looking for an encoding
    declaration
  * Add method to determine if a specific encoding is multi-byte
  * Add has_submatch property on a match
  * Add percent_chaos and percent_coherence
  * Coherence ratio based on the mean instead of the sum of the
    best results
  * Using loguru for trace/debug <3
  * from_bytes method improved

-------------------------------------------------------------------
Thu Sep 26 10:35:51 UTC 2019 - Tomáš Chvátal

- Update to 1.1.1:
  * from_bytes parameters steps and chunk_size were not adapted to
    the sequence length if the provided values did not fit the
    content
  * Sequences with a length below 10 chars were not checked
  * Legacy detect method inspired by chardet was not returning
  * Various more test updates

-------------------------------------------------------------------
Fri Sep 13 11:05:06 UTC 2019 - Tomáš Chvátal

- Update to 0.3:
  * Improvement on detection
  * Some performance loss is to be expected
  * Added --threshold option to CLI
  * Bugfix on UTF-7 support
  * Legacy detect(byte_str) method
  * BOM support (Unicode mostly)
  * Chaos prober improved on small text
  * Language detection has been reviewed to give better results
  * Bugfix on Japanese detection; every Japanese text was
    considered chaotic

-------------------------------------------------------------------
Fri Aug 30 00:46:27 UTC 2019 - Tomáš Chvátal

- Fix the tarball to really be the one published by upstream

-------------------------------------------------------------------
Tue Aug 28 06:29:02 PM UTC 2019 - John Vandenberg

- Initial spec for v0.1.8