forked from pool/python-ftfy
Accepting request 613293 from devel:languages:python:misc
Try to convert buggy unicode text to a less broken variant.

OBS-URL: https://build.opensuse.org/request/show/613293
OBS-URL: https://build.opensuse.org/package/show/devel:languages:python/python-ftfy?expand=0&rev=1
23
.gitattributes
vendored
Normal file
@@ -0,0 +1,23 @@
## Default LFS
*.7z filter=lfs diff=lfs merge=lfs -text
*.bsp filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.gem filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.jar filter=lfs diff=lfs merge=lfs -text
*.lz filter=lfs diff=lfs merge=lfs -text
*.lzma filter=lfs diff=lfs merge=lfs -text
*.obscpio filter=lfs diff=lfs merge=lfs -text
*.oxt filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.rpm filter=lfs diff=lfs merge=lfs -text
*.tbz filter=lfs diff=lfs merge=lfs -text
*.tbz2 filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.ttf filter=lfs diff=lfs merge=lfs -text
*.txz filter=lfs diff=lfs merge=lfs -text
*.whl filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
1
.gitignore
vendored
Normal file
@@ -0,0 +1 @@
.osc
3
ftfy-5.3.0.tar.gz
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0ba702d5138f9b35df32b55920c9466208608108f1f3d5de1a68c17e3d68cb7f
size 53827
206
python-ftfy.changes
Normal file
@@ -0,0 +1,206 @@
-------------------------------------------------------------------
Wed May 16 16:10:48 UTC 2018 - toddrme2178@gmail.com

- Update to Version 5.3 (January 25, 2018)
  * A heuristic has been too conservative since version 4.2, causing a regression
    compared to previous versions: ftfy would fail to fix mojibake of common
    characters such as `á` when seen in isolation. A new heuristic now makes it
    possible to fix more of these common cases with less evidence.
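As a rough illustration of the kind of mojibake this entry refers to, the sketch below produces and reverses the single-character case using only the standard library. It is not ftfy's heuristic; the helper names `make_mojibake` and `naive_fix` are hypothetical, and the choice of Latin-1 as the wrong decoder is an assumption made for the example.

```python
def make_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def naive_fix(text: str) -> str:
    """Reverse the two steps; return the text unchanged if that fails.

    A real fixer needs a heuristic to decide whether reversing is safe.
    This sketch simply tries it and falls back to the input.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = make_mojibake("\u00e1")   # U+00E1 is the letter a with acute
fixed = naive_fix(broken)
```

With so little evidence (two characters), a cautious heuristic may decline to rewrite the text, which is the trade-off the entry describes.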
- Update to Version 5.2 (November 27, 2017)
  * The command-line tool will not accept the same filename as its input
    and output. (Previously, this would write a zero-length file.)
  * The `uncurl_quotes` fixer, which replaces curly quotes with straight quotes,
    now also replaces MODIFIER LETTER APOSTROPHE.
  * Codepoints that contain two Latin characters crammed together for legacy
    encoding reasons are replaced by those two separate characters, even in NFC
    mode. We formerly did this just with ligatures such as `ﬁ` and `Ĳ`, but now
    this includes the Afrikaans digraph `ŉ` and Serbian/Croatian digraphs such as
    `ǆ`.
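The two replacements described in this entry can be sketched with `str.translate`. The tables below are hypothetical and far smaller than whatever ftfy ships; they only cover the characters named in the entry plus the common curly quotes.

```python
# Hypothetical table: curly quotes and MODIFIER LETTER APOSTROPHE to ASCII.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u02bc": "'",   # MODIFIER LETTER APOSTROPHE
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})

# Hypothetical table: single codepoints that hold two Latin letters.
DIGRAPH_TABLE = str.maketrans({
    "\ufb01": "fi",        # LATIN SMALL LIGATURE FI
    "\u0132": "IJ",        # LATIN CAPITAL LIGATURE IJ
    "\u0149": "\u02bcn",   # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    "\u01c6": "d\u017e",   # LATIN SMALL LETTER DZ WITH CARON
})


def uncurl(text: str) -> str:
    """Replace curly quotes and the modifier apostrophe with straight quotes."""
    return text.translate(QUOTE_TABLE)


def split_digraphs(text: str) -> str:
    """Replace each two-letter codepoint with its two separate letters."""
    return text.translate(DIGRAPH_TABLE)
```

Because `str.translate` makes a single pass, the apostrophe produced by `split_digraphs` is only straightened if `uncurl` is applied afterwards.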
- Update to Version 5.1.1 and 4.4.3 (May 15, 2017)
- These releases fix two unrelated problems with the tests, one in each version.
  * v5.1.1: fixed the CLI tests (which are new in v5) so that they pass
    on Windows, as long as the Python output encoding is UTF-8.
  * v4.4.3: added the `# coding: utf-8` declaration to two files that were
    missing it, so that tests can run on Python 2.
- Update to Version 5.1 (April 7, 2017)
  * Removed the dependency on `html5lib` by dropping support for Python 3.2.
    We previously used the dictionary `html5lib.constants.entities` to decode
    HTML entities. In Python 3.3 and later, that exact dictionary is now in the
    standard library as `html.entities.html5`.
  * Moved many test cases about how particular text should be fixed into
    `test_cases.json`, which may ease porting to other languages.
- Update to Version 5.0.2 and 4.4.2 (March 21, 2017)
  * Added a `MANIFEST.in` that puts files such as the license file and this
    changelog inside the source distribution.
- Update to Version 5.0.1 and 4.4.1 (March 10, 2017)
- Bug fix:
  * The `unescape_html` fixer will decode entities between `&#128;` and `&#159;`
    as what they would be in Windows-1252, even without the help of
    `fix_encoding`.
    This better matches what Web browsers do, and fixes a regression that version
    4.4 introduced in an example that uses `&#133;` as an ellipsis.
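The browser behaviour mentioned in this bug fix can be observed with Python's standard `html.unescape`, which follows the HTML5 rule of reading numeric references 128 to 159 as Windows-1252 bytes. This is a sketch of the behaviour only, not of ftfy's `unescape_html`; the helper `c1_entity_as_cp1252` is hypothetical.

```python
import html


def c1_entity_as_cp1252(number: int) -> str:
    """Decode a numeric reference in the range 128..159 as a Windows-1252 byte."""
    return bytes([number]).decode("cp1252")


# The standard library applies the same mapping to numeric references.
euro = html.unescape("&#128;")
ellipsis = html.unescape("&#133;")
y_diaeresis = html.unescape("&#159;")
```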
- Update to Version 5.0 (February 17, 2017)
- Breaking changes:
  * Dropped support for Python 2. If you need Python 2 support, you should get
    version 4.4, which has the same features as this version.
  * The top-level functions require their arguments to be given as keyword
    arguments.
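Python 3 can enforce the keyword-argument requirement with a bare `*` in the signature. The function below is a hypothetical stand-in, not ftfy's real signature; its option names merely echo options mentioned elsewhere in this changelog.

```python
import unicodedata


def fix_text_sketch(text, *, uncurl_quotes=True, normalization="NFC"):
    """Hypothetical fixer: every option after the bare * is keyword-only."""
    if uncurl_quotes:
        text = text.replace("\u201c", '"').replace("\u201d", '"')
    return unicodedata.normalize(normalization, text)


ok = fix_text_sketch("\u201cx\u201d", uncurl_quotes=True)

try:
    fix_text_sketch("x", False)   # positional option: rejected by Python
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

The bare `*` syntax does not exist in Python 2, which fits with this release dropping Python 2 support.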
- Update to Version 4.4.0 (February 17, 2017)
- Heuristic changes:
  * ftfy can now fix mojibake involving the Windows-1250 or ISO-8859-2 encodings.
  * The `fix_entities` fixer is now applied after `fix_encoding`. This makes
    more situations resolvable when both fixes are needed.
  * With a few exceptions for commonly-used characters such as `^`, it is now
    considered "weird" whenever a diacritic appears in non-combining form,
    such as the diaeresis character `¨`.
  * It is also now weird when IPA phonetic letters, besides `ə`, appear next to
    capital letters.
  * These changes to the heuristics, and others we've made in recent versions,
    let us lower the "cost" for fixing mojibake in some encodings, causing them
    to be fixed in more cases.
- Update to Version 4.3.1 (January 12, 2017)
- Bug fix:
  * `remove_control_chars` was removing U+0D ('\r') prematurely. That's the
    job of `fix_line_breaks`.
- Update to Version 4.3.0 (December 29, 2016)
  * This version now depends on the `html5lib` and `wcwidth` libraries.
- Feature changes:
  * The `remove_control_chars` fixer will now remove some non-ASCII control
    characters as well, such as deprecated Arabic control characters and
    byte-order marks. Bidirectional controls are still left as is.
    This should have no impact on well-formed text, while cleaning up many
    characters that the Unicode Consortium deems "not suitable for markup"
    (see Unicode Technical Report #20).
  * The `unescape_html` fixer uses a more thorough list of HTML entities,
    which it imports from `html5lib`.
  * `ftfy.formatting` now uses `wcwidth` to compute the width that a string
    will occupy in a text console.
- Heuristic changes:
  * Updated the data file of Unicode character categories to Unicode 9, as used
    in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the
    same data.)
- Pending deprecations:
  * The `remove_bom` option will become deprecated in 5.0, because it has been
    superseded by `remove_control_chars`.
  * ftfy 5.0 will remove the previously deprecated name `fix_text_encoding`. It
    was renamed to `fix_encoding` in 4.0.
  * ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users, please
    specify `ftfy < 5` in your dependencies if you haven't already.
- Update to Version 4.2.0 (September 28, 2016)
- Heuristic changes:
  * Math symbols next to currency symbols are no longer considered 'weird' by the
    heuristic. This fixes a false positive where text that involved the
    multiplication sign and British pounds or euros (as in '5×£35') could turn
    into Hebrew letters.
  * A heuristic that used to be a bonus for certain punctuation now also gives a
    bonus to successfully decoding other common codepoints, such as the
    non-breaking space, the degree sign, and the byte order mark.
  * In version 4.0, we tried to "future-proof" the categorization of emoji (as a
    kind of symbol) to include codepoints that would likely be assigned to emoji
    later. The future happened, and there are even more emoji than we expected.
    We have expanded the range to include those emoji, too.
    ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is),
    but this expanded range should include the emoji from Unicode 9 and 10.
  * Emoji are increasingly being modified by variation selectors and skin-tone
    modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they
    fit right in with emoji, instead of being considered 'marks' as their Unicode
    category would suggest.
    This enables fixing mojibake that involves iOS's new diverse emoji.
  * An old heuristic that wasn't necessary anymore considered Latin text with
    high-numbered codepoints to be 'weird', but this is normal in languages such
    as Vietnamese and Azerbaijani. This does not seem to have caused any false
    positives, but it caused ftfy to be too reluctant to fix some cases of broken
    text in those languages.
    The heuristic has been changed, and all languages that use Latin letters
    should be on even footing now.
|
||||
- Update to Version 4.1.1 (April 13, 2016)
|
||||
* Bug fix: in the command-line interface, the `-e` option had no effect on
|
||||
Python 3 when using standard input. Now, it correctly lets you specify
|
||||
a different encoding for standard input.
|
||||
- Update to Version 4.1.0 (February 25, 2016)
|
||||
- Heuristic changes:
|
||||
* ftfy can now deal with "lossy" mojibake. If your text has been run through
|
||||
a strict Windows-1252 decoder, such as the one in Python, it may contain
|
||||
the replacement character <20> (U+FFFD) where there were bytes that are
|
||||
unassigned in Windows-1252.
|
||||
Although ftfy won't recover the lost information, it can now detect this
|
||||
situation, replace the entire lossy character with <20>, and decode the rest of
|
||||
the characters. Previous versions would be unable to fix any string that
|
||||
contained U+FFFD.
|
||||
As an example, text in curly quotes that gets corrupted `“ like this â€<C3A2>`
|
||||
now gets fixed to be `“ like this <20>`.
|
||||
* Updated the data file of Unicode character categories to Unicode 8.0, as used
|
||||
in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the
|
||||
same data.)
|
||||
* Heuristics now count characters such as `~` and `^` as punctuation instead
|
||||
of wacky math symbols, improving the detection of mojibake in some edge cases.
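The "lossy" mojibake case described above can be reproduced with the standard library alone. The sketch shows how the damage arises, under the assumption that the decoder substitutes U+FFFD for unassigned bytes; it does not show ftfy's repair logic.

```python
original = "\u201c like this \u201d"   # text wrapped in curly double quotes
raw = original.encode("utf-8")

# Python's Windows-1252 codec is strict: byte 0x9D, the last byte of the
# closing quote in UTF-8, has no assigned character.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the unassigned byte becomes U+FFFD and is lost.
lossy = raw.decode("cp1252", errors="replace")
```

The opening quote survives as three recoverable characters, while the closing quote keeps only its first two, which is why only the opening quote can be restored exactly.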
- New features:
  * A new module, `ftfy.formatting`, can be used to justify Unicode text in a
    monospaced terminal. It takes into account that each character can take up
    anywhere from 0 to 2 character cells.
  * Internally, the `utf-8-variants` codec was simplified and optimized.
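To make the "0 to 2 character cells" idea concrete, here is a rough standard-library approximation of display width. It is not `ftfy.formatting` (which, per the 4.3.0 entry, relies on `wcwidth`); the function names and the category rules are simplifying assumptions.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate terminal cells for one character: 0, 1, or 2."""
    # Combining marks and format characters occupy no cell of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters occupy two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def text_width(text: str) -> int:
    """Sum the approximate cell widths of all characters."""
    return sum(char_width(ch) for ch in text)


def ljust_display(text: str, width: int, fill: str = " ") -> str:
    """Left-justify by display width rather than by character count."""
    return text + fill * max(0, width - text_width(text))
```

Padding by `len()` instead would misalign columns whenever wide or combining characters are present.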
- Update to Version 4.0.0 (April 10, 2015)
- Breaking changes:
  * The default normalization form is now NFC, not NFKC. NFKC replaces a large
    number of characters with 'equivalent' characters, and some of these
    replacements are useful, but some are not desirable to do by default.
  * The `fix_text` function has some new options that perform more targeted
    operations that are part of NFKC normalization, such as
    `fix_character_width`, without requiring hitting all your text with the huge
    mallet that is NFKC.
  * The `remove_unsafe_private_use` parameter has been removed entirely, after
    two versions of deprecation. The function name `fix_bad_encoding` is also
    gone.
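The difference between the two normalization forms can be seen directly with `unicodedata.normalize`. This sketch only illustrates why NFKC may be undesirable as a default; the sample strings are chosen for the example.

```python
import unicodedata

sample = "x\u00b2 \uff21"   # x, SUPERSCRIPT TWO, space, FULLWIDTH LATIN CAPITAL LETTER A

nfc = unicodedata.normalize("NFC", sample)     # leaves both characters alone
nfkc = unicodedata.normalize("NFKC", sample)   # flattens them to "2" and "A"

# Both forms still compose a base letter with its combining accent.
composed = unicodedata.normalize("NFC", "e\u0301")
```

Turning the fullwidth letter into a plain `A` is often welcome, whereas turning a superscript two into a plain `2` changes the meaning of the text.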
- New features:
  * Fixers for strange new forms of mojibake, including particularly clear cases
    of mixed UTF-8 and Windows-1252.
  * New heuristics, so that ftfy can fix more stuff, while maintaining
    approximately zero false positives.
  * The command-line tool trusts you to know what encoding your *input* is in,
    and assumes UTF-8 by default. You can still tell it to guess with the `-g`
    option.
  * The command-line tool can be configured with options, and can be used as a
    pipe.
  * Recognizes characters that are new in Unicode 7.0, as well as emoji from
    Unicode 8.0+ that may already be in use on iOS.
- Deprecations:
  * `fix_text_encoding` is being renamed again, for conciseness and consistency.
    It's now simply called `fix_encoding`. The name `fix_text_encoding` is
    available but emits a warning.
- Pending deprecations:
  * Python 2.6 support is largely coincidental.
  * Python 2.7 support is on notice. If you use Python 2, be sure to pin a
    version of ftfy less than 5.0 in your requirements.

- Implement single-spec version

-------------------------------------------------------------------
Mon Jul 13 13:12:38 UTC 2015 - toddrme2178@gmail.com

- Fix building on SLES 11

-------------------------------------------------------------------
Thu May 7 07:07:50 UTC 2015 - jweberhofer@weberhofer.at

- Use the tar-ball from pypi.python.org

-------------------------------------------------------------------
Mon May 4 15:04:36 UTC 2015 - jweberhofer@weberhofer.at

- Updated to version 3.4.0

  * ftfy.fixes.fix_surrogates will fix all 16-bit surrogate codepoints, which
    would otherwise break various encoding and output functions.

  * remove_unsafe_private_use emits a warning, and will disappear in the next
    minor or major version.

- Updated to version 3.3.1

  * restores compatibility with Python 2.6.

-------------------------------------------------------------------
Mon Aug 18 12:59:42 UTC 2014 - jweberhofer@weberhofer.at

- Initial RPM package for version 3.3.0
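The 3.4.0 entry in the changelog above mentions 16-bit surrogate codepoints that break encoding functions. The sketch below shows the problem and one standard-library way to join a surrogate pair into the character it stands for. The helper `join_surrogates` is hypothetical and is not the code of `ftfy.fixes.fix_surrogates`.

```python
def join_surrogates(text: str) -> str:
    """Join UTF-16 surrogate pairs stored as separate codepoints.

    The 'surrogatepass' handler lets the surrogates through on encoding;
    decoding the resulting UTF-16 then reads each pair as one character.
    An unpaired surrogate becomes U+FFFD because of errors='replace'.
    """
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", errors="replace")


pair = "\ud83d\udca9"   # two surrogate codepoints, not one character

# Strings like this cannot be written out as UTF-8.
try:
    pair.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

joined = join_surrogates(pair)
```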
75
python-ftfy.spec
Normal file
@@ -0,0 +1,75 @@
#
# spec file for package python-ftfy
#
# Copyright (c) 2018 SUSE LINUX GmbH, Nuernberg, Germany.
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via http://bugs.opensuse.org/


%{?!python_module:%define python_module() python-%{**} python3-%{**}}
%define skip_python2 1
Name:           python-ftfy
Version:        5.3.0
Release:        0
License:        MIT
Summary:        Fixes some problems with Unicode text after the fact
Url:            http://github.com/LuminosoInsight/python-ftfy
Group:          Development/Languages/Python
Source:         https://files.pythonhosted.org/packages/source/f/ftfy/ftfy-%{version}.tar.gz
BuildRequires:  %{python_module devel}
BuildRequires:  %{python_module setuptools}
BuildRequires:  fdupes
BuildRequires:  python-rpm-macros
# SECTION test requirements
BuildRequires:  %{python_module nose}
BuildRequires:  %{python_module pytest}
BuildRequires:  %{python_module pytest-runner}
BuildRequires:  %{python_module wcwidth}
# /SECTION
Requires:       python-wcwidth
BuildArch:      noarch

%python_subpackages

%description
Ftfy makes Unicode text less broken and more consistent.

The most interesting kind of brokenness that this resolves
is when someone has encoded Unicode with one standard and
decoded it with a different one.


%prep
%setup -q -n ftfy-%{version}

%build
%python_build

%install
%python_install
%python_expand %fdupes %{buildroot}%{$python_sitelib}

%check
%{python_expand export PYTHONDONTWRITEBYTECODE=1
export LANG=en_US.UTF-8
export PYTHONPATH=%{buildroot}%{$python_sitelib}
export PATH="$PATH:%{buildroot}%{_bindir}"
py.test-%{$python_bin_suffix}
}

%files %{python_files}
%doc CHANGELOG.md README.md
%license LICENSE.txt
%python3_only %{_bindir}/ftfy
%{python_sitelib}/*

%changelog