In general, decode_it is used to get a str from an arbitrary bytes
instance. For this, decode_it used the chardet module (if present)
to detect the underlying encoding (if the bytes instance corresponds
to a "supported" encoding). The drawback of this detection is that
it can take quite some time in case of a large bytes instance, which
represents no "supported" encoding (see #669 and #746).
Instead of doing a potentially "time consuming" detection, either
assume an utf-8 encoding or a latin-1 encoding. Rationale: it is just
not worth the effort to detect a _potential_ encoding because we have
no clue what the _correct_ encoding is. For instance, consider the
following bytes instance:
b'This character group is not supported: [abc\xc3\xbf]'
It represents a valid utf-8 and latin-1 encoding. What is the "correct"
one? We don't know... Even if you interpret the bytes instance as a
human you cannot give a definite answer (implicit assumption: there is
no additional context available).
That is, if we cannot give a definite answer in case of two potential
encodings, there is no point in bringing even more potential encodings
into play. Hence, do not use the chardet module.
Note: the rationale for trying utf-8 first is that utf-8 is pretty
much in vogue these days and, hence, the chances are "high" that we
guess the "correct" encoding.
Fixes: #669 ("check in huge shell archives is insanely slow")
Fixes: #746 ("Very slow local buildlog parsing")
Importing `cElementTree` has been deprecated since Python 3.3 -
importing `ElementTree` automatically uses the fastest
implementation available - and is finally removed in Python 3.9.
Importing cElementTree directly (not as part of xml) is an even
older relic, it's for Ye Time Before ElementTree Was Added To
Python and it was instead an external module...which was before
Python 2.5.
We still need to work with Python 2.7 for now, so we use a try/
except to handle both 2.7 and 3.9 cases. Also, let's not repeat
this import 12 times in one file for some reason.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
The repodata.RepoDataQueryResult is supposed to be a bytes API and
that's what our users (see build module) expect.
Note that the repodata.RepoDataQueryResult.path method still returns
a str. That's what the rpmquery.RpmQuery, debquery.DebQuery, and
archquery.ArchQuery classes also do (if the "path" was initially
passed as a str).
Fixes: #760 ("osc build fails when called with --prefer-pkgs where the
passed directory is a repodata repository or a subdirectory of one")
The packagequery.PackageQueryResult class is supposed to provide a
bytes API. Hence, packagequery.PackageQueryResult.evr() should return
bytes instead of a str. Also, adjust the single caller in the build
module.
This is a follow-up commit for commit
6dbf103e10 ("Use html.escape instead
removed cgi.escape"), which breaks the python2 backward compatibility
(since the "html" module is not available by default) and also breaks
the code in general (due to missing html imports).
The fix is based on the proposed fix in [1].
Fixes: boo#1166537 ("osc rq accept - forwarding request causes backtrace")
[1] https://github.com/openSUSE/osc/pull/764
The correct zst magic is b'(\xb5/\xfd' (4 bytes) (that's what obs-build
is also using).
Kudos to Tobias Ellinghaus for spotting this.
Fixes: #756 ("zst detection fails")
In some rare cases the chardet encoding detection detects
a wrong encoding standard. Then we switch to latin-1 which
covers most if utf-8 does not work.
No functional changes. Note that we cannot simply decode the control's
fields as ascii/utf-8 because a field is not necessarily a valid
ascii/utf-8 encoding (it is possible to register _arbitrary_ custom
fields via a 'register-custom-fields' hook when building a deb
package).
Note: DebQuery.debvercmp really deserves a cleanup:/
cmp(a, b) returns
-1 if a < b
0 if a == 0
1 if a > b
This is needed since python3 has no cmp function anymore.
All credits for this go to Marco Strigl <mstrigl@suse.com> (see
PR#483 [1]).
[1] https://github.com/openSUSE/osc/pull/483
The None argument is always <= than the other argument. We need this
in case of a broken/pathological package where version() or release()
return None (see vercmp (which calls rpmvercmp)).
Returning None breaks ArchQuery.vercmp. Returning b'0' is ok because
an epoch, if present, is always supposed to be an integer (at least
in a "valid" arch package (see scripts/libmakepkg/lint_pkgbuild/epoch.sh.in
in the pacman sources)). Hence, if we compare the epoch of a package,
which has no explicit epoch set, with the epoch of a package, which
has an explicit epoch set, we always have a <= relation.
Now, CpioWrite provides a bytes-only API. It would be also possible
that the API accepts bytes and str (we would need to explicitly
encode the latter) but this would be a bit inconsistent wrt.
cpio.CpioRead (which is bytes-only).
Also, by using a bytesarray instead of a [] we avoid several
intermediate ''.join(...)s.
This is a bytes only API because a filename in a cpio archive can
contain, for instance, illegal utf-8 sequences. A user can decode
the filename/content as she wishes.
A ValueError is more appropriate because there is no issue with the
ar archive itself. Also, the old codepath never worked because the
fn parameter was missing.
Since an ar archive can contain arbitary filenames (that is a
filename can be an invalid utf-8 encoding (for instance,
"foo\xff\xffbar")), the ar module provides a bytes only API. A
user can decode filenames as she wishes.
Note: if a "fn" parameter is passed to Ar.__init__ it should be a
bytes (a str is also ok, but then be aware that an ArError's file
attribute might be a str or a bytes).
There is no need to unpack a single byte because it is not
affected by (byte) endianness (and that's what struct.unpack is
about). Moreover, rpmquery.unpack_string now supports an optional
encoding parameter, which could be used by the python3 port to
decode a string. Note: in general we cannot assume that all strings
in a rpm are utf-8 encoded (it is possible to build a rpm that
contains illegal utf-8 sequences).