Current OBS is delivering hdrmd5 in buildinfo. It turns out
that osc has already code for validating cached files, but it
invalidates all local files atm with python 3.x
Using os.getcwd() in combination with a subsequent .encode() is error
prone:
marcus@linux:~> mkdir illegal_utf-8_encoding_$'\xff'_dir
marcus@linux:~> cd illegal_utf-8_encoding_$'\xff'_dir/
marcus@linux:~/illegal_utf-8_encoding_ÿ_dir> python3
Python 3.8.6 (default, Nov 09 2020, 12:09:06) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.getcwd().encode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 36: surrogates not allowed
>>>
Hence, use os.getcwdb(), which returns a bytes, instead of
os.getcwd().encode().
Fixes: commit 36f7b8ffe9 ("Fix a
potential TypeError in CpioRead.copyin and CpioRead.copyin_file")
If no dir is passed to util.ArFile.saveTo, dir is set to os.getcwd(),
which returns a str. Since self.name is a bytes, the subsequent
os.path.join(dir, self.name) results in a TypeError.
To fix this, use os.getcwdb(), which returns a bytes instead of a
str.
If no "dest" argument is specified when calling CpioRead.copyin or
CpioRead.copyin_file, a TypeError occurs in CpioRead._copyin_file
because os.getcwd(), which returns a str, is used as dest and, hence,
the subsequent os.path.join(...) fails (because it tries to join a
str and a bytes).
In order to avoid this, encode the result of os.getcwd().
Note that the existing
archive.copyin_file(hdr.filename,
os.path.dirname(tmpfile),
os.path.basename(tmpfile))
was OK because CpioRead._copyin_file os.path.join()s "dest" and
"new_fn", which are both str. It is just changed to stress that
CpioRead is a bytes-only API.
Fixes: #865 ("Traceback in osc/util/cpio.py line 128: TypeError:
Can't mix strings and bytes in path components")
In commit 276d6e2439 ("Do not use the
chardet module in util.helper.decode_it") util.helper.decode_it was
changed to always decode the passed object if it has a decode method.
Since a python2 str has a decode method, the new code tries to utf-8
decode the passed str. As a result, a unicode object is returned (if
the decoding worked). Since a unicode object is not an instance of
type str, all subsequent isinstance(decoded_obj, str) checks evaluate
to False, which break some codepaths.
In order to fix this, restore the old python2 behavior (that is, if
the passed object is a str, it is not decode it). This change does not
affect the python3 codepaths.
Fixes: #814 ("osc log | fails")
In general, decode_it is used to get a str from an arbitrary bytes
instance. For this, decode_it used the chardet module (if present)
to detect the underlying encoding (if the bytes instance corresponds
to a "supported" encoding). The drawback of this detection is that
it can take quite some time in case of a large bytes instance, which
represents no "supported" encoding (see #669 and #746).
Instead of doing a potentially "time consuming" detection, either
assume an utf-8 encoding or a latin-1 encoding. Rationale: it is just
not worth the effort to detect a _potential_ encoding because we have
no clue what the _correct_ encoding is. For instance, consider the
following bytes instance:
b'This character group is not supported: [abc\xc3\xbf]'
It represents a valid utf-8 and latin-1 encoding. What is the "correct"
one? We don't know... Even if you interpret the bytes instance as a
human you cannot give a definite answer (implicit assumption: there is
no additional context available).
That is, if we cannot give a definite answer in case of two potential
encodings, there is no point in bringing even more potential encodings
into play. Hence, do not use the chardet module.
Note: the rationale for trying utf-8 first is that utf-8 is pretty
much in vogue these days and, hence, the chances are "high" that we
guess the "correct" encoding.
Fixes: #669 ("check in huge shell archives is insanely slow")
Fixes: #746 ("Very slow local buildlog parsing")
Importing `cElementTree` has been deprecated since Python 3.3 -
importing `ElementTree` automatically uses the fastest
implementation available - and is finally removed in Python 3.9.
Importing cElementTree directly (not as part of xml) is an even
older relic, it's for Ye Time Before ElementTree Was Added To
Python and it was instead an external module...which was before
Python 2.5.
We still need to work with Python 2.7 for now, so we use a try/
except to handle both 2.7 and 3.9 cases. Also, let's not repeat
this import 12 times in one file for some reason.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
The repodata.RepoDataQueryResult is supposed to be a bytes API and
that's what our users (see build module) expect.
Note that the repodata.RepoDataQueryResult.path method still returns
a str. That's what the rpmquery.RpmQuery, debquery.DebQuery, and
archquery.ArchQuery classes also do (if the "path" was initially
passed as a str).
Fixes: #760 ("osc build fails when called with --prefer-pkgs where the
passed directory is a repodata repository or a subdirectory of one")
The packagequery.PackageQueryResult class is supposed to provide a
bytes API. Hence, packagequery.PackageQueryResult.evr() should return
bytes instead of a str. Also, adjust the single caller in the build
module.
This is a follow-up commit for commit
6dbf103e10 ("Use html.escape instead
removed cgi.escape"), which breaks the python2 backward compatibility
(since the "html" module is not available by default) and also breaks
the code in general (due to missing html imports).
The fix is based on the proposed fix in [1].
Fixes: boo#1166537 ("osc rq accept - forwarding request causes backtrace")
[1] https://github.com/openSUSE/osc/pull/764
The correct zst magic is b'(\xb5/\xfd' (4 bytes) (that's what obs-build
is also using).
Kudos to Tobias Ellinghaus for spotting this.
Fixes: #756 ("zst detection fails")
In some rare cases the chardet encoding detection detects
a wrong encoding standard. Then we switch to latin-1 which
covers most if utf-8 does not work.
No functional changes. Note that we cannot simply decode the control's
fields as ascii/utf-8 because a field is not necessarily a valid
ascii/utf-8 encoding (it is possible to register _arbitrary_ custom
fields via a 'register-custom-fields' hook when building a deb
package).
Note: DebQuery.debvercmp really deserves a cleanup:/
cmp(a, b) returns
-1 if a < b
0 if a == 0
1 if a > b
This is needed since python3 has no cmp function anymore.
All credits for this go to Marco Strigl <mstrigl@suse.com> (see
PR#483 [1]).
[1] https://github.com/openSUSE/osc/pull/483
The None argument is always <= than the other argument. We need this
in case of a broken/pathological package where version() or release()
return None (see vercmp (which calls rpmvercmp)).
Returning None breaks ArchQuery.vercmp. Returning b'0' is ok because
an epoch, if present, is always supposed to be an integer (at least
in a "valid" arch package (see scripts/libmakepkg/lint_pkgbuild/epoch.sh.in
in the pacman sources)). Hence, if we compare the epoch of a package,
which has no explicit epoch set, with the epoch of a package, which
has an explicit epoch set, we always have a <= relation.
Now, CpioWrite provides a bytes-only API. It would be also possible
that the API accepts bytes and str (we would need to explicitly
encode the latter) but this would be a bit inconsistent wrt.
cpio.CpioRead (which is bytes-only).
Also, by using a bytesarray instead of a [] we avoid several
intermediate ''.join(...)s.
This is a bytes only API because a filename in a cpio archive can
contain, for instance, illegal utf-8 sequences. A user can decode
the filename/content as she wishes.
A ValueError is more appropriate because there is no issue with the
ar archive itself. Also, the old codepath never worked because the
fn parameter was missing.
Since an ar archive can contain arbitary filenames (that is a
filename can be an invalid utf-8 encoding (for instance,
"foo\xff\xffbar")), the ar module provides a bytes only API. A
user can decode filenames as she wishes.
Note: if a "fn" parameter is passed to Ar.__init__ it should be a
bytes (a str is also ok, but then be aware that an ArError's file
attribute might be a str or a bytes).
There is no need to unpack a single byte because it is not
affected by (byte) endianness (and that's what struct.unpack is
about). Moreover, rpmquery.unpack_string now supports an optional
encoding parameter, which could be used by the python3 port to
decode a string. Note: in general we cannot assume that all strings
in a rpm are utf-8 encoded (it is possible to build a rpm that
contains illegal utf-8 sequences).
This functions are used in the whole code and are
mandatory for the python3 support to work. In python2
case nothing is touched.
* cmp_to_key:
converts a cmp= into a key= function
* decode_list:
decodes each element of a list. This is needed if
we have a mixed list with strings and bytes.
* decode_it:
Takes the input and checks if it is not a string.
Then it uses chardet to get the encoding.
Storing the error encoding in an "encoding" attribute "breaks" the
python3 "input" function: In essence, builtin_input_impl does a
getattr(sys.stdout, 'encoding'), which returns our error encoding
instead of the "real" stdout encoding. In order to avoid this, we
store the error encoding in an "_encoding" attribute.
Making SafeWriter a new-style class simplifies the code a lot.
The following abstract methods are added to the PackageQueryResult
class: recommends(), suggests(), supplements(), and enhances().
Note that not all package/metadata formats have a notion of these
weak dependencies.
rpm rpmmd deb arch
recommends x x x
suggests x x x x
supplements x x
enhances x x x
(where "x" represents "supported"). In case of an unsupported weak
dependency, the implementation returns an empty list.
We need the weak dependency support in order to fix#363 ("osc build
-p ../rpms/tw doesnt send recommends to the server which makes client
side build behave differently to server side build").
Similar to recent fixes in libsolv and obs-build. Since tarfile
on python2 doesn't do lzma, decompress the file into memory and
feed it as a fake file via StringIO to tarfile
The most visible change in python3 - removal of print statement and all
the crufty
print >> sys.stderr, foo,
The from __future__ import print_function makes it available in python
2.6
Some modules (httplib, StringIO, ...) were renamed in python3. This
patch try to import the proper symbols from python3 and then fallback to
python2 in a case ImportError will appear.
There is one exception, python 2.7 got the io module with StringIO, but
it allow unicode arguments only. Therefor the old module is poked before
new one.
this patch
1.) removes the iteritems/itervalues, which were dropped in py3
items/values are used instead
2.) add an extra list() in a cases the list-based access is needed
(included appending, indexing and so)
3.) changes a sorting idiom in few places
instead of
foo = dict.keys()
foo.sort()
for i in foo:
there is a recommended
for i in sorted(dict.keys()):
4.) in one occassion it removes a if dict.has_key() by simpler
dict.get(key, default)
- util/rpmquery:
* added new methods "is_src", "is_nosrc" to check if the package is
a src rpm or nosrc rpm
* fixed "canonname": this never worked for src- or nosrc rpms
- minor code restructuring
Note:
in order to fetch the cpio archives osc uses "getbinarylist". The
drawback is that "getbinarylist" doesn't generate an ".errors" file
if we're requesting a non-existent filename.
Any directory passed to --prefer-pkgs will be searched for a repodata
directory. If the directory does not contain a repodata directory, then
each ancestor directory is checked. This allows for the user error of
specifying an individual architecture directory (e.g. x86_64) instead of the
parent repository directory that contains the repodata:
repository/
x86_64/
*.rpm
repodata/
*.xml.gz
The use case for this feature is it allows snapshots of the OBS repositories
to be offloaded to an network-attached filesystem. repodata directories are
used as the xml.gz files are faster to read than the 100s of rpms in a given
snapshot. These snapshots are used to track older rpm sets that may be
deployed for testing.