Sync from SUSE:SLFO:Main python-fastparquet revision 136ce88c10ae31bf7d1802e80f6b9f42

Adrian Schröter 2024-05-03 20:39:35 +02:00
commit 0ea456fc73
4 changed files with 632 additions and 0 deletions

23
.gitattributes vendored Normal file

@@ -0,0 +1,23 @@
## Default LFS
*.7z filter=lfs diff=lfs merge=lfs -text
*.bsp filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.gem filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.jar filter=lfs diff=lfs merge=lfs -text
*.lz filter=lfs diff=lfs merge=lfs -text
*.lzma filter=lfs diff=lfs merge=lfs -text
*.obscpio filter=lfs diff=lfs merge=lfs -text
*.oxt filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.rpm filter=lfs diff=lfs merge=lfs -text
*.tbz filter=lfs diff=lfs merge=lfs -text
*.tbz2 filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.ttf filter=lfs diff=lfs merge=lfs -text
*.txz filter=lfs diff=lfs merge=lfs -text
*.whl filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text

BIN
fastparquet-2023.10.1.tar.gz (Stored with Git LFS) Normal file

Binary file not shown.

514
python-fastparquet.changes Normal file

@@ -0,0 +1,514 @@
-------------------------------------------------------------------
Mon Jan 22 12:41:30 UTC 2024 - Daniel Garcia <daniel.garcia@suse.com>
- Do not run tests in s390x, bsc#1218603
-------------------------------------------------------------------
Tue Dec 5 12:23:32 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 2023.10.0:
* Datetime units in empty() with tz (#893)
* Fewer inplace decompressions for V2 pages (#890)
* Allow writing categorical column with no categories (#888)
* Fixes for new numpy (#886)
* RLE bools and DELTA for v1 pages (#885, 883)
-------------------------------------------------------------------
Mon Sep 11 21:29:16 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 2023.8.0:
* More general timestamp units (#874)
* ReadTheDocs V2 (#871)
* Better roundtrip dtypes (#861, 859)
* No convert when computing bytes-per-item for str (#858)
-------------------------------------------------------------------
Sat Jul 1 20:05:36 UTC 2023 - Arun Persaud <arun@gmx.de>
- update to version 2023.7.0:
* Add test case for reading non-pandas parquet file (#870)
* Extra field when cloning ParquetFile (#866)
-------------------------------------------------------------------
Fri Apr 28 08:10:46 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 2023.4.0:
* allow loading categoricals even if not so in the pandas metadata,
when a column is dict-encoded and we only have one row-group (#863)
* apply dtype to the columns names series, even when selecting no
columns (#861, 859)
* don't make strings while estimating byte column size (#858)
* handle upstream depr (#857, 856)
-------------------------------------------------------------------
Thu Feb 9 15:55:08 UTC 2023 - Arun Persaud <arun@gmx.de>
- update to version 2023.2.0:
* revert one-level set of filters (#852)
* full size dict for decoding V2 pages (#850)
* infer_object_encoding fix (#847)
* row filtering with V2 pages (#845)
-------------------------------------------------------------------
Wed Feb 8 18:25:03 UTC 2023 - Arun Persaud <arun@gmx.de>
- specfile:
* remove fastparquet-pr835.patch, implemented upstream
- update to version 2023.1.0:
* big improvement to write speed
* paging support for bigger row-groups
* pandas 2.0 support
* delta for big-endian architecture
-------------------------------------------------------------------
Mon Jan 2 20:38:49 UTC 2023 - Ben Greiner <code@bnavigator.de>
- Update to 2022.12.0
* check all int32 values before passing to thrift writer
* fix type of num_rows to i64 for big single file
- Release 2022.11.0
* Switch to calver
* Speed up loading of nullable types
* Allow schema evolution by addition of columns
* Allow specifying dtypes of output
* update to scm versioning
* fixes to row filter, statistics and tests
* support pathlib.Paths
* JSON encoder options
- Drop fastparquet-pr813-updatefixes.patch
-------------------------------------------------------------------
Fri Dec 23 09:18:39 UTC 2022 - Guillaume GARDET <guillaume.gardet@opensuse.org>
- Add patch to fix the test test_delta_from_def_2 on
aarch64, armv7 and ppc64le:
* fastparquet-pr835.patch
-------------------------------------------------------------------
Fri Oct 28 15:47:41 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Update to 0.8.3
* improved key/value handling and rejection of bad types
* fix regression in consolidate_cats (caught in dask tests)
- Release 0.8.2
* datetime indexes initialised to 0 to prevent overflow from
random memory
* case from csv_to_parquet where stats exists but has no nulls
entry
* define len and bool for ParquetFile
* maintain int types of optional data that came from pandas
* fix for delta encoding
- Add fastparquet-pr813-updatefixes.patch gh#dask/fastparquet#813
-------------------------------------------------------------------
Tue Apr 26 11:02:27 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Update to 0.8.1
* fix critical buffer overflow crash for large number of columns
and long column names
* metadata handling
* thrift int32 for list
* avoid error storing NaNs in column stats
-------------------------------------------------------------------
Sat Jan 29 21:36:38 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Update to 0.8.0
* our own cythonic thrift implementation (drop thrift dependency)
* more in-place dataset editing and reordering
* python 3.10 support
* fixes for multi-index and pandas types
- Clean test skips
-------------------------------------------------------------------
Sun Jan 16 13:34:53 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Clean specfile from unused python36 conditionals
- Require thrift 0.15.0 (+patch) for Python 3.10 compatibility
* gh#dask/fastparquet#514
-------------------------------------------------------------------
Sat Nov 27 20:34:53 UTC 2021 - Arun Persaud <arun@gmx.de>
- update to version 0.7.2:
* Ability to remove row-groups in-place for multifile datasets
* Accept pandas nullable Float type
* allow empty strings and fix min/max when there is no data
* make writing statistics optional
* row selection in to_pandas()
-------------------------------------------------------------------
Sun Aug 8 15:13:55 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Update to version 0.7.1
* Back compile for older versions of numpy
* Make pandas nullable types opt-out. The old behaviour (casting
to float) is still available with ParquetFile(...,
pandas_nulls=False).
* Fix time field regression: IsAdjustedToUTC will be False when
there is no timezone
* Micro improvements to the speed of ParquetFile creation by
using simple string ops instead of regex and
regularising filenames once at the start. Affects datasets with
many files.
- Release 0.7.0
* This version institutes major, breaking changes, listed here,
and incremental fixes and additions.
* Reading a directory without a _metadata summary file now works
by providing only the directory, instead of a list of
constituent files. This change also makes direct use of
fsspec filesystems, if given, to be able to load the footer
metadata areas of the files concurrently, if the storage
backend supports it, and not directly instantiating
intermediate ParquetFile instances
* row-level filtering of the data. Whereas previously, only full
row-groups could be excluded on the basis of their parquet
metadata statistics (if present), filtering can now be done
within row-groups too. The syntax is the same as before,
allowing for multiple column expressions to be combined with
AND|OR, depending on the list structure. This mechanism
requires two passes: one to load the columns needed to create
the boolean mask, and another to load the columns actually
needed in the output. This will not be faster, and may be
slower, but in some cases can save significant memory
footprint, if a small fraction of rows are considered good and
the columns for the filter expression are not in the output.
Not currently supported for reading with DataPageV2.
* DELTA integer encoding (read-only): experimentally working,
but we only have one test file to verify against, since it is
not trivial to persuade Spark to produce files encoded this
way. DELTA can be an extremely compact representation for
slowly varying and/or monotonically increasing integers.
* nanosecond resolution times: the new extended "logical" types
system supports nanoseconds alongside the previous millis and
micros. We now emit these for the default pandas time type,
and produce full parquet schema including both "converted" and
"logical" type information. Note that all output has
isAdjustedToUTC=True, i.e., these are timestamps rather than
local time. The time-zone is stored in the metadata, as
before, and will be successfully recreated only in fastparquet
and (py)arrow. Otherwise, the times will appear to be UTC. For
compatibility with Spark, you may still want to use
times="int96" when writing.
* DataPageV2 writing: now we support both reading and writing.
For writing, it can be enabled with the environment variable
FASTPARQUET_DATAPAGE_V2 or the module global
fastparquet.writer.DATAPAGE_VERSION, and is off by default. It
will become on by default in the future. In many cases, V2 will
result in better read performance, because the data and page
headers are encoded separately, so data can be directly read into
the output without additional allocation/copies. This feature is
considered experimental, but we believe it is working well for
most use cases (i.e., our test suite) and should be readable
by all modern parquet frameworks including arrow and spark.
* pandas nullable types: pandas supports "masked" extension
arrays for types that previously could not support NULL at
all: ints and bools. Fastparquet used to cast such columns to
float, so that we could represent NULLs as NaN; now we use the
new(er) masked types by default. This means faster reading of
such columns, as there is no conversion. If the metadata
guarantees that there are no nulls, we still use the
non-nullable variant unless the data was written with
fastparquet/pyarrow, and the metadata indicates that the
original datatype was nullable. We already handled writing of
nullable columns.
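
Note on the DELTA entry above: the claim that delta encoding is compact for slowly varying or monotonically increasing integers can be illustrated with a small sketch. This is a simplified, hypothetical illustration in plain Python; it is not fastparquet's implementation and does not reproduce the block and bit-packing layout that Parquet's DELTA encoding actually uses. The function names are made up for the example.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


def max_delta_bits(values):
    """Bits needed for the largest absolute difference (sign bit not counted)."""
    deltas = delta_encode(values)
    return max((abs(d).bit_length() for d in deltas[1:]), default=0)


if __name__ == "__main__":
    # Hypothetical column: timestamps one minute apart.
    stamps = [1_700_000_000 + 60 * i for i in range(5)]
    print(delta_encode(stamps))
    print(max_delta_bits(stamps), "bits per delta vs", max(stamps).bit_length(), "bits per raw value")
```

The raw values each need about 31 bits, while every difference after the first fits in a handful of bits, which is where the space saving comes from.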
-------------------------------------------------------------------
Tue May 18 14:41:46 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Update to version 0.6.3
* no release notes
* new requirement: cramjam instead of separate compression libs
and their bindings
* switch from numba to Cython
-------------------------------------------------------------------
Fri Feb 12 14:50:18 UTC 2021 - Dirk Müller <dmueller@suse.com>
- skip python 36 build
-------------------------------------------------------------------
Thu Feb 4 17:50:32 UTC 2021 - Jan Engelhardt <jengelh@inai.de>
- Use of "+=" in %check warrants bash as buildshell.
-------------------------------------------------------------------
Wed Feb 3 21:43:10 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Skip the import without warning test gh#dask/fastparquet#558
- Apply the Cepl-Strangelove-Parameter to pytest
(--import-mode append)
-------------------------------------------------------------------
Sat Jan 2 21:04:30 UTC 2021 - Benjamin Greiner <code@bnavigator.de>
- update to version 0.5
* no changelog
- update test suite setup -- install the .test module
-------------------------------------------------------------------
Sat Jul 18 18:13:53 UTC 2020 - Arun Persaud <arun@gmx.de>
- specfile:
* update requirements: version numbers and added packaging
- update to version 0.4.1:
* nulls, fixes #504
* deps: Add missing dependency on packaging. (#502)
-------------------------------------------------------------------
Thu Jul 9 14:04:10 UTC 2020 - Marketa Calabkova <mcalabkova@suse.com>
- Update to 0.4.0
* Changed RangeIndex private methods to public ones
* Use the python executable used to run the code
* Add support for Python 3.8
* support for numba > 0.48
- drop upstreamed patch use-python-exec.patch
-------------------------------------------------------------------
Mon Apr 6 06:54:36 UTC 2020 - Tomáš Chvátal <tchvatal@suse.com>
- Add patch to use sys.executable and not call py2 binary directly:
* use-python-exec.patch
-------------------------------------------------------------------
Mon Apr 6 06:50:26 UTC 2020 - Tomáš Chvátal <tchvatal@suse.com>
- Update to 0.3.3:
* no upstream changelog
-------------------------------------------------------------------
Fri Oct 25 17:50:50 UTC 2019 - Todd R <toddrme2178@gmail.com>
- Drop broken python 2 support.
- Testing fixes
-------------------------------------------------------------------
Sat Aug 3 15:10:41 UTC 2019 - Arun Persaud <arun@gmx.de>
- update to version 0.3.2:
* Only calculate dataset stats once (#453)
* Fixes #436 (#452)
* Fix a crash if trying to read a file whose created_by value is not
set
* COMPAT: Fix for pandas DeprecationWarning (#446)
* Apply timezone to index (#439)
* Handle NaN partition values (#438)
* Pandas meta (#431)
* Only strip _metadata from end of file path (#430)
* Simple nesting fix (#428)
* Disallow bad tz on save, warn on load (#427)
-------------------------------------------------------------------
Tue Jul 30 14:23:21 UTC 2019 - Todd R <toddrme2178@gmail.com>
- Fix spurious test failure
-------------------------------------------------------------------
Mon May 20 15:12:11 CEST 2019 - Matej Cepl <mcepl@suse.com>
- Clean up SPEC file.
-------------------------------------------------------------------
Tue Apr 30 14:28:46 UTC 2019 - Todd R <toddrme2178@gmail.com>
- update to 0.3.1
* Add schema == (__eq__) and != (__ne__) methods and tests.
* Fix item iteration for decimals
* List missing columns in error message
* Fix tz being None case
- Update to 0.3.0
* Squash some warnings and import failures
* Improvements to in and not in operators
* Fixes because pandas released
-------------------------------------------------------------------
Sat Jan 26 17:05:09 UTC 2019 - Arun Persaud <arun@gmx.de>
- specfile:
* update copyright year
- update to version 0.2.1:
* Compat for pandas 0.24.0 refactor (#390)
* Change OverflowError message when failing on large pages (#387)
* Allow for changes in dictionary while reading a row-group column
(#367)
* Correct pypi project names for compression libraries (#385)
-------------------------------------------------------------------
Thu Nov 22 22:47:24 UTC 2018 - Arun Persaud <arun@gmx.de>
- update to version 0.2.0:
* Don't mutate column list input (#383) (#384)
* Add optional requirements to extras_require (#380)
* Fix "broken link to parquet-format page" (#377)
* Add .c file to repo
* Handle rows split across 2 pages in the case of a map (#369)
* Fixes 370 (#371)
* Handle multi-page maps (#368)
* Handle zero-column files. Closes #361. (#363)
-------------------------------------------------------------------
Sun Sep 30 16:22:56 UTC 2018 - Arun Persaud <arun@gmx.de>
- specfile:
* update url
* make %files section more specific
- update to version 0.1.6:
* Restrict what categories get passed through (#358)
* Deep digging for multi-indexes (#356)
* allow_empty is the default in >=zstandard-0.9 (#355)
* Remove setup_requires from setup.py (#345)
* Fixed error if a certain partition is empty, when writing a
partitioned (#347)
* Allow UTF8 column names to be read (#342)
* readd test file
* Allow for NULL converted type (#340)
* Robust partition names (#336)
* Fix accidental multiindex
* Read multi indexes (#331)
* Allow reading from any file-like (#330)
* change `parquet-format` link to apache repo (#328)
* Remove extra space from api.py (#325)
* numba bool fun (#324)
- changes from version 0.1.5:
* Fix _dtypes to be more efficient, to work with files with lots of
columns (#318)
* Buildfix (#313)
* Use LZ4 block compression for compatibility with parquet-cpp
(#314) (#315)
* Fix typo in ParquetFile docstring (#312)
* Remove annoying print() when reading file with CategoricalDtype
index (#311)
* Allow lists of multi-file data-sets (#309)
* Accelerate dataframe.empty for small/medium sizes (#307)
* Include dictionary page in column size (#306)
* Fix for selecting columns which were used for partitioning (#304)
* Remove occurrences of np.fromstring (#303)
* Add support for zstandard compression (#296)
* Int96time order (#298)
- changes from version 0.1.4:
* Add handling of keyword arguments for compressor (#294)
* Fix setup.py duplication (#295)
* Integrate pytest with setup.py (#293)
* Get setup.py pytest to work. (#287)
* Add LZ4 support (#292)
* Update for forthcoming thrift release (#281)
* If timezones are in pandas metadata, assign columns as required
(#285)
* Pandas import (#284)
* Copy FMDs instead of mutate (#279)
* small fixes (#278)
* fixes to get benchmark to work (#276)
* backwards compat with Dask
* Fix test_time_millis on Windows (#275)
* join paths os-independently (#271)
* Adds int32 support for object encoding (#268)
* Fix a couple small typos in documentation (#267)
* Partition order should be sorted (#265)
* COMPAT: Update thrift (#264)
* Speedups result (#253)
* Remove thrift_copy
* Define `__copy__` on thrift structures
* Update rtd deps
- changes from version 0.1.3:
* More care over append when partitioning multiple columns
* Sep for windows cats filtering
* Move pytest imports to tests/ remove requirement
* Special-case only zeros
* Cope with partition values like "07"
* fix for s3
* Fix for list of paths rooted in the current directory
* add test
* Explicit file opens
* update docstring
* Refactor partition interpretation
* py2 fix
* Error in test changed
* Better error messages when failed to convert on write
- changes from version 0.1.2:
* Revert accidental removal of s3 import
* Move thrift things together, and make thrift serializer for pickle
* COMPAT: for new pandas CategoricalDtype
* Fixup for backwards seeking.
* Fix some test failures
* Prototype version using thrift instead of thriftpy
* Not all mergers have cats
* Revert accidental deletion
* remove warnings
* Sort keys in json for metadata
* Check column chunks for categories sizes
* Account for partition dir names with numbers
* Fix map/list doc
* Catch more stats errors
* Prevent pandas auto-names being given to index
- changes from version 0.1.1:
* Add workaround for single-value-partition
* update test
* Simplify and fix for py2
* Use thrift encoding on statistics strings
* remove redundant SNAPPY from supported compressions list
* Fix statistics
* lists again
* Always convert int96 to times
* Update docs
* attribute typo
* Fix definition level
* Add test, clean columns
* Allow optional->optional lists and maps
* Flatten schema to enable loading of non-repeated columns
* Remove extra file
* Fix py2
* Fix "in" filter to cope with strings that could be numbers
* Allow pip install without NumPy or Cython
- changes from version 0.1.0:
* Add ParquetFile attribute documentation
* Fix tests
* Enable append to an empty dataset
* More warning words and check on partition_on
* Do not fail stats if there are no row-groups
* Fix "numpy_dtype"->"numpy_type"
* "in" was checking range not exact membership of set
* If metadata gives index, put in columns
* Fix pytest warning
* Fail on ordering dict statistics
* Fix stats filter
* clean test
* Fix ImportWarning on Python 3.6+
* TEST: added updated test file for special strings used in filters
* fix links
* [README]: indicate dependency on LLVM 4.0.x.
* Filter stats had unfortunate converted_type check
* Ignore exceptions in val_to_num
* Also for TODAY
* Very special case for partition: NOW should be kept as string
* Allow partition_on; fix category nulls
* Remove old category key/values on writing
* Implement writing pandas metadata and auto-setting cats/index
* Pandas compatibility
* Test and fix for filter on single file
* Do not attempt to recurse into schema elements with zero children
-------------------------------------------------------------------
Thu Jun 7 20:41:31 UTC 2018 - jengelh@inai.de
- Fixup grammar./Replace future aims with what it does now.
-------------------------------------------------------------------
Thu May 3 14:07:08 UTC 2018 - toddrme2178@gmail.com
- Use %license tag
-------------------------------------------------------------------
Thu May 25 12:19:26 UTC 2017 - toddrme2178@gmail.com
- Initial version

92
python-fastparquet.spec Normal file

@@ -0,0 +1,92 @@
#
# spec file for package python-fastparquet
#
# Copyright (c) 2024 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.
# Please submit bugfixes or comments via https://bugs.opensuse.org/
#
%{?sle15_python_module_pythons}
Name: python-fastparquet
Version: 2023.10.1
Release: 0
Summary: Python support for Parquet file format
License: Apache-2.0
URL: https://github.com/dask/fastparquet/
# Use GitHub archive, because it contains the test modules and data; requires setting version manually for setuptools_scm
Source: https://github.com/dask/fastparquet/archive/%{version}.tar.gz#/fastparquet-%{version}.tar.gz
BuildRequires: %{python_module Cython >= 0.29.23}
BuildRequires: %{python_module base >= 3.8}
BuildRequires: %{python_module cramjam >= 2.3.0}
# version requirement not declared for runtime, but necessary for tests.
BuildRequires: %{python_module fsspec >= 2021.6.0}
BuildRequires: %{python_module numpy-devel >= 1.20.3}
BuildRequires: %{python_module packaging}
BuildRequires: %{python_module pandas >= 1.5.0}
BuildRequires: %{python_module pip}
BuildRequires: %{python_module pytest-asyncio}
BuildRequires: %{python_module pytest-xdist}
BuildRequires: %{python_module pytest}
BuildRequires: %{python_module python-lzo}
BuildRequires: %{python_module setuptools_scm > 1.5.4}
BuildRequires: %{python_module setuptools}
BuildRequires: %{python_module wheel}
BuildRequires: fdupes
BuildRequires: git-core
BuildRequires: python-rpm-macros
Requires: python-cramjam >= 2.3.0
Requires: python-fsspec
Requires: python-numpy >= 1.20.3
Requires: python-packaging
Requires: python-pandas >= 1.5.0
Recommends: python-python-lzo
%python_subpackages
%description
This is a Python implementation of the parquet format
for integrating it into python-based Big Data workflows.
%prep
%autosetup -p1 -n fastparquet-%{version}
# remove pytest-runner from setup_requires
sed -i "s/'pytest-runner',//" setup.py
# this is not meant for setup.py
sed -i "s/oldest-supported-numpy/numpy/" setup.py
# the tests import the fastparquet.test module and we need to import from sitearch, so install it.
sed -i -e "s/^\s*packages=\[/&'fastparquet.test', /" -e "/exclude_package_data/ d" setup.py
%build
export CFLAGS="%{optflags}"
export SETUPTOOLS_SCM_PRETEND_VERSION=%{version}
%pyproject_wheel
%install
%pyproject_install
%python_expand rm -v %{buildroot}%{$python_sitearch}/fastparquet/{speedups,cencoding}.c
%python_expand %fdupes %{buildroot}%{$python_sitearch}
%check
%ifarch s390x
# Test suite is not working correctly in s390x so not running it.
echo "Not running tests for s390x"
%else
%pytest_arch --pyargs fastparquet --import-mode append -n auto
%endif
%files %{python_files}
%doc README.rst
%license LICENSE
%{python_sitearch}/fastparquet
%{python_sitearch}/fastparquet-%{version}*-info
%changelog
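
Note on the last sed call in %prep: it performs two edits on setup.py. It prepends 'fastparquet.test' to the packages list, and it deletes any line mentioning exclude_package_data, so that the test module gets installed. The Python sketch below mirrors those two edits on a made-up setup.py fragment, as a way to check what the expressions do. The sample text and the function name are assumptions for illustration; the real setup.py in the tarball may look different.

```python
import re

# Same pattern as the sed expression: optional leading whitespace, then "packages=[".
PACKAGES_RE = re.compile(r"^\s*packages=\[")


def patch_setup_py(text):
    """Apply the two sed edits line by line, as sed itself would."""
    patched = []
    for line in text.splitlines(keepends=True):
        # Equivalent of: /exclude_package_data/ d
        if "exclude_package_data" in line:
            continue
        # Equivalent of: s/^\s*packages=\[/&'fastparquet.test', /
        # ("&" in sed stands for the whole match, here m.group(0)).
        line = PACKAGES_RE.sub(
            lambda m: m.group(0) + "'fastparquet.test', ", line, count=1
        )
        patched.append(line)
    return "".join(patched)


if __name__ == "__main__":
    # Hypothetical fragment, not copied from the upstream setup.py.
    sample = (
        "setup(\n"
        "    packages=['fastparquet'],\n"
        "    exclude_package_data={'fastparquet': ['test/*']},\n"
        ")\n"
    )
    print(patch_setup_py(sample))
```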