#
# spec file for package python-fastparquet
#
# Copyright (c) 2023 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via https://bugs.opensuse.org/
#


Name:           python-fastparquet
Version:        2023.8.0
Release:        0
Summary:        Python support for Parquet file format
License:        Apache-2.0
URL:            https://github.com/dask/fastparquet/
# Use the GitHub archive, because it contains the test modules and data; requires setting the version manually for setuptools_scm
Source:         https://github.com/dask/fastparquet/archive/%{version}.tar.gz#/fastparquet-%{version}.tar.gz
BuildRequires:  %{python_module Cython >= 0.29.23}
BuildRequires:  %{python_module base >= 3.8}
BuildRequires:  %{python_module cramjam >= 2.3.0}
# version requirement not declared for runtime, but necessary for tests.
BuildRequires:  %{python_module fsspec >= 2021.6.0}
BuildRequires:  %{python_module numpy-devel >= 1.20.3}
BuildRequires:  %{python_module packaging}
BuildRequires:  %{python_module pandas >= 1.5.0}
BuildRequires:  %{python_module pip}
BuildRequires:  %{python_module pytest-asyncio}
BuildRequires:  %{python_module pytest-xdist}
BuildRequires:  %{python_module pytest}
BuildRequires:  %{python_module python-lzo}
BuildRequires:  %{python_module setuptools_scm > 1.5.4}
BuildRequires:  %{python_module setuptools}
BuildRequires:  %{python_module wheel}
BuildRequires:  fdupes
BuildRequires:  git-core
BuildRequires:  python-rpm-macros
Requires:       python-cramjam >= 2.3.0
Requires:       python-fsspec
Requires:       python-numpy >= 1.20.3
Requires:       python-packaging
Requires:       python-pandas >= 1.5.0
Recommends:     python-python-lzo
%python_subpackages

%description
This is a Python implementation of the Parquet file format for
integrating it into Python-based Big Data workflows.

%prep
%autosetup -p1 -n fastparquet-%{version}
# remove pytest-runner from setup_requires
sed -i "s/'pytest-runner',//" setup.py
# this is not meant for setup.py
sed -i "s/oldest-supported-numpy/numpy/" setup.py
# the tests import the fastparquet.test module and we need to import from sitearch, so install it.
sed -i -e "s/^\s*packages=\[/&'fastparquet.test', /" -e "/exclude_package_data/ d" setup.py

%build
export CFLAGS="%{optflags}"
export SETUPTOOLS_SCM_PRETEND_VERSION=%{version}
%pyproject_wheel

%install
%pyproject_install
%python_expand rm -v %{buildroot}%{$python_sitearch}/fastparquet/{speedups,cencoding}.c
%python_expand %fdupes %{buildroot}%{$python_sitearch}

%check
%pytest_arch --pyargs fastparquet --import-mode append -n auto

%files %{python_files}
%doc README.rst
%license LICENSE
%{python_sitearch}/fastparquet
%{python_sitearch}/fastparquet-%{version}*-info

%changelog