From 4cca0bf52a4ec87f1038adb8ed3b6368b5344598e253674ba02615592a094592 Mon Sep 17 00:00:00 2001
From: Matej Cepl
Date: Mon, 9 Aug 2021 13:21:06 +0000
Subject: [PATCH] Accepting request 910725 from home:bnavigator:branches:devel:languages:python:numeric

- Update to version 0.7.1
  * Back compile for older versions of numpy
  * Make pandas nullable types opt-out. The old behaviour (casting
    to float) is still available with ParquetFile(...,
    pandas_nulls=False).
  * Fix time field regression: IsAdjustedToUTC will be False when
    there is no timezone
  * Micro improvements to the speed of ParquetFile creation by
    using simple string ops instead of regex and regularising
    filenames once at the start. Affects datasets with many files.
- Release 0.7.0
  * This version institutes major, breaking changes, listed here,
    and incremental fixes and additions.
  * Reading a directory without a _metadata summary file now works
    by providing only the directory, instead of a list of
    constituent files. This change also makes direct use of
    fsspec filesystems, if given, to be able to load the footer
    metadata areas of the files concurrently, if the storage
    backend supports it, without directly instantiating
    intermediate ParquetFile instances
  * row-level filtering of the data. Whereas previously, only full
    row-groups could be excluded on the basis of their parquet
    metadata statistics (if present), filtering can now be done
    within row-groups too. The syntax is the same as before,
    allowing for multiple column expressions to be combined with
    AND|OR, depending on the list structure. This mechanism
    requires two passes: one to load the columns needed to create
    the boolean mask, and another to load the columns actually
    needed in the output. This will not be faster, and may be
    slower, but in some cases can save significant memory
    footprint, if a small fraction of rows are considered good and
    the columns for the filter expression are not in the output.
    Not currently supported for reading with DataPageV2.
  * DELTA integer encoding (read-only): experimentally working,
    but we only have one test file to verify against, since it is
    not trivial to persuade Spark to produce files encoded this
    way. DELTA can be an extremely compact representation for
    slowly varying and/or monotonically increasing integers.
  * nanosecond resolution times: the new extended "logical" types
    system supports nanoseconds alongside the previous millis and
    micros. We now emit these for the default pandas time type,
    and produce full parquet schema including both "converted" and
    "logical" type information. Note that all output has
    isAdjustedToUTC=True, i.e., these are timestamps rather than
    local time. The time-zone is stored in the metadata, as
    before, and will be successfully recreated only in fastparquet
    and (py)arrow. Otherwise, the times will appear to be UTC. For
    compatibility with Spark, you may still want to use
    times="int96" when writing.
  * DataPageV2 writing: now we support both reading and writing.
    Writing can be enabled with the environment variable
    FASTPARQUET_DATAPAGE_V2, or the module global
    fastparquet.writer.DATAPAGE_VERSION, and is off by default. It
    will become on by default in the future. In many cases, V2
    will result in better read performance, because the data and
    page headers are encoded separately, so data can be directly
    read into the output without additional allocation/copies.
    This feature is considered experimental, but we believe it is
    working well for most use cases (i.e., our test suite) and
    should be readable by all modern parquet frameworks including
    arrow and spark.
  * pandas nullable types: pandas supports "masked" extension
    arrays for types that previously could not support NULL at
    all: ints and bools. Fastparquet used to cast such columns to
    float, so that we could represent NULLs as NaN; now we use the
    new(er) masked types by default. This means faster reading of
    such columns, as there is no conversion.
    If the metadata guarantees that there are no nulls, we still
    use the non-nullable variant unless the data was written with
    fastparquet/pyarrow, and the metadata indicates that the
    original datatype was nullable. We already handled writing of
    nullable columns.

OBS-URL: https://build.opensuse.org/request/show/910725
OBS-URL: https://build.opensuse.org/package/show/devel:languages:python:numeric/python-fastparquet?expand=0&rev=34
---
 fastparquet-0.6.3.tar.gz   |  3 --
 fastparquet-0.7.1.tar.gz   |  3 ++
 python-fastparquet.changes | 76 ++++++++++++++++++++++++++++++++++++++
 python-fastparquet.spec    | 11 ++++--
 4 files changed, 86 insertions(+), 7 deletions(-)
 delete mode 100644 fastparquet-0.6.3.tar.gz
 create mode 100644 fastparquet-0.7.1.tar.gz

diff --git a/fastparquet-0.6.3.tar.gz b/fastparquet-0.6.3.tar.gz
deleted file mode 100644
index a0cfe40..0000000
--- a/fastparquet-0.6.3.tar.gz
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:ae834d98670b7d67fd3dbadd09c6475de4a675e74eca9160969a9bd0fef2f4c2
-size 29120288
diff --git a/fastparquet-0.7.1.tar.gz b/fastparquet-0.7.1.tar.gz
new file mode 100644
index 0000000..4425f37
--- /dev/null
+++ b/fastparquet-0.7.1.tar.gz
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cc55e0f9048394e3b67d3af934bc690572e81c0c22488b63960bbe67d16e113e
+size 29164760
diff --git a/python-fastparquet.changes b/python-fastparquet.changes
index 28563e5..16781e9 100644
--- a/python-fastparquet.changes
+++ b/python-fastparquet.changes
@@ -1,3 +1,79 @@
+-------------------------------------------------------------------
+Sun Aug 8 15:13:55 UTC 2021 - Ben Greiner
+
+- Update to version 0.7.1
+  * Back compile for older versions of numpy
+  * Make pandas nullable types opt-out. The old behaviour (casting
+    to float) is still available with ParquetFile(...,
+    pandas_nulls=False).
+  * Fix time field regression: IsAdjustedToUTC will be False when
+    there is no timezone
+  * Micro improvements to the speed of ParquetFile creation by
+    using simple simple string ops instead of regex and
+    regularising filenames once at the start. Effects datasets with
+    many files.
+- Release 0.7.0
+  * This version institutes major, breaking changes, listed here,
+    and incremental fixes and additions.
+  * Reading a directory without a _metadata summary file now works
+    by providing only the directory, instead of a list of
+    constituent files. This change also makes direct of use of
+    fsspec filesystems, if given, to be able to load the footer
+    metadata areas of the files concurrently, if the storage
+    backend supports it, and not directly instantiating
+    intermediate ParquetFile instances
+  * row-level filtering of the data. Whereas previously, only full
+    row-groups could be excluded on the basis of their parquet
+    metadata statistics (if present), filtering can now be done
+    within row-groups too. The syntax is the same as before,
+    allowing for multiple column expressions to be combined with
+    AND|OR, depending on the list structure. This mechanism
+    requires two passes: one to load the columns needed to create
+    the boolean mask, and another to load the columns actually
+    needed in the output. This will not be faster, and may be
+    slower, but in some cases can save significant memory
+    footprint, if a small fraction of rows are considered good and
+    the columns for the filter expression are not in the output.
+    Not currently supported for reading with DataPageV2.
+  * DELTA integer encoding (read-only): experimentally working,
+    but we only have one test file to verify against, since it is
+    not trivial to persuade Spark to produce files encoded this
+    way. DELTA can be extremely compact a representation for
+    slowly varying and/or monotonically increasing integers.
+  * nanosecond resolution times: the new extended "logical" types
+    system supports nanoseconds alongside the previous millis and
+    micros. We now emit these for the default pandas time type,
+    and produce full parquet schema including both "converted" and
+    "logical" type information. Note that all output has
+    isAdjustedToUTC=True, i.e., these are timestamps rather than
+    local time. The time-zone is stored in the metadata, as
+    before, and will be successfully recreated only in fastparquet
+    and (py)arrow. Otherwise, the times will appear to be UTC. For
+    compatibility with Spark, you may still want to use
+    times="int96" when writing.
+  * DataPageV2 writing: now we support both reading and writing.
+    For writing, can be enabled with the environment variable
+    FASTPARQUET_DATAPAGE_V2, or module global fastparquet.writer.
+    DATAPAGE_VERSION and is off by default. It will become on by
+    default in the future. In many cases, V2 will result in better
+    read performance, because the data and page headers are
+    encoded separately, so data can be directly read into the
+    output without addition allocation/copies. This feature is
+    considered experimental, but we believe it working well for
+    most use cases (i.e., our test suite) and should be readable
+    by all modern parquet frameworks including arrow and spark.
+  * pandas nullable types: pandas supports "masked" extension
+    arrays for types that previously could not support NULL at
+    all: ints and bools. Fastparquet used to cast such columns to
+    float, so that we could represent NULLs as NaN; now we use the
+    new(er) masked types by default. This means faster reading of
+    such columns, as there is no conversion. If the metadata
+    guarantees that there are no nulls, we still use the
+    non-nullable variant unless the data was written with
+    fastparquet/pyarrow, and the metadata indicates that the
+    original datatype was nullable. We already handled writing of
+    nullable columns.
+
 -------------------------------------------------------------------
 Tue May 18 14:41:46 UTC 2021 - Ben Greiner

diff --git a/python-fastparquet.spec b/python-fastparquet.spec
index 804344a..a93e6eb 100644
--- a/python-fastparquet.spec
+++ b/python-fastparquet.spec
@@ -21,7 +21,7 @@
 %define skip_python2 1
 %define skip_python36 1
 Name: python-fastparquet
-Version: 0.6.3
+Version: 0.7.1
 Release: 0
 Summary: Python support for Parquet file format
 License: Apache-2.0
@@ -29,8 +29,9 @@ URL: https://github.com/dask/fastparquet/
 Source: https://github.com/dask/fastparquet/archive/%{version}.tar.gz#/fastparquet-%{version}.tar.gz
 BuildRequires: %{python_module Cython}
 BuildRequires: %{python_module cramjam >= 2.3.0}
-BuildRequires: %{python_module fsspec}
-BuildRequires: %{python_module numpy-devel >= 1.11}
+# version requirement not declared for runtime, but necessary for tests.
+BuildRequires: %{python_module fsspec >= 2021.6.0}
+BuildRequires: %{python_module numpy-devel >= 1.18}
 BuildRequires: %{python_module pandas >= 1.1.0}
 BuildRequires: %{python_module pytest}
 BuildRequires: %{python_module python-lzo}
@@ -40,7 +41,7 @@ BuildRequires: fdupes
 BuildRequires: python-rpm-macros
 Requires: python-cramjam >= 2.3.0
 Requires: python-fsspec
-Requires: python-numpy >= 1.11
+Requires: python-numpy >= 1.18
 Requires: python-pandas >= 1.1.0
 Requires: python-thrift >= 0.11.0
 Recommends: python-python-lzo
@@ -54,6 +55,8 @@ for integrating it into python-based Big Data workflows.
 %setup -q -n fastparquet-%{version}
 # remove pytest-runner from setup_requires
 sed -i "s/'pytest-runner',//" setup.py
+# this is not meant for setup.py
+sed -i "s/oldest-supported-numpy/numpy/" setup.py
 # the tests import the fastparquet.test module and we need to import from sitearch, so install it.
 sed -i -e "s/^\s*packages=\[/&'fastparquet.test', /" -e "/exclude_package_data/ d" setup.py