diff --git a/fastparquet-0.6.3.tar.gz b/fastparquet-0.6.3.tar.gz
deleted file mode 100644
index a0cfe40..0000000
--- a/fastparquet-0.6.3.tar.gz
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:ae834d98670b7d67fd3dbadd09c6475de4a675e74eca9160969a9bd0fef2f4c2
-size 29120288
diff --git a/fastparquet-0.7.1.tar.gz b/fastparquet-0.7.1.tar.gz
new file mode 100644
index 0000000..4425f37
--- /dev/null
+++ b/fastparquet-0.7.1.tar.gz
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cc55e0f9048394e3b67d3af934bc690572e81c0c22488b63960bbe67d16e113e
+size 29164760
diff --git a/python-fastparquet.changes b/python-fastparquet.changes
index 28563e5..16781e9 100644
--- a/python-fastparquet.changes
+++ b/python-fastparquet.changes
@@ -1,3 +1,79 @@
+-------------------------------------------------------------------
+Sun Aug 8 15:13:55 UTC 2021 - Ben Greiner
+
+- Update to version 0.7.1
+  * Back compile for older versions of numpy
+  * Make pandas nullable types opt-out. The old behaviour (casting
+    to float) is still available with ParquetFile(...,
+    pandas_nulls=False).
+  * Fix time field regression: IsAdjustedToUTC will be False when
+    there is no timezone
+  * Micro improvements to the speed of ParquetFile creation by
+    using simple string ops instead of regex and regularising
+    filenames once at the start. Affects datasets with many files.
+- Release 0.7.0
+  * This version institutes major, breaking changes, listed here,
+    alongside incremental fixes and additions.
+  * Reading a directory without a _metadata summary file now works
+    by providing only the directory, instead of a list of
+    constituent files. This change also makes direct use of fsspec
+    filesystems, if given, to load the footer metadata areas of
+    the files concurrently, if the storage backend supports it,
+    without instantiating intermediate ParquetFile instances.
+  * Row-level filtering of the data. Whereas previously only full
+    row-groups could be excluded on the basis of their parquet
+    metadata statistics (if present), filtering can now be done
+    within row-groups too. The syntax is the same as before,
+    allowing multiple column expressions to be combined with
+    AND|OR, depending on the list structure. This mechanism
+    requires two passes: one to load the columns needed to create
+    the boolean mask, and another to load the columns actually
+    needed in the output. This will not be faster, and may be
+    slower, but in some cases can save a significant memory
+    footprint, if a small fraction of rows are considered good
+    and the columns for the filter expression are not in the
+    output. Not currently supported for reading with DataPageV2.
+  * DELTA integer encoding (read-only): experimentally working,
+    but we only have one test file to verify against, since it is
+    not trivial to persuade Spark to produce files encoded this
+    way. DELTA can be an extremely compact representation for
+    slowly varying and/or monotonically increasing integers.
+  * Nanosecond resolution times: the new extended "logical" types
+    system supports nanoseconds alongside the previous millis and
+    micros. We now emit these for the default pandas time type,
+    and produce a full parquet schema including both "converted"
+    and "logical" type information. Note that all output has
+    isAdjustedToUTC=True, i.e., these are timestamps rather than
+    local time. The time-zone is stored in the metadata, as
+    before, and will be successfully recreated only in
+    fastparquet and (py)arrow.
+    Otherwise, the times will appear to be UTC. For compatibility
+    with Spark, you may still want to use times="int96" when
+    writing.
+  * DataPageV2 writing: now we support both reading and writing.
+    For writing, it can be enabled with the environment variable
+    FASTPARQUET_DATAPAGE_V2 or the module global
+    fastparquet.writer.DATAPAGE_VERSION, and is off by default.
+    It will become the default in the future. In many cases, V2
+    will result in better read performance, because the data and
+    page headers are encoded separately, so data can be read
+    directly into the output without additional allocations or
+    copies. This feature is considered experimental, but we
+    believe it works well for most use cases (i.e., our test
+    suite) and should be readable by all modern parquet
+    frameworks, including arrow and spark.
+  * Pandas nullable types: pandas supports "masked" extension
+    arrays for types that previously could not support NULL at
+    all: ints and bools. Fastparquet used to cast such columns to
+    float, so that we could represent NULLs as NaN; now we use
+    the new(er) masked types by default. This means faster
+    reading of such columns, as there is no conversion. If the
+    metadata guarantees that there are no nulls, we still use the
+    non-nullable variant, unless the data was written with
+    fastparquet/pyarrow and the metadata indicates that the
+    original datatype was nullable. We already handled writing of
+    nullable columns.
+
 -------------------------------------------------------------------
 Tue May 18 14:41:46 UTC 2021 - Ben Greiner

diff --git a/python-fastparquet.spec b/python-fastparquet.spec
index 804344a..a93e6eb 100644
--- a/python-fastparquet.spec
+++ b/python-fastparquet.spec
@@ -21,7 +21,7 @@
 %define skip_python2 1
 %define skip_python36 1
 Name:           python-fastparquet
-Version:        0.6.3
+Version:        0.7.1
 Release:        0
 Summary:        Python support for Parquet file format
 License:        Apache-2.0
@@ -29,8 +29,9 @@ URL:            https://github.com/dask/fastparquet/
 Source:         https://github.com/dask/fastparquet/archive/%{version}.tar.gz#/fastparquet-%{version}.tar.gz
 BuildRequires:  %{python_module Cython}
 BuildRequires:  %{python_module cramjam >= 2.3.0}
-BuildRequires:  %{python_module fsspec}
-BuildRequires:  %{python_module numpy-devel >= 1.11}
+# version requirement not declared for runtime, but necessary for tests.
+BuildRequires:  %{python_module fsspec >= 2021.6.0}
+BuildRequires:  %{python_module numpy-devel >= 1.18}
 BuildRequires:  %{python_module pandas >= 1.1.0}
 BuildRequires:  %{python_module pytest}
 BuildRequires:  %{python_module python-lzo}
@@ -40,7 +41,7 @@ BuildRequires:  fdupes
 BuildRequires:  python-rpm-macros
 Requires:       python-cramjam >= 2.3.0
 Requires:       python-fsspec
-Requires:       python-numpy >= 1.11
+Requires:       python-numpy >= 1.18
 Requires:       python-pandas >= 1.1.0
 Requires:       python-thrift >= 0.11.0
 Recommends:     python-python-lzo
@@ -54,6 +55,8 @@ for integrating it into python-based Big Data workflows.
 %setup -q -n fastparquet-%{version}
 # remove pytest-runner from setup_requires
 sed -i "s/'pytest-runner',//" setup.py
+# this is not meant for setup.py
+sed -i "s/oldest-supported-numpy/numpy/" setup.py
 # the tests import the fastparquet.test module and we need to import from sitearch, so install it.
 sed -i -e "s/^\s*packages=\[/&'fastparquet.test', /" -e "/exclude_package_data/ d" setup.py
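
For reference, a minimal, hypothetical Python sketch of the 0.7.x behaviour
changes described in the changelog above. The file name and column names are
invented, and the row_filter flag and DATAPAGE_VERSION value follow my reading
of the upstream release notes rather than anything in this package:

  # Sketch only: "example.parquet", "x", and "when" are invented names.
  import pandas as pd
  import fastparquet

  df = pd.DataFrame({
      # pandas nullable ("masked") integer column; 0.7.x reads these back
      # as nullable by default instead of casting to float
      "x": pd.array([1, 2, None], dtype="Int64"),
      # tz-aware timestamps, stored as int96 for Spark compatibility
      "when": pd.date_range("2021-08-08", periods=3, tz="UTC"),
  })
  fastparquet.write("example.parquet", df, times="int96")

  # Opt out of nullable types and get the old cast-to-float behaviour
  pf = fastparquet.ParquetFile("example.parquet", pandas_nulls=False)

  # Row-level filtering within row-groups; the (column, op, value)
  # filter syntax is unchanged from earlier releases
  out = pf.to_pandas(filters=[("x", ">", 1)], row_filter=True)

  # DataPageV2 writing is off by default; per the release notes it can be
  # toggled via the FASTPARQUET_DATAPAGE_V2 environment variable or the
  # module global (the value 2 is an assumption):
  # fastparquet.writer.DATAPAGE_VERSION = 2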