python-fastparquet/python-fastparquet.changes
Matej Cepl 4cca0bf52a Accepting request 910725 from home:bnavigator:branches:devel:languages:python:numeric
- Update to version 0.7.1
  * Back compile for older versions of numpy
  * Make pandas nullable types opt-out. The old behaviour (casting
    to float) is still available with ParquetFile(...,
    pandas_nulls=False).
  * Fix time field regression: IsAdjustedToUTC will be False when
    there is no timezone
  * Micro improvements to the speed of ParquetFile creation by
    using simple simple string ops instead of regex and
    regularising filenames once at the start. Effects datasets with
    many files.
- Release 0.7.0
  * This version institutes major, breaking changes, listed here,
    and incremental fixes and additions.
  * Reading a directory without a _metadata summary file now works
    by providing only the directory, instead of a list of
    constituent files. This change also makes direct of use of
    fsspec filesystems, if given, to be able to load the footer
    metadata areas of the files concurrently, if the storage
    backend supports it, and not directly instantiating
    intermediate ParquetFile instances
  * row-level filtering of the data. Whereas previously, only full 
    row-groups could be excluded on the basis of their parquet 
    metadata statistics (if present), filtering can now be done 
    within row-groups too. The syntax is the same as before, 
    allowing for multiple column expressions to be combined with 
    AND|OR, depending on the list structure. This mechanism 
    requires two passes: one to load the columns needed to create 
    the boolean mask, and another to load the columns actually 
    needed in the output. This will not be faster, and may be 
    slower, but in some cases can save significant memory 
    footprint, if a small fraction of rows are considered good and 
    the columns for the filter expression are not in the output. 
    Not currently supported for reading with DataPageV2.
  * DELTA integer encoding (read-only): experimentally working, 
    but we only have one test file to verify against, since it is 
    not trivial to persuade Spark to produce files encoded this 
    way. DELTA can be extremely compact a representation for 
    slowly varying and/or monotonically increasing integers.
  * nanosecond resolution times: the new extended "logical" types 
    system supports nanoseconds alongside the previous millis and 
    micros. We now emit these for the default pandas time type, 
    and produce full parquet schema including both "converted" and 
    "logical" type information. Note that all output has 
    isAdjustedToUTC=True, i.e., these are timestamps rather than 
    local time. The time-zone is stored in the metadata, as 
    before, and will be successfully recreated only in fastparquet 
    and (py)arrow. Otherwise, the times will appear to be UTC. For 
    compatibility with Spark, you may still want to use 
    times="int96" when writing.
  * DataPageV2 writing: now we support both reading and writing. 
    For writing, can be enabled with the environment variable 
    FASTPARQUET_DATAPAGE_V2, or module global fastparquet.writer.
    DATAPAGE_VERSION and is off by default. It will become on by 
    default in the future. In many cases, V2 will result in better 
    read performance, because the data and page headers are 
    encoded separately, so data can be directly read into the 
    output without addition allocation/copies. This feature is 
    considered experimental, but we believe it working well for 
    most use cases (i.e., our test suite) and should be readable 
    by all modern parquet frameworks including arrow and spark.
  * pandas nullable types: pandas supports "masked" extension 
    arrays for types that previously could not support NULL at 
    all: ints and bools. Fastparquet used to cast such columns to 
    float, so that we could represent NULLs as NaN; now we use the 
    new(er) masked types by default. This means faster reading of 
    such columns, as there is no conversion. If the metadata 
    guarantees that there are no nulls, we still use the 
    non-nullable variant unless the data was written with 
    fastparquet/pyarrow, and the metadata indicates that the 
    original datatype was nullable. We already handled writing of 
    nullable columns.

OBS-URL: https://build.opensuse.org/request/show/910725
OBS-URL: https://build.opensuse.org/package/show/devel:languages:python:numeric/python-fastparquet?expand=0&rev=34
2021-08-09 13:21:06 +00:00

375 lines
15 KiB
Plaintext

-------------------------------------------------------------------
Sun Aug 8 15:13:55 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Update to version 0.7.1
* Back compile for older versions of numpy
* Make pandas nullable types opt-out. The old behaviour (casting
to float) is still available with ParquetFile(...,
pandas_nulls=False).
* Fix time field regression: IsAdjustedToUTC will be False when
there is no timezone
* Micro improvements to the speed of ParquetFile creation by
using simple simple string ops instead of regex and
regularising filenames once at the start. Effects datasets with
many files.
- Release 0.7.0
* This version institutes major, breaking changes, listed here,
and incremental fixes and additions.
* Reading a directory without a _metadata summary file now works
by providing only the directory, instead of a list of
constituent files. This change also makes direct of use of
fsspec filesystems, if given, to be able to load the footer
metadata areas of the files concurrently, if the storage
backend supports it, and not directly instantiating
intermediate ParquetFile instances
* row-level filtering of the data. Whereas previously, only full
row-groups could be excluded on the basis of their parquet
metadata statistics (if present), filtering can now be done
within row-groups too. The syntax is the same as before,
allowing for multiple column expressions to be combined with
AND|OR, depending on the list structure. This mechanism
requires two passes: one to load the columns needed to create
the boolean mask, and another to load the columns actually
needed in the output. This will not be faster, and may be
slower, but in some cases can save significant memory
footprint, if a small fraction of rows are considered good and
the columns for the filter expression are not in the output.
Not currently supported for reading with DataPageV2.
* DELTA integer encoding (read-only): experimentally working,
but we only have one test file to verify against, since it is
not trivial to persuade Spark to produce files encoded this
way. DELTA can be extremely compact a representation for
slowly varying and/or monotonically increasing integers.
* nanosecond resolution times: the new extended "logical" types
system supports nanoseconds alongside the previous millis and
micros. We now emit these for the default pandas time type,
and produce full parquet schema including both "converted" and
"logical" type information. Note that all output has
isAdjustedToUTC=True, i.e., these are timestamps rather than
local time. The time-zone is stored in the metadata, as
before, and will be successfully recreated only in fastparquet
and (py)arrow. Otherwise, the times will appear to be UTC. For
compatibility with Spark, you may still want to use
times="int96" when writing.
* DataPageV2 writing: now we support both reading and writing.
For writing, can be enabled with the environment variable
FASTPARQUET_DATAPAGE_V2, or module global fastparquet.writer.
DATAPAGE_VERSION and is off by default. It will become on by
default in the future. In many cases, V2 will result in better
read performance, because the data and page headers are
encoded separately, so data can be directly read into the
output without addition allocation/copies. This feature is
considered experimental, but we believe it working well for
most use cases (i.e., our test suite) and should be readable
by all modern parquet frameworks including arrow and spark.
* pandas nullable types: pandas supports "masked" extension
arrays for types that previously could not support NULL at
all: ints and bools. Fastparquet used to cast such columns to
float, so that we could represent NULLs as NaN; now we use the
new(er) masked types by default. This means faster reading of
such columns, as there is no conversion. If the metadata
guarantees that there are no nulls, we still use the
non-nullable variant unless the data was written with
fastparquet/pyarrow, and the metadata indicates that the
original datatype was nullable. We already handled writing of
nullable columns.
-------------------------------------------------------------------
Tue May 18 14:41:46 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Update to version 0.6.3
* no release notes
* new requirement: cramjam instead of separate compression libs
and their bindings
* switch from numba to Cython
-------------------------------------------------------------------
Fri Feb 12 14:50:18 UTC 2021 - Dirk Müller <dmueller@suse.com>
- skip python 36 build
-------------------------------------------------------------------
Thu Feb 4 17:50:32 UTC 2021 - Jan Engelhardt <jengelh@inai.de>
- Use of "+=" in %check warrants bash as buildshell.
-------------------------------------------------------------------
Wed Feb 3 21:43:10 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Skip the import without warning test gh#dask/fastparquet#558
- Apply the Cepl-Strangelove-Parameter to pytest
(--import-mode append)
-------------------------------------------------------------------
Sat Jan 2 21:04:30 UTC 2021 - Benjamin Greiner <code@bnavigator.de>
- update to version 0.5
* no changelog
- update test suite setup -- install the .test module
-------------------------------------------------------------------
Sat Jul 18 18:13:53 UTC 2020 - Arun Persaud <arun@gmx.de>
- specfile:
* update requirements: version numbers and added packaging
- update to version 0.4.1:
* nulls, fixes #504
* deps: Add missing dependency on packaging. (#502)
-------------------------------------------------------------------
Thu Jul 9 14:04:10 UTC 2020 - Marketa Calabkova <mcalabkova@suse.com>
- Update to 0.4.0
* Changed RangeIndex private methods to public ones
* Use the python executable used to run the code
* Add support for Python 3.8
* support for numba > 0.48
- drop upstreamed patch use-python-exec.patch
-------------------------------------------------------------------
Mon Apr 6 06:54:36 UTC 2020 - Tomáš Chvátal <tchvatal@suse.com>
- Add patch to use sys.executable and not call py2 binary directly:
* use-python-exec.patch
-------------------------------------------------------------------
Mon Apr 6 06:50:26 UTC 2020 - Tomáš Chvátal <tchvatal@suse.com>
- Update to 0.3.3:
* no upstream changelog
-------------------------------------------------------------------
Fri Oct 25 17:50:50 UTC 2019 - Todd R <toddrme2178@gmail.com>
- Drop broken python 2 support.
- Testing fixes
-------------------------------------------------------------------
Sat Aug 3 15:10:41 UTC 2019 - Arun Persaud <arun@gmx.de>
- update to version 0.3.2:
* Only calculate dataset stats once (#453)
* Fixes #436 (#452)
* Fix a crash if trying to read a file whose created_by value is not
set
* COMPAT: Fix for pandas DeprecationWarning (#446)
* Apply timezone to index (#439)
* Handle NaN partition values (#438)
* Pandas meta (#431)
* Only strip _metadata from end of file path (#430)
* Simple nesting fix (#428)
* Disallow bad tz on save, warn on load (#427)
-------------------------------------------------------------------
Tue Jul 30 14:23:21 UTC 2019 - Todd R <toddrme2178@gmail.com>
- Fix spurious test failure
-------------------------------------------------------------------
Mon May 20 15:12:11 CEST 2019 - Matej Cepl <mcepl@suse.com>
- Clean up SPEC file.
-------------------------------------------------------------------
Tue Apr 30 14:28:46 UTC 2019 - Todd R <toddrme2178@gmail.com>
- update to 0.3.1
* Add schema == (__eq__) and != (__ne__) methods and tests.
* Fix item iteration for decimals
* List missing columns in error message
* Fix tz being None case
- Update to 0.3.0
* Squash some warnings and import failures
* Improvements to in and not in operators
* Fixes because pandas released
-------------------------------------------------------------------
Sat Jan 26 17:05:09 UTC 2019 - Arun Persaud <arun@gmx.de>
- specfile:
* update copyright year
- update to version 0.2.1:
* Compat for pandas 0.24.0 refactor (#390)
* Change OverflowError message when failing on large pages (#387)
* Allow for changes in dictionary while reading a row-group column
(#367)
* Correct pypi project names for compression libraries (#385)
-------------------------------------------------------------------
Thu Nov 22 22:47:24 UTC 2018 - Arun Persaud <arun@gmx.de>
- update to version 0.2.0:
* Don't mutate column list input (#383) (#384)
* Add optional requirements to extras_require (#380)
* Fix "broken link to parquet-format page" (#377)
* Add .c file to repo
* Handle rows split across 2 pages in the case of a map (#369)
* Fixes 370 (#371)
* Handle multi-page maps (#368)
* Handle zero-column files. Closes #361. (#363)
-------------------------------------------------------------------
Sun Sep 30 16:22:56 UTC 2018 - Arun Persaud <arun@gmx.de>
- specfile:
* update url
* make %files section more specific
- update to version 0.1.6:
* Restrict what categories get passed through (#358)
* Deep digging for multi-indexes (#356)
* allow_empty is the default in >=zstandard-0.9 (#355)
* Remove setup_requires from setup.py (#345)
* Fixed error if a certain partition is empty, when writing a
partioned (#347)
* Allow UTF8 column names to be read (#342)
* readd test file
* Allow for NULL converted type (#340)
* Robust partition names (#336)
* Fix accidental multiindex
* Read multi indexes (#331)
* Allow reading from any file-like (#330)
* change `parquet-format` link to apache repo (#328)
* Remove extra space from api.py (#325)
* numba bool fun (#324)
- changes from version 0.1.5:
* Fix _dtypes to be more efficient, to work with files with lots of
columns (#318)
* Buildfix (#313)
* Use LZ4 block compression for compatibility with parquet-cpp
(#314) (#315)
* Fix typo in ParquetFile docstring (#312)
* Remove annoying print() when reading file with CategoricalDtype
index (#311)
* Allow lists of multi-file data-sets (#309)
* Acceleate dataframe.empty for small/medium sizes (#307)
* Include dictionary page in column size (#306)
* Fix for selecting columns which were used for partitioning (#304)
* Remove occurances of np.fromstring (#303)
* Add support for zstandard compression (#296)
* Int96time order (#298)
- changes from version 0.1.4:
* Add handling of keyword arguments for compressor (#294)
* Fix setup.py duplication (#295)
* Integrate pytest with setup.py (#293)
* Get setup.py pytest to work. (#287)
* Add LZ4 support (#292)
* Update for forthcoming thrift release (#281)
* If timezones are in pandas metadata, assign columns as required
(#285)
* Pandas import (#284)
* Copy FMDs instead of mutate (#279)
* small fixes (#278)
* fixes to get benchmark to work (#276)
* backwards compat with Dask
* Fix test_time_millis on Windows (#275)
* join paths os-independently (#271)
* Adds int32 support for object encoding (#268)
* Fix a couple small typos in documentation (#267)
* Partition order should be sorted (#265)
* COMPAT: Update thrift (#264)
* Speedups result (#253)
* Remove thrift_copy
* Define `__copy__` on thrift structures
* Update rtd deps
- changes from version 0.1.3:
* More care over append when partitioning multiple columns
* Sep for windows cats filtering
* Move pytest imports to tests/ remove requirememnt
* Special-case only zeros
* Cope with partition values like "07"
* fix for s3
* Fix for list of paths rooted in the current directory
* add test
* Explicit file opens
* update docstring
* Refactor partition interpretation
* py2 fix
* Error in test changed
* Better error messages when failed to cnovert on write
- changes from version 0.1.2:
* Revert accidental removal of s3 import
* Move thrift things together, and make thrift serializer for pickle
* COMPAT: for new pandas CategoricalDtype
* Fixup for backwards seeking.
* Fix some test failures
* Protptype version using thrift instead of thriftpy
* Not all mergers have cats
* Revert accidental deletion
* remove warnings
* Sort keys in json for metadata
* Check column chunks for categories sizes
* Account for partition dir names with numbers
* Fix map/list doc
* Catch more stats errors
* Prevent pandas auto-names being given to index
- changes from version 0.1.1:
* Add workaround for single-value-partition
* update test
* Simplify and fix for py2
* Use thrift encoding on statistics strings
* remove redundant SNAPPY from supported compressions list
* Fix statistics
* lists again
* Always convert int96 to times
* Update docs
* attribute typo
* Fix definition level
* Add test, clean columns
* Allow optional->optional lists and maps
* Flatten schema to enable loading of non-repeated columns
* Remove extra file
* Fix py2
* Fix "in" filter to cope with strings that could be numbers
* Allow pip install without NumPy or Cython
- changes from version 0.1.0:
* Add ParquetFile attribute documentation
* Fix tests
* Enable append to an empty dataset
* More warning words and check on partition_on
* Do not fail stats if there are no row-groups
* Fix "numpy_dtype"->"numpy_type
* "in" was checking range not exact membership of set
* If metadata gives index, put in columns
* Fix pytest warning
* Fail on ordering dict statistics
* Fix stats filter
* clean test
* Fix ImportWarning on Python 3.6+
* TEST: added updated test file for special strings used in filters
* fix links
* [README]: indicate dependency on LLVM 4.0.x.
* Filter stats had unfortunate converted_type check
* Ignore exceptions in val_to_num
* Also for TODAY
* Very special case for partition: NOW should be kept as string
* Allow partition_on; fix category nuls
* Remove old category key/values on writing
* Implement writing pandas metadata and auto-setting cats/index
* Pandas compatability
* Test and fix for filter on single file
* Do not attempt to recurse into schema elements with zero childrean
-------------------------------------------------------------------
Thu Jun 7 20:41:31 UTC 2018 - jengelh@inai.de
- Fixup grammar./Replace future aims with what it does now.
-------------------------------------------------------------------
Thu May 3 14:07:08 UTC 2018 - toddrme2178@gmail.com
- Use %license tag
-------------------------------------------------------------------
Thu May 25 12:19:26 UTC 2017 - toddrme2178@gmail.com
- Initial version