python-fastparquet/python-fastparquet.changes

-------------------------------------------------------------------
Tue Dec 5 12:23:32 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 2023.10.0:
* Datetime units in empty() with tz (#893)
* Fewer inplace decompressions for V2 pages (#890)
* Allow writing categorical column with no categories (#888)
* Fixes for new numpy (#886)
* RLE bools and DELTA for v1 pages (#885, 883)
-------------------------------------------------------------------
Mon Sep 11 21:29:16 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 2023.8.0:
* More general timestamp units (#874)
* ReadTheDocs V2 (#871)
* Better roundtrip dtypes (#861, 859)
* No convert when computing bytes-per-item for str (#858)
-------------------------------------------------------------------
Sat Jul 1 20:05:36 UTC 2023 - Arun Persaud <arun@gmx.de>
- update to version 2023.7.0:
* Add test case for reading non-pandas parquet file (#870)
* Extra field when cloning ParquetFile (#866)
-------------------------------------------------------------------
Fri Apr 28 08:10:46 UTC 2023 - Dirk Müller <dmueller@suse.com>
- update to 2023.4.0:
* allow loading categoricals even if not so in the pandas metadata,
when a column is dict-encoded and we only have one row-group (#863)
* apply dtype to the columns names series, even when selecting no
columns (#861, 859)
* don't make strings while estimating byte column size (#858)
* handle upstream depr (#857, 856)
-------------------------------------------------------------------
Thu Feb 9 15:55:08 UTC 2023 - Arun Persaud <arun@gmx.de>
- update to version 2023.2.0:
* revert one-level set of filters (#852)
* full size dict for decoding V2 pages (#850)
* infer_object_encoding fix (#847)
* row filtering with V2 pages (#845)
-------------------------------------------------------------------
Wed Feb 8 18:25:03 UTC 2023 - Arun Persaud <arun@gmx.de>
- specfile:
* remove fastparquet-pr835.patch, implemented upstream
- update to version 2023.1.0:
* big improvement to write speed
* paging support for bigger row-groups
* pandas 2.0 support
* delta for big-endian architecture
-------------------------------------------------------------------
Mon Jan 2 20:38:49 UTC 2023 - Ben Greiner <code@bnavigator.de>
- Update to 2022.12.0
* check all int32 values before passing to thrift writer
* fix type of num_rows to i64 for big single file
- Release 2022.11.0
* Switch to calver
* Speed up loading of nullable types
* Allow schema evolution by addition of columns
* Allow specifying dtypes of output
* update to scm versioning
* fixes to row filter, statistics and tests
* support pathlib.Paths
* JSON encoder options
- Drop fastparquet-pr813-updatefixes.patch
-------------------------------------------------------------------
Fri Dec 23 09:18:39 UTC 2022 - Guillaume GARDET <guillaume.gardet@opensuse.org>
- Add patch to fix the test test_delta_from_def_2 on
aarch64, armv7 and ppc64le:
* fastparquet-pr835.patch
-------------------------------------------------------------------
Fri Oct 28 15:47:41 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Update to 0.8.3
* improved key/value handling and rejection of bad types
* fix regression in consolidate_cats (caught in dask tests)
- Release 0.8.2
* datetime indexes initialised to 0 to prevent overflow from
random memory
* case from csv_to_parquet where stats exists but has no nulls
entry
* define len and bool for ParquetFile
* maintain int types of optional data that came from pandas
* fix for delta encoding
- Add fastparquet-pr813-updatefixes.patch gh#dask/fastparquet#813
-------------------------------------------------------------------
Tue Apr 26 11:02:27 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Update to 0.8.1
* fix critical buffer overflow crash for large number of columns
and long column names
* metadata handling
* thrift int32 for list
* avoid error storing NaNs in column stats
-------------------------------------------------------------------
Sat Jan 29 21:36:38 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Update to 0.8.0
* our own cythonic thrift implementation (drop thrift dependency)
* more in-place dataset editing and reordering
* python 3.10 support
* fixes for multi-index and pandas types
- Clean test skips
-------------------------------------------------------------------
Sun Jan 16 13:34:53 UTC 2022 - Ben Greiner <code@bnavigator.de>
- Clean specfile from unused python36 conditionals
- Require thrift 0.15.0 (+patch) for Python 3.10 compatibility
* gh#dask/fastparquet#514
-------------------------------------------------------------------
Sat Nov 27 20:34:53 UTC 2021 - Arun Persaud <arun@gmx.de>
- update to version 0.7.2:
* Ability to remove row-groups in-place for multifile datasets
* Accept pandas nullable Float type
* allow empty strings and fix min/max when there is no data
* make writing statistics optional
* row selection in to_pandas()
-------------------------------------------------------------------
Sun Aug 8 15:13:55 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Update to version 0.7.1
* Back compile for older versions of numpy
* Make pandas nullable types opt-out. The old behaviour (casting
to float) is still available with ParquetFile(...,
pandas_nulls=False).
* Fix time field regression: IsAdjustedToUTC will be False when
there is no timezone
* Micro improvements to the speed of ParquetFile creation by
using simple string ops instead of regex and
regularising filenames once at the start. Affects datasets with
many files.
- Release 0.7.0
* This version institutes major, breaking changes, listed here,
and incremental fixes and additions.
* Reading a directory without a _metadata summary file now works
by providing only the directory, instead of a list of
constituent files. This change also makes direct use of
fsspec filesystems, if given, to be able to load the footer
metadata areas of the files concurrently, if the storage
backend supports it, and not directly instantiating
intermediate ParquetFile instances.
* row-level filtering of the data. Whereas previously, only full
row-groups could be excluded on the basis of their parquet
metadata statistics (if present), filtering can now be done
within row-groups too. The syntax is the same as before,
allowing for multiple column expressions to be combined with
AND|OR, depending on the list structure. This mechanism
requires two passes: one to load the columns needed to create
the boolean mask, and another to load the columns actually
needed in the output. This will not be faster, and may be
slower, but in some cases can save significant memory
footprint, if a small fraction of rows are considered good and
the columns for the filter expression are not in the output.
Not currently supported for reading with DataPageV2.
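The two-pass mechanism described in the row-level filtering entry can be
sketched in plain Python. This is an illustration only, not fastparquet's
implementation: the helper names (build_mask, select_rows), the in-memory
"row-group" dictionary and the supported operator set are all hypothetical,
and the list structure (inner lists ANDed, outer list ORed) is an assumption
about what "depending on the list structure" means.

```python
import operator

# Comparison operators supported by this sketch (an assumed subset).
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda value, options: value in options,
}


def build_mask(columns, filters):
    """Pass 1: evaluate the filters using only the filter columns.

    `filters` is a list of lists of (column, op, value) tuples.
    Tuples inside an inner list are ANDed; inner lists are ORed.
    """
    n_rows = len(next(iter(columns.values())))
    mask = [False] * n_rows
    for conjunction in filters:
        for i in range(n_rows):
            if all(OPS[op](columns[col][i], val)
                   for col, op, val in conjunction):
                mask[i] = True
    return mask


def select_rows(columns, names, mask):
    """Pass 2: read only the output columns, keeping the masked rows."""
    return {name: [v for v, keep in zip(columns[name], mask) if keep]
            for name in names}


# Hypothetical in-memory row-group: column name -> list of values.
row_group = {
    "id": [1, 2, 3, 4, 5],
    "score": [0.1, 0.9, 0.5, 0.95, 0.2],
    "label": ["a", "b", "c", "d", "e"],
}

# (score > 0.8) OR (id == 1 AND score < 0.5)
filters = [[("score", ">", 0.8)], [("id", "==", 1), ("score", "<", 0.5)]]
mask = build_mask(row_group, filters)
result = select_rows(row_group, ["label"], mask)
print(result)
```

Here the filter columns (id, score) are needed only for the mask and never
appear in the output, which is the case where the entry says memory can be
saved.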
* DELTA integer encoding (read-only): experimentally working,
but we only have one test file to verify against, since it is
not trivial to persuade Spark to produce files encoded this
way. DELTA can be an extremely compact representation for
slowly varying and/or monotonically increasing integers.
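Why delta encoding is compact for slowly varying or monotonically increasing
integers can be seen in a small sketch. This is a simplified illustration
under stated assumptions, not the actual Parquet DELTA format (which packs
values in blocks and miniblocks) and not fastparquet code; the function names
and the sample data are hypothetical.

```python
def delta_encode(values):
    """Return (first value, minimum delta, offsets above the minimum delta)."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas) if deltas else 0
    return values[0], min_delta, [d - min_delta for d in deltas]


def delta_decode(first, min_delta, offsets):
    """Rebuild the original values from the encoded form."""
    out = [first]
    for off in offsets:
        out.append(out[-1] + min_delta + off)
    return out


def bits_needed(numbers):
    """Bit width required to store the largest non-negative number."""
    return max((n.bit_length() for n in numbers), default=0)


# Hypothetical data: one reading per minute, with slight jitter.
timestamps = [1_628_430_000 + 60 * i for i in range(8)]
timestamps[5] += 3

first, min_delta, offsets = delta_encode(timestamps)
print(bits_needed(timestamps), "bits per raw value")
print(bits_needed(offsets), "bits per stored offset")
```

Only the first value and the minimum delta are stored at full width; every
other value is reduced to a small non-negative offset.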
* nanosecond resolution times: the new extended "logical" types
system supports nanoseconds alongside the previous millis and
micros. We now emit these for the default pandas time type,
and produce full parquet schema including both "converted" and
"logical" type information. Note that all output has
isAdjustedToUTC=True, i.e., these are timestamps rather than
local time. The time-zone is stored in the metadata, as
before, and will be successfully recreated only in fastparquet
and (py)arrow. Otherwise, the times will appear to be UTC. For
compatibility with Spark, you may still want to use
times="int96" when writing.
* DataPageV2 writing: now we support both reading and writing.
For writing, can be enabled with the environment variable
FASTPARQUET_DATAPAGE_V2 or the module global
fastparquet.writer.DATAPAGE_VERSION, and is off by default. It
will become on by default in the future. In many cases, V2 will
result in better read performance, because the data and page
headers are encoded separately, so data can be read directly
into the output without additional allocation/copies. This
feature is considered experimental, but we believe it is
working well for most use cases (i.e., our test suite) and
should be readable
by all modern parquet frameworks including arrow and spark.
* pandas nullable types: pandas supports "masked" extension
arrays for types that previously could not support NULL at
all: ints and bools. Fastparquet used to cast such columns to
float, so that we could represent NULLs as NaN; now we use the
new(er) masked types by default. This means faster reading of
such columns, as there is no conversion. If the metadata
guarantees that there are no nulls, we still use the
non-nullable variant unless the data was written with
fastparquet/pyarrow, and the metadata indicates that the
original datatype was nullable. We already handled writing of
nullable columns.
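The times="int96" option mentioned in the nanosecond-times item above refers to Parquet's legacy 12-byte timestamp layout: 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian. The following is a minimal standard-library sketch of that layout for illustration only. The helper names encode_int96 and decode_int96 are hypothetical and are not part of fastparquet, and Python's datetime limits the round trip to microsecond precision.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 10**9
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96(ts: datetime) -> bytes:
    """Pack a timezone-aware datetime into the 12-byte INT96 layout (hypothetical helper)."""
    delta = ts.astimezone(timezone.utc) - EPOCH
    # Integer arithmetic only, so no float rounding creeps in.
    total_nanos = (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000
    days, nanos_of_day = divmod(total_nanos, NANOS_PER_DAY)
    # 8 bytes nanoseconds-of-day, then 4 bytes Julian day, little-endian, no padding.
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)


def decode_int96(raw: bytes) -> datetime:
    """Inverse of encode_int96; sub-microsecond digits are dropped (hypothetical helper)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1_000)


sample = datetime(2021, 8, 9, 13, 21, 6, tzinfo=timezone.utc)
packed = encode_int96(sample)
restored = decode_int96(packed)
```

Because the layout carries no time-zone or unit metadata, readers have to agree on the convention out of band, which is why it mainly matters for compatibility with older consumers.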
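The pandas nullable types item above contrasts two ways of representing NULL in an integer column: casting to float and using NaN, or keeping the integers alongside a boolean mask. The sketch below mimics both in pure Python to show why the masked form avoids a lossy conversion. MaskedInts and cast_to_float are hypothetical stand-ins, not pandas or fastparquet code, and pandas' real masked arrays are implemented differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MaskedInts:
    """Toy masked integer column: values plus a mask where True marks NULL."""
    values: List[int]
    mask: List[bool]

    @classmethod
    def from_optional(cls, items: List[Optional[int]]) -> "MaskedInts":
        # Store a placeholder 0 under each NULL; the mask is what marks it missing.
        return cls([0 if x is None else x for x in items],
                   [x is None for x in items])

    def to_optional(self) -> List[Optional[int]]:
        return [None if m else v for v, m in zip(self.values, self.mask)]


def cast_to_float(items: List[Optional[int]]) -> List[float]:
    """The older approach: represent NULL as NaN, which forces a float column."""
    return [float("nan") if x is None else float(x) for x in items]


big = 2**53 + 1  # not exactly representable as a 64-bit float
column = [1, None, big]

masked = MaskedInts.from_optional(column)
as_float = cast_to_float(column)
```

In this toy model the masked column round-trips exactly, while the float column silently changes the large integer; the same reasoning motivates the opt-out flag pandas_nulls=False being the exception rather than the default.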
-------------------------------------------------------------------
Tue May 18 14:41:46 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Update to version 0.6.3
* no release notes
* new requirement: cramjam instead of separate compression libs
and their bindings
* switch from numba to Cython
-------------------------------------------------------------------
Fri Feb 12 14:50:18 UTC 2021 - Dirk Müller <dmueller@suse.com>
- skip python 36 build
-------------------------------------------------------------------
Thu Feb 4 17:50:32 UTC 2021 - Jan Engelhardt <jengelh@inai.de>
- Use of "+=" in %check warrants bash as buildshell.
-------------------------------------------------------------------
Wed Feb 3 21:43:10 UTC 2021 - Ben Greiner <code@bnavigator.de>
- Skip the import without warning test gh#dask/fastparquet#558
- Apply the Cepl-Strangelove-Parameter to pytest
(--import-mode append)
-------------------------------------------------------------------
Sat Jan 2 21:04:30 UTC 2021 - Benjamin Greiner <code@bnavigator.de>
- update to version 0.5
* no changelog
- update test suite setup -- install the .test module
-------------------------------------------------------------------
Sat Jul 18 18:13:53 UTC 2020 - Arun Persaud <arun@gmx.de>
- specfile:
* update requirements: version numbers and added packaging
- update to version 0.4.1:
* nulls, fixes #504
* deps: Add missing dependency on packaging. (#502)
-------------------------------------------------------------------
Thu Jul 9 14:04:10 UTC 2020 - Marketa Calabkova <mcalabkova@suse.com>
- Update to 0.4.0
* Changed RangeIndex private methods to public ones
* Use the python executable used to run the code
* Add support for Python 3.8
* support for numba > 0.48
- drop upstreamed patch use-python-exec.patch
-------------------------------------------------------------------
Mon Apr 6 06:54:36 UTC 2020 - Tomáš Chvátal <tchvatal@suse.com>
- Add patch to use sys.executable and not call py2 binary directly:
* use-python-exec.patch
-------------------------------------------------------------------
Mon Apr 6 06:50:26 UTC 2020 - Tomáš Chvátal <tchvatal@suse.com>
- Update to 0.3.3:
* no upstream changelog
-------------------------------------------------------------------
Fri Oct 25 17:50:50 UTC 2019 - Todd R <toddrme2178@gmail.com>
- Drop broken python 2 support.
- Testing fixes
-------------------------------------------------------------------
Sat Aug 3 15:10:41 UTC 2019 - Arun Persaud <arun@gmx.de>
- update to version 0.3.2:
* Only calculate dataset stats once (#453)
* Fixes #436 (#452)
* Fix a crash if trying to read a file whose created_by value is not
set
* COMPAT: Fix for pandas DeprecationWarning (#446)
* Apply timezone to index (#439)
* Handle NaN partition values (#438)
* Pandas meta (#431)
* Only strip _metadata from end of file path (#430)
* Simple nesting fix (#428)
* Disallow bad tz on save, warn on load (#427)
-------------------------------------------------------------------
Tue Jul 30 14:23:21 UTC 2019 - Todd R <toddrme2178@gmail.com>
- Fix spurious test failure
-------------------------------------------------------------------
Mon May 20 15:12:11 CEST 2019 - Matej Cepl <mcepl@suse.com>
- Clean up SPEC file.
-------------------------------------------------------------------
Tue Apr 30 14:28:46 UTC 2019 - Todd R <toddrme2178@gmail.com>
- update to 0.3.1
* Add schema == (__eq__) and != (__ne__) methods and tests.
* Fix item iteration for decimals
* List missing columns in error message
* Fix tz being None case
- Update to 0.3.0
* Squash some warnings and import failures
* Improvements to in and not in operators
* Fixes because pandas released
-------------------------------------------------------------------
Sat Jan 26 17:05:09 UTC 2019 - Arun Persaud <arun@gmx.de>
- specfile:
* update copyright year
- update to version 0.2.1:
* Compat for pandas 0.24.0 refactor (#390)
* Change OverflowError message when failing on large pages (#387)
* Allow for changes in dictionary while reading a row-group column
(#367)
* Correct pypi project names for compression libraries (#385)
-------------------------------------------------------------------
Thu Nov 22 22:47:24 UTC 2018 - Arun Persaud <arun@gmx.de>
- update to version 0.2.0:
* Don't mutate column list input (#383) (#384)
* Add optional requirements to extras_require (#380)
* Fix "broken link to parquet-format page" (#377)
* Add .c file to repo
* Handle rows split across 2 pages in the case of a map (#369)
* Fixes 370 (#371)
* Handle multi-page maps (#368)
* Handle zero-column files. Closes #361. (#363)
-------------------------------------------------------------------
Sun Sep 30 16:22:56 UTC 2018 - Arun Persaud <arun@gmx.de>
- specfile:
* update url
* make %files section more specific
- update to version 0.1.6:
* Restrict what categories get passed through (#358)
* Deep digging for multi-indexes (#356)
* allow_empty is the default in >=zstandard-0.9 (#355)
* Remove setup_requires from setup.py (#345)
* Fixed error if a certain partition is empty, when writing a
partitioned (#347)
* Allow UTF8 column names to be read (#342)
* re-add test file
* Allow for NULL converted type (#340)
* Robust partition names (#336)
* Fix accidental multiindex
* Read multi indexes (#331)
* Allow reading from any file-like (#330)
* change `parquet-format` link to apache repo (#328)
* Remove extra space from api.py (#325)
* numba bool fun (#324)
- changes from version 0.1.5:
* Fix _dtypes to be more efficient, to work with files with lots of
columns (#318)
* Buildfix (#313)
* Use LZ4 block compression for compatibility with parquet-cpp
(#314) (#315)
* Fix typo in ParquetFile docstring (#312)
* Remove annoying print() when reading file with CategoricalDtype
index (#311)
* Allow lists of multi-file data-sets (#309)
* Accelerate dataframe.empty for small/medium sizes (#307)
* Include dictionary page in column size (#306)
* Fix for selecting columns which were used for partitioning (#304)
* Remove occurrences of np.fromstring (#303)
* Add support for zstandard compression (#296)
* Int96time order (#298)
- changes from version 0.1.4:
* Add handling of keyword arguments for compressor (#294)
* Fix setup.py duplication (#295)
* Integrate pytest with setup.py (#293)
* Get setup.py pytest to work. (#287)
* Add LZ4 support (#292)
* Update for forthcoming thrift release (#281)
* If timezones are in pandas metadata, assign columns as required
(#285)
* Pandas import (#284)
* Copy FMDs instead of mutate (#279)
* small fixes (#278)
* fixes to get benchmark to work (#276)
* backwards compat with Dask
* Fix test_time_millis on Windows (#275)
* join paths os-independently (#271)
* Adds int32 support for object encoding (#268)
* Fix a couple small typos in documentation (#267)
* Partition order should be sorted (#265)
* COMPAT: Update thrift (#264)
* Speedups result (#253)
* Remove thrift_copy
* Define `__copy__` on thrift structures
* Update rtd deps
- changes from version 0.1.3:
* More care over append when partitioning multiple columns
* Sep for windows cats filtering
* Move pytest imports to tests/ remove requirement
* Special-case only zeros
* Cope with partition values like "07"
* fix for s3
* Fix for list of paths rooted in the current directory
* add test
* Explicit file opens
* update docstring
* Refactor partition interpretation
* py2 fix
* Error in test changed
* Better error messages when failed to convert on write
- changes from version 0.1.2:
* Revert accidental removal of s3 import
* Move thrift things together, and make thrift serializer for pickle
* COMPAT: for new pandas CategoricalDtype
* Fixup for backwards seeking.
* Fix some test failures
* Prototype version using thrift instead of thriftpy
* Not all mergers have cats
* Revert accidental deletion
* remove warnings
* Sort keys in json for metadata
* Check column chunks for categories sizes
* Account for partition dir names with numbers
* Fix map/list doc
* Catch more stats errors
* Prevent pandas auto-names being given to index
- changes from version 0.1.1:
* Add workaround for single-value-partition
* update test
* Simplify and fix for py2
* Use thrift encoding on statistics strings
* remove redundant SNAPPY from supported compressions list
* Fix statistics
* lists again
* Always convert int96 to times
* Update docs
* attribute typo
* Fix definition level
* Add test, clean columns
* Allow optional->optional lists and maps
* Flatten schema to enable loading of non-repeated columns
* Remove extra file
* Fix py2
* Fix "in" filter to cope with strings that could be numbers
* Allow pip install without NumPy or Cython
- changes from version 0.1.0:
* Add ParquetFile attribute documentation
* Fix tests
* Enable append to an empty dataset
* More warning words and check on partition_on
* Do not fail stats if there are no row-groups
* Fix "numpy_dtype"->"numpy_type"
* "in" was checking range not exact membership of set
* If metadata gives index, put in columns
* Fix pytest warning
* Fail on ordering dict statistics
* Fix stats filter
* clean test
* Fix ImportWarning on Python 3.6+
* TEST: added updated test file for special strings used in filters
* fix links
* [README]: indicate dependency on LLVM 4.0.x.
* Filter stats had unfortunate converted_type check
* Ignore exceptions in val_to_num
* Also for TODAY
* Very special case for partition: NOW should be kept as string
* Allow partition_on; fix category nulls
* Remove old category key/values on writing
* Implement writing pandas metadata and auto-setting cats/index
* Pandas compatibility
* Test and fix for filter on single file
* Do not attempt to recurse into schema elements with zero children
-------------------------------------------------------------------
Thu Jun 7 20:41:31 UTC 2018 - jengelh@inai.de
- Fixup grammar. Replace future aims with what it does now.
-------------------------------------------------------------------
Thu May 3 14:07:08 UTC 2018 - toddrme2178@gmail.com
- Use %license tag
-------------------------------------------------------------------
Thu May 25 12:19:26 UTC 2017 - toddrme2178@gmail.com
- Initial version