-------------------------------------------------------------------
Tue Apr 26 11:02:27 UTC 2022 - Ben Greiner

- Update to 0.8.1
  * fix critical buffer overflow crash for large number of columns
    and long column names
  * metadata handling
  * thrift int32 for list
  * avoid error storing NaNs in column stats

-------------------------------------------------------------------
Sat Jan 29 21:36:38 UTC 2022 - Ben Greiner

- Update to 0.8.0
  * our own cythonic thrift implementation (drop thrift dependency)
  * more in-place dataset editing and reordering
  * python 3.10 support
  * fixes for multi-index and pandas types
- Clean test skips

-------------------------------------------------------------------
Sun Jan 16 13:34:53 UTC 2022 - Ben Greiner

- Clean specfile from unused python36 conditionals
- Require thrift 0.15.0 (+patch) for Python 3.10 compatibility
  * gh#dask/fastparquet#514

-------------------------------------------------------------------
Sat Nov 27 20:34:53 UTC 2021 - Arun Persaud

- update to version 0.7.2:
  * Ability to remove row-groups in-place for multifile datasets
  * Accept pandas nullable Float type
  * allow empty strings and fix min/max when there is no data
  * make writing statistics optional
  * row selection in to_pandas()

-------------------------------------------------------------------
Sun Aug 8 15:13:55 UTC 2021 - Ben Greiner

- Update to version 0.7.1
  * Back compile for older versions of numpy
  * Make pandas nullable types opt-out. The old behaviour (casting
    to float) is still available with
    ParquetFile(..., pandas_nulls=False).
  * Fix time field regression: IsAdjustedToUTC will be False when
    there is no timezone
  * Micro improvements to the speed of ParquetFile creation by
    using simple string ops instead of regex and regularising
    filenames once at the start. Affects datasets with many files.
- Release 0.7.0
  * This version institutes major, breaking changes, listed here,
    and incremental fixes and additions.
  * Reading a directory without a _metadata summary file now works
    by providing only the directory, instead of a list of
    constituent files. This change also makes direct use of fsspec
    filesystems, if given, so that the footer metadata areas of the
    files can be loaded concurrently, if the storage backend
    supports it, without directly instantiating intermediate
    ParquetFile instances
  * row-level filtering of the data. Whereas previously, only full
    row-groups could be excluded on the basis of their parquet
    metadata statistics (if present), filtering can now be done
    within row-groups too. The syntax is the same as before,
    allowing for multiple column expressions to be combined with
    AND|OR, depending on the list structure. This mechanism
    requires two passes: one to load the columns needed to create
    the boolean mask, and another to load the columns actually
    needed in the output. This will not be faster, and may be
    slower, but in some cases can save significant memory
    footprint, if a small fraction of rows are considered good and
    the columns for the filter expression are not in the output.
    Not currently supported for reading with DataPageV2.
  * DELTA integer encoding (read-only): experimentally working,
    but we only have one test file to verify against, since it is
    not trivial to persuade Spark to produce files encoded this
    way. DELTA can be an extremely compact representation for
    slowly varying and/or monotonically increasing integers.
  * nanosecond resolution times: the new extended "logical" types
    system supports nanoseconds alongside the previous millis and
    micros. We now emit these for the default pandas time type,
    and produce full parquet schema including both "converted" and
    "logical" type information. Note that all output has
    isAdjustedToUTC=True, i.e., these are timestamps rather than
    local time. The time-zone is stored in the metadata, as
    before, and will be successfully recreated only in fastparquet
    and (py)arrow. Otherwise, the times will appear to be UTC.
    For compatibility with Spark, you may still want to use
    times="int96" when writing.
  * DataPageV2 writing: now we support both reading and writing.
    For writing, it can be enabled with the environment variable
    FASTPARQUET_DATAPAGE_V2, or module global
    fastparquet.writer.DATAPAGE_VERSION, and is off by default. It
    will become on by default in the future. In many cases, V2
    will result in better read performance, because the data and
    page headers are encoded separately, so data can be directly
    read into the output without additional allocation/copies.
    This feature is considered experimental, but we believe it is
    working well for most use cases (i.e., our test suite) and
    should be readable by all modern parquet frameworks including
    arrow and spark.
  * pandas nullable types: pandas supports "masked" extension
    arrays for types that previously could not support NULL at
    all: ints and bools. Fastparquet used to cast such columns to
    float, so that we could represent NULLs as NaN; now we use the
    new(er) masked types by default. This means faster reading of
    such columns, as there is no conversion. If the metadata
    guarantees that there are no nulls, we still use the
    non-nullable variant unless the data was written with
    fastparquet/pyarrow, and the metadata indicates that the
    original datatype was nullable. We already handled writing of
    nullable columns.

-------------------------------------------------------------------
Tue May 18 14:41:46 UTC 2021 - Ben Greiner

- Update to version 0.6.3
  * no release notes
  * new requirement: cramjam instead of separate compression libs
    and their bindings
  * switch from numba to Cython

-------------------------------------------------------------------
Fri Feb 12 14:50:18 UTC 2021 - Dirk Müller

- skip python 36 build

-------------------------------------------------------------------
Thu Feb 4 17:50:32 UTC 2021 - Jan Engelhardt

- Use of "+=" in %check warrants bash as buildshell.
-------------------------------------------------------------------
Wed Feb 3 21:43:10 UTC 2021 - Ben Greiner

- Skip the import without warning test gh#dask/fastparquet#558
- Apply the Cepl-Strangelove-Parameter to pytest
  (--import-mode append)

-------------------------------------------------------------------
Sat Jan 2 21:04:30 UTC 2021 - Benjamin Greiner

- update to version 0.5
  * no changelog
- update test suite setup -- install the .test module

-------------------------------------------------------------------
Sat Jul 18 18:13:53 UTC 2020 - Arun Persaud

- specfile:
  * update requirements: version numbers and added packaging
- update to version 0.4.1:
  * nulls, fixes #504
  * deps: Add missing dependency on packaging. (#502)

-------------------------------------------------------------------
Thu Jul 9 14:04:10 UTC 2020 - Marketa Calabkova

- Update to 0.4.0
  * Changed RangeIndex private methods to public ones
  * Use the python executable used to run the code
  * Add support for Python 3.8
  * support for numba > 0.48
- drop upstreamed patch use-python-exec.patch

-------------------------------------------------------------------
Mon Apr 6 06:54:36 UTC 2020 - Tomáš Chvátal

- Add patch to use sys.executable and not call py2 binary directly:
  * use-python-exec.patch

-------------------------------------------------------------------
Mon Apr 6 06:50:26 UTC 2020 - Tomáš Chvátal

- Update to 0.3.3:
  * no upstream changelog

-------------------------------------------------------------------
Fri Oct 25 17:50:50 UTC 2019 - Todd R

- Drop broken python 2 support.
- Testing fixes

-------------------------------------------------------------------
Sat Aug 3 15:10:41 UTC 2019 - Arun Persaud

- update to version 0.3.2:
  * Only calculate dataset stats once (#453)
  * Fixes #436 (#452)
  * Fix a crash if trying to read a file whose created_by value is
    not set
  * COMPAT: Fix for pandas DeprecationWarning (#446)
  * Apply timezone to index (#439)
  * Handle NaN partition values (#438)
  * Pandas meta (#431)
  * Only strip _metadata from end of file path (#430)
  * Simple nesting fix (#428)
  * Disallow bad tz on save, warn on load (#427)

-------------------------------------------------------------------
Tue Jul 30 14:23:21 UTC 2019 - Todd R

- Fix spurious test failure

-------------------------------------------------------------------
Mon May 20 15:12:11 CEST 2019 - Matej Cepl

- Clean up SPEC file.

-------------------------------------------------------------------
Tue Apr 30 14:28:46 UTC 2019 - Todd R

- update to 0.3.1
  * Add schema == (__eq__) and != (__ne__) methods and tests.
  * Fix item iteration for decimals
  * List missing columns in error message
  * Fix tz being None case
- Update to 0.3.0
  * Squash some warnings and import failures
  * Improvements to in and not in operators
  * Fixes because pandas released

-------------------------------------------------------------------
Sat Jan 26 17:05:09 UTC 2019 - Arun Persaud

- specfile:
  * update copyright year
- update to version 0.2.1:
  * Compat for pandas 0.24.0 refactor (#390)
  * Change OverflowError message when failing on large pages (#387)
  * Allow for changes in dictionary while reading a row-group
    column (#367)
  * Correct pypi project names for compression libraries (#385)

-------------------------------------------------------------------
Thu Nov 22 22:47:24 UTC 2018 - Arun Persaud

- update to version 0.2.0:
  * Don't mutate column list input (#383) (#384)
  * Add optional requirements to extras_require (#380)
  * Fix "broken link to parquet-format page" (#377)
  * Add .c file to repo
  * Handle rows split across 2 pages in the case of a map (#369)
  * Fixes 370 (#371)
  * Handle multi-page maps (#368)
  * Handle zero-column files. Closes #361.
    (#363)

-------------------------------------------------------------------
Sun Sep 30 16:22:56 UTC 2018 - Arun Persaud

- specfile:
  * update url
  * make %files section more specific
- update to version 0.1.6:
  * Restrict what categories get passed through (#358)
  * Deep digging for multi-indexes (#356)
  * allow_empty is the default in >=zstandard-0.9 (#355)
  * Remove setup_requires from setup.py (#345)
  * Fixed error if a certain partition is empty, when writing a
    partitioned (#347)
  * Allow UTF8 column names to be read (#342)
  * readd test file
  * Allow for NULL converted type (#340)
  * Robust partition names (#336)
  * Fix accidental multiindex
  * Read multi indexes (#331)
  * Allow reading from any file-like (#330)
  * change `parquet-format` link to apache repo (#328)
  * Remove extra space from api.py (#325)
  * numba bool fun (#324)
- changes from version 0.1.5:
  * Fix _dtypes to be more efficient, to work with files with lots
    of columns (#318)
  * Buildfix (#313)
  * Use LZ4 block compression for compatibility with parquet-cpp
    (#314) (#315)
  * Fix typo in ParquetFile docstring (#312)
  * Remove annoying print() when reading file with
    CategoricalDtype index (#311)
  * Allow lists of multi-file data-sets (#309)
  * Accelerate dataframe.empty for small/medium sizes (#307)
  * Include dictionary page in column size (#306)
  * Fix for selecting columns which were used for partitioning
    (#304)
  * Remove occurrences of np.fromstring (#303)
  * Add support for zstandard compression (#296)
  * Int96time order (#298)
- changes from version 0.1.4:
  * Add handling of keyword arguments for compressor (#294)
  * Fix setup.py duplication (#295)
  * Integrate pytest with setup.py (#293)
  * Get setup.py pytest to work.
    (#287)
  * Add LZ4 support (#292)
  * Update for forthcoming thrift release (#281)
  * If timezones are in pandas metadata, assign columns as
    required (#285)
  * Pandas import (#284)
  * Copy FMDs instead of mutate (#279)
  * small fixes (#278)
  * fixes to get benchmark to work (#276)
  * backwards compat with Dask
  * Fix test_time_millis on Windows (#275)
  * join paths os-independently (#271)
  * Adds int32 support for object encoding (#268)
  * Fix a couple small typos in documentation (#267)
  * Partition order should be sorted (#265)
  * COMPAT: Update thrift (#264)
  * Speedups result (#253)
  * Remove thrift_copy
  * Define `__copy__` on thrift structures
  * Update rtd deps
- changes from version 0.1.3:
  * More care over append when partitioning multiple columns
  * Sep for windows cats filtering
  * Move pytest imports to tests/ remove requirement
  * Special-case only zeros
  * Cope with partition values like "07"
  * fix for s3
  * Fix for list of paths rooted in the current directory
  * add test
  * Explicit file opens
  * update docstring
  * Refactor partition interpretation
  * py2 fix
  * Error in test changed
  * Better error messages when failed to convert on write
- changes from version 0.1.2:
  * Revert accidental removal of s3 import
  * Move thrift things together, and make thrift serializer for
    pickle
  * COMPAT: for new pandas CategoricalDtype
  * Fixup for backwards seeking.
  * Fix some test failures
  * Prototype version using thrift instead of thriftpy
  * Not all mergers have cats
  * Revert accidental deletion
  * remove warnings
  * Sort keys in json for metadata
  * Check column chunks for categories sizes
  * Account for partition dir names with numbers
  * Fix map/list doc
  * Catch more stats errors
  * Prevent pandas auto-names being given to index
- changes from version 0.1.1:
  * Add workaround for single-value-partition
  * update test
  * Simplify and fix for py2
  * Use thrift encoding on statistics strings
  * remove redundant SNAPPY from supported compressions list
  * Fix statistics
  * lists again
  * Always convert int96 to times
  * Update docs
  * attribute typo
  * Fix definition level
  * Add test, clean columns
  * Allow optional->optional lists and maps
  * Flatten schema to enable loading of non-repeated columns
  * Remove extra file
  * Fix py2
  * Fix "in" filter to cope with strings that could be numbers
  * Allow pip install without NumPy or Cython
- changes from version 0.1.0:
  * Add ParquetFile attribute documentation
  * Fix tests
  * Enable append to an empty dataset
  * More warning words and check on partition_on
  * Do not fail stats if there are no row-groups
  * Fix "numpy_dtype"->"numpy_type"
  * "in" was checking range not exact membership of set
  * If metadata gives index, put in columns
  * Fix pytest warning
  * Fail on ordering dict statistics
  * Fix stats filter
  * clean test
  * Fix ImportWarning on Python 3.6+
  * TEST: added updated test file for special strings used in
    filters
  * fix links
  * [README]: indicate dependency on LLVM 4.0.x.
  * Filter stats had unfortunate converted_type check
  * Ignore exceptions in val_to_num
  * Also for TODAY
  * Very special case for partition: NOW should be kept as string
  * Allow partition_on; fix category nulls
  * Remove old category key/values on writing
  * Implement writing pandas metadata and auto-setting cats/index
  * Pandas compatibility
  * Test and fix for filter on single file
  * Do not attempt to recurse into schema elements with zero
    children

-------------------------------------------------------------------
Thu Jun 7 20:41:31 UTC 2018 - jengelh@inai.de

- Fixup grammar./Replace future aims with what it does now.

-------------------------------------------------------------------
Thu May 3 14:07:08 UTC 2018 - toddrme2178@gmail.com

- Use %license tag

-------------------------------------------------------------------
Thu May 25 12:19:26 UTC 2017 - toddrme2178@gmail.com

- Initial version