#
# spec file for package python-tokenizers
#
# Copyright (c) 2024 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.
#
# Please submit bugfixes or comments via https://bugs.opensuse.org/
#


%{?!python_module:%define python_module() python-%{**} python3-%{**}}
Name:           python-tokenizers
Version:        0.19.1
Release:        0
Summary:        Provides an implementation of today's most used tokenizers
License:        Apache-2.0
URL:            https://github.com/huggingface/tokenizers
Source0:        https://github.com/huggingface/tokenizers/archive/refs/tags/v%{version}.tar.gz#/tokenizers-%{version}.tar.gz
Source1:        vendor.tar.gz
BuildRequires:  %{python_module devel}
BuildRequires:  %{python_module maturin}
BuildRequires:  %{python_module pip}
BuildRequires:  %{python_module setuptools}
BuildRequires:  cargo-packaging
BuildRequires:  gcc-c++
BuildRequires:  fdupes
BuildRequires:  python-rpm-macros
%python_subpackages

%description
Provides an implementation of today's most used tokenizers, with a focus on
performance and versatility.

* Train new vocabularies and tokenize, using today's most used tokenizers.
* Extremely fast (both training and tokenization), thanks to the Rust
  implementation. Takes less than 20 seconds to tokenize a GB of text on a
  server's CPU.
* Easy to use, but also extremely versatile.
* Designed for research and production.
* Normalization comes with alignments tracking. It's always possible to get
  the part of the original sentence that corresponds to a given token.
* Does all the pre-processing: Truncate, Pad, add the special tokens your
  model needs.

%prep
%autosetup -p1 -n tokenizers-%{version}
cd bindings/python
tar xzf %{S:1}

%build
cd bindings/python
%pyproject_wheel

%install
cd bindings/python
%pyproject_install
%python_expand %fdupes %{buildroot}%{$python_sitearch}

%check

%files %{python_files}
%license LICENSE
%doc README.md
%{python_sitearch}/tokenizers*

%changelog