#
# spec file for package perl-Algorithm-KMeans
#
# Copyright (c) 2024 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via https://bugs.opensuse.org/
#

%define cpan_name Algorithm-KMeans
Name:           perl-Algorithm-KMeans
Version:        2.50.0
Release:        0
# 2.05 -> normalize -> 2.50.0
%define cpan_version 2.05
License:        Artistic-1.0 OR GPL-1.0-or-later
Summary:        Perl Module for K-Means Clustering
URL:            https://metacpan.org/release/%{cpan_name}
Source0:        https://cpan.metacpan.org/authors/id/A/AV/AVIKAK/%{cpan_name}-%{cpan_version}.tar.gz
Source100:      README.md
BuildArch:      noarch
BuildRequires:  perl
BuildRequires:  perl-macros
BuildRequires:  perl(Graphics::GnuplotIF) >= 1.600.0
BuildRequires:  perl(Math::GSL) >= 0.320.0
BuildRequires:  perl(Math::Random) >= 0.710.0
Requires:       perl(Graphics::GnuplotIF) >= 1.600.0
Requires:       perl(Math::GSL) >= 0.320.0
Requires:       perl(Math::Random) >= 0.710.0
Provides:       perl(Algorithm::KMeans) = %{version}
%undefine       __perllib_provides
%{perl_requires}

%description
Clustering with K-Means takes place iteratively and involves two steps:
1) assignment of data samples to clusters on the basis of how far the
data samples are from the cluster centers; and 2) recalculation of the
cluster centers (and the cluster covariances, if you are using the
Mahalanobis distance metric for clustering). Obviously, before this
two-step approach can proceed, the cluster centers must be initialized.
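The two-step iteration described above can be sketched as follows (an
illustrative Python sketch of plain Euclidean K-Means, not the module's
Perl implementation):

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain K-Means: alternate the assignment and recalculation steps
    until no point changes cluster (the stopping criterion used here)."""
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # Step 1: assign each point to its nearest center.
        for i, p in enumerate(points):
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(p, centers[k]))
            if assignment[i] != nearest:
                assignment[i] = nearest
                changed = True
        if not changed:
            break  # no re-assignment during the assignment step: done
        # Step 2: recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [p for i, p in enumerate(points) if assignment[i] == k]
            if members:
                centers[k] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment

# Two well-separated 2-D blobs, seeded with one center in each.
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centers, labels = kmeans(pts, [(0.0, 0.0), (5.0, 5.0)])
```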
How this initialization is carried out is important. The module gives
you two very different ways of carrying it out. One option, called the
'smart' option, consists of subjecting the data to principal components
analysis to discover the direction of maximum variance in the data
space. The data points are then projected onto this direction and a
histogram is constructed from the projections. Centers of the smoothed
histogram are used to seed the clustering operation. The other option is
to choose the cluster centers purely randomly. You get the first option
if you set 'cluster_seeding' to 'smart' in the constructor, and the
second if you set it to 'random'.

How to specify the number of clusters, 'K', is one of the most vexing
issues in any approach to clustering. In some cases, 'K' can be set on
the basis of prior knowledge; more often than not, though, no such
knowledge is available. When the programmer does not explicitly specify
a value for 'K', the current implementation tries all possible values
between 2 and the largest value that makes statistical sense, and then
chooses the value of 'K' that yields the best value for the QoC (Quality
of Clustering) metric. It is generally believed that the largest value
for 'K' should not exceed 'sqrt(N/2)', where 'N' is the number of data
samples to be clustered.

What to use for the QoC metric is obviously a critical issue in itself.
In the current implementation, the value of QoC is the ratio of the
average radius of the clusters to the average distance between the
cluster centers.

Every iterative algorithm requires a stopping criterion. The criterion
implemented here is to stop iterating when no data point is re-assigned
during the assignment step. Ordinarily, the output produced by a K-Means
clusterer corresponds to a local minimum of the QoC metric, as opposed
to a global minimum.
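The K-selection and QoC heuristics described above amount to roughly the
following (a hedged Python sketch of the documented behavior; the
function names are illustrative, not the module's API, and the module's
actual internals may differ):

```python
import math
from itertools import combinations

def quality_of_clustering(points, centers, assignment):
    """QoC as documented: average cluster radius divided by the average
    distance between cluster centers (smaller is better)."""
    # Average radius: mean distance of each point from its own center.
    avg_radius = sum(math.dist(p, centers[assignment[i]])
                     for i, p in enumerate(points)) / len(points)
    # Average separation: mean pairwise distance between the centers.
    pairs = list(combinations(centers, 2))
    avg_separation = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return avg_radius / avg_separation

def candidate_K_values(n_samples):
    """When K is unspecified, values from 2 up to the conventional
    ceiling sqrt(N/2) are tried, and the best-QoC value wins."""
    return range(2, int(math.sqrt(n_samples / 2)) + 1)
```

For example, with 200 samples the candidate values of 'K' run from 2
through 10, and a tight two-cluster configuration yields a small QoC.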
The current implementation protects against that when the constructor is
called with the 'random' option for 'cluster_seeding': it tries several
different randomly selected sets of initial cluster centers and then
selects the one that gives the best overall QoC value.

A K-Means clusterer will generally produce good results if the overlap
between the clusters is minimal and if each cluster exhibits variability
that is uniform in all directions. When the data variability differs
along the different directions in the data space, the results obtained
with a K-Means clusterer may be improved by first normalizing the data
appropriately, as this module can do when you set the
'do_variance_normalization' option in the constructor. However, as
pointed out elsewhere in this documentation, such normalization may
actually degrade the performance of the clusterer if the overall data
variability along a dimension is more a result of the separation between
the means than a consequence of intra-cluster variability.

%prep
%autosetup -n %{cpan_name}-%{cpan_version}
find . -type f ! -path "*/t/*" ! -name "*.pl" ! -path "*/bin/*" ! -path "*/script/*" ! -path "*/scripts/*" ! -name "configure" -print0 | xargs -0 chmod 644

%build
perl Makefile.PL INSTALLDIRS=vendor
%make_build

%check
make test

%install
%perl_make_install
%perl_process_packlist
%perl_gen_filelist

%files -f %{name}.files
%doc examples README

%changelog