#
# spec file for package perl-Algorithm-KMeans
#
# Copyright (c) 2024 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.
# Please submit bugfixes or comments via https://bugs.opensuse.org/
#
%define cpan_name Algorithm-KMeans
Name:           perl-Algorithm-KMeans
Version:        2.50.0
Release:        0
# 2.05 -> normalize -> 2.50.0
%define cpan_version 2.05
License:        Artistic-1.0 OR GPL-1.0-or-later
Summary:        Perl Module for K-Means Clustering
URL:            https://metacpan.org/release/%{cpan_name}
Source0:        https://cpan.metacpan.org/authors/id/A/AV/AVIKAK/%{cpan_name}-%{cpan_version}.tar.gz
Source100:      README.md
BuildArch:      noarch
BuildRequires:  perl
BuildRequires:  perl-macros
BuildRequires:  perl(Graphics::GnuplotIF) >= 1.600.0
BuildRequires:  perl(Math::GSL) >= 0.320.0
BuildRequires:  perl(Math::Random) >= 0.710.0
Requires:       perl(Graphics::GnuplotIF) >= 1.600.0
Requires:       perl(Math::GSL) >= 0.320.0
Requires:       perl(Math::Random) >= 0.710.0
Provides:       perl(Algorithm::KMeans) = %{version}
%undefine __perllib_provides
%{perl_requires}
%description
Clustering with K-Means takes place iteratively and involves two steps: 1)
assignment of data samples to clusters on the basis of how far the data
samples are from the cluster centers; and 2) recalculation of the cluster
centers (and cluster covariances if you are using the Mahalanobis distance
metric for clustering).
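In symbols, one iteration can be sketched as follows (here 'c_j' denotes the
j-th cluster center, 'S_j' the set of samples currently assigned to it, and
'dist' the Euclidean or, optionally, the Mahalanobis distance):

  assignment:      label(x) = argmin over j of dist(x, c_j)
  recalculation:   c_j      = mean of all x in S_j
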
Obviously, before the two-step approach can proceed, we need to initialize
the cluster centers. How this initialization is carried out is
important. The module gives you two very different ways for carrying out
this initialization. One option, called the 'smart' option, consists of
subjecting the data to principal components analysis to discover the
direction of maximum variance in the data space. The data points are then
projected on to this direction and a histogram constructed from the
projections. Centers of the smoothed histogram are used to seed the
clustering operation. The other option is to choose the cluster centers
purely randomly. You get the first option if you set 'cluster_seeding' to
'smart' in the constructor, and you get the second option if you set it to
'random'.
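As a minimal usage sketch (the data file name and column mask below are
placeholders; consult the module documentation for the authoritative
constructor interface):

  use Algorithm::KMeans;
  my $clusterer = Algorithm::KMeans->new(
      datafile        => 'mydata.csv',  # placeholder CSV data file
      mask            => 'N111',        # placeholder: label column + 3 numeric columns
      K               => 3,
      cluster_seeding => 'smart',       # or 'random'
      terminal_output => 1,
  );
  $clusterer->read_data_from_file();
  $clusterer->kmeans();
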
How to specify the number of clusters, 'K', is one of the most vexing
issues in any approach to clustering. In some cases, we can set 'K' on the
basis of prior knowledge. But, more often than not, no such prior knowledge
is available. When the programmer does not explicitly specify a value for
'K', the approach taken in the current implementation is to try all
possible values between 2 and some largest possible value that makes
statistical sense. We then choose that value for 'K' which yields the best
value for the QoC (Quality of Clustering) metric. It is generally believed
that the largest value for 'K' should not exceed 'sqrt(N/2)' where 'N' is
the number of data samples to be clustered.
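For example, with 'N' = 200 samples that upper bound works out to
sqrt(200/2) = sqrt(100) = 10. A sketch of letting the module pick 'K' itself,
assuming that passing K => 0 requests automatic selection (an assumption to
verify against the module documentation):

  use Algorithm::KMeans;
  my $clusterer = Algorithm::KMeans->new(
      datafile        => 'mydata.csv',  # placeholders, as in the sketch above
      mask            => 'N111',
      K               => 0,             # assumed: 0 lets the module determine the best K
      cluster_seeding => 'random',
      terminal_output => 1,
  );
  $clusterer->read_data_from_file();
  $clusterer->kmeans();
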
What to use for the QoC metric is obviously a critical issue unto itself.
In the current implementation, the value of QoC is the ratio of the average
radius of the clusters to the average distance between the cluster
centers.
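For instance, if the clusters have an average radius of 1.5 and their centers
lie, on average, 6.0 apart, the QoC comes out to 1.5 / 6.0 = 0.25; on this
definition, a smaller ratio corresponds to compact, well-separated clusters.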
Every iterative algorithm requires a stopping criterion. The criterion
implemented here is that we stop iterations when there is no re-assignment
of the data points during the assignment step.
Ordinarily, the output produced by a K-Means clusterer will correspond to a
local minimum for the QoC values, as opposed to a global minimum. The
current implementation protects against that when the module constructor is
called with the 'random' option for 'cluster_seeding' by trying different
randomly selected initial cluster centers and then selecting the one that
gives the best overall QoC value.
A K-Means clusterer will generally produce good results if the overlap
between the clusters is minimal and if each cluster exhibits variability
that is uniform in all directions. When the data variability is different
along the different directions in the data space, the results you obtain
with a K-Means clusterer may be improved by first normalizing the data
appropriately, as can be done in this module when you set the
'do_variance_normalization' option in the constructor. However, as pointed
out elsewhere in this documentation, such normalization may actually
decrease the performance of the clusterer if the overall data variability
along any dimension is more a result of separation between the means than a
consequence of intra-cluster variability.
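As a concrete illustration of that caveat: if one coordinate varies only
slightly within each cluster but the cluster means lie far apart along it,
most of its overall variance comes from the separation between the means, and
rescaling that coordinate to unit variance (which is presumably what
'do_variance_normalization' does) shrinks exactly the direction that
separates the clusters.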
%prep
%autosetup -n %{cpan_name}-%{cpan_version}
find . -type f ! -path "*/t/*" ! -name "*.pl" ! -path "*/bin/*" ! -path "*/script/*" ! -path "*/scripts/*" ! -name "configure" -print0 | xargs -0 chmod 644
%build
perl Makefile.PL INSTALLDIRS=vendor
%make_build
%check
make test
%install
%perl_make_install
%perl_process_packlist
%perl_gen_filelist
%files -f %{name}.files
%doc examples README
%changelog