#
# spec file for package perl-Algorithm-KMeans
#
# Copyright (c) 2024 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via https://bugs.opensuse.org/
#


%define cpan_name Algorithm-KMeans
Name: perl-Algorithm-KMeans
Version: 2.50.0
Release: 0
# 2.05 -> normalize -> 2.50.0
%define cpan_version 2.05
License: Artistic-1.0 OR GPL-1.0-or-later
Summary: Perl Module for K-Means Clustering
URL: https://metacpan.org/release/%{cpan_name}
Source0: https://cpan.metacpan.org/authors/id/A/AV/AVIKAK/%{cpan_name}-%{cpan_version}.tar.gz
Source100: README.md
BuildArch: noarch
BuildRequires: perl
BuildRequires: perl-macros
BuildRequires: perl(Graphics::GnuplotIF) >= 1.600.0
BuildRequires: perl(Math::GSL) >= 0.320.0
BuildRequires: perl(Math::Random) >= 0.710.0
Requires: perl(Graphics::GnuplotIF) >= 1.600.0
Requires: perl(Math::GSL) >= 0.320.0
Requires: perl(Math::Random) >= 0.710.0
Provides: perl(Algorithm::KMeans) = %{version}
%undefine __perllib_provides
%{perl_requires}

%description
Clustering with K-Means takes place iteratively and involves two steps: 1)
assignment of data samples to clusters on the basis of how far the data
samples are from the cluster centers; and 2) recalculation of the cluster
centers (and cluster covariances if you are using the Mahalanobis distance
metric for clustering).

Obviously, before the two-step approach can proceed, we need to initialize
the cluster centers. How this initialization is carried out is
important. The module gives you two very different ways for carrying out
this initialization. One option, called the 'smart' option, consists of
subjecting the data to principal components analysis to discover the
direction of maximum variance in the data space. The data points are then
projected onto this direction and a histogram is constructed from the
projections. Centers of the smoothed histogram are used to seed the
clustering operation. The other option is to choose the cluster centers
purely randomly. You get the first option if you set 'cluster_seeding' to
'smart' in the constructor, and you get the second option if you set it to
'random'.

How to specify the number of clusters, 'K', is one of the most vexing
issues in any approach to clustering. In some cases, we can set 'K' on the
basis of prior knowledge. But, more often than not, no such prior knowledge
is available. When the programmer does not explicitly specify a value for
'K', the approach taken in the current implementation is to try all
possible values between 2 and some largest possible value that makes
statistical sense. We then choose that value for 'K' which yields the best
value for the QoC (Quality of Clustering) metric. It is generally believed
that the largest value for 'K' should not exceed 'sqrt(N/2)' where 'N' is
the number of data samples to be clustered.

What to use for the QoC metric is obviously a critical issue unto itself.
In the current implementation, the value of QoC is the ratio of the average
radius of the clusters and the average distance between the cluster
centers.

Every iterative algorithm requires a stopping criterion. The criterion
implemented here is that we stop iterations when there is no re-assignment
of the data points during the assignment step.

Ordinarily, the output produced by a K-Means clusterer will correspond to a
local minimum for the QoC values, as opposed to a global minimum. The
current implementation protects against that when the module constructor is
called with the 'random' option for 'cluster_seeding' by trying different
randomly selected initial cluster centers and then selecting the one that
gives the best overall QoC value.

A K-Means clusterer will generally produce good results if the overlap
between the clusters is minimal and if each cluster exhibits variability
that is uniform in all directions. When the data variability is different
along the different directions in the data space, the results you obtain
with a K-Means clusterer may be improved by first normalizing the data
appropriately, as can be done in this module when you set the
'do_variance_normalization' option in the constructor. However, as pointed
out elsewhere in this documentation, such normalization may actually
decrease the performance of the clusterer if the overall data variability
along any dimension is more a result of separation between the means than a
consequence of intra-cluster variability.

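%prep
%autosetup -n %{cpan_name}-%{cpan_version}

# Reset permissions to 644 on all files except tests, *.pl files, anything
# under bin/, script/ or scripts/, and files named "configure"
find . -type f ! -path "*/t/*" ! -name "*.pl" ! -path "*/bin/*" ! -path "*/script/*" ! -path "*/scripts/*" ! -name "configure" -print0 | xargs -0 chmod 644
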
%build
perl Makefile.PL INSTALLDIRS=vendor
%make_build

%check
make test

%install
%perl_make_install
%perl_process_packlist
%perl_gen_filelist

%files -f %{name}.files
%doc examples README

%changelog