slurm/README_Testsuite.md
Egbert Eich fc209e050f - updated to new release 24.05.0 with following major changes
- IMPORTANT NOTES:
  If using the slurmdbd (Slurm DataBase Daemon) you must update
  this first.  NOTE: If using a backup DBD you must start the
  primary first to do any database conversion, the backup will not
  start until this has happened.  The 24.05 slurmdbd will work
  with Slurm daemons of version 23.02 and above.  You will not
  need to update all clusters at the same time, but it is very
  important to update slurmdbd first and having it running before
  updating any other clusters making use of it.
- HIGHLIGHTS
  * Federation - allow client command operation when slurmdbd is
    unavailable.
  * burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in
    and slurm_bb_test_data_out. The syntax and use of the new hooks
    are documented in etc/burst_buffer.lua.example. These are
    required to exist. slurmctld now checks on startup if the
    burst_buffer.lua script loads and contains all required hooks;
    slurmctld will exit with a fatal error if this is not
    successful. Added PollInterval to burst_buffer.conf. Removed
    the arbitrary limit of 512 copies of the script running
    simultaneously.
  * Add QOS limit MaxTRESRunMinsPerAccount. 
  * Add QOS limit MaxTRESRunMinsPerUser.
  * Add ELIGIBLE environment variable to jobcomp/script plugin.
  * Always use the QOS name for SLURM_JOB_QOS environment variables.
    Previously the batch environment would use the description field,
    which was usually equivalent to the name. 
  * cgroup/v2 - Require dbus-1 version >= 1.11.16.
  * Allow NodeSet names to be used in SuspendExcNodes.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=294
2024-10-14 10:03:00 +00:00


# Running the Slurm 'expect' Testsuite
The ```slurm-testsuite``` package contains the Slurm expect test suite.
This package is meant to be installed on a test setup only; it should
NEVER BE INSTALLED ON A REGULAR OR EVEN PRODUCTION SYSTEM.
SUSE uses this package to determine regressions and for quality assurance.
The results are monitored and evaluated regularly in house.
A specific configuration is required to run this test suite; this document
describes the steps needed.
A small subset of tests is currently failing. The reasons are yet to be
determined.
Please do not file bug reports based on test results!
The testsuite is preconfigured to work with 4 nodes: ```node01```,...,
```node04```. ```node01``` serves as control and compute node. The slurm
configuration, home, and the test suite are shared across the nodes.
The test suite should be mounted under /home (to make ```sgather``` work
correctly).
For tests involving MPI this test suite currently uses OpenMPI version 4.
## Install and set up the Base System
1. Prepare an image with a minimal text mode installation.
2. Install, enable and start sshd and make sure root is able to log in
without a password across all nodes.
```
# zypper install openssh-server openssh-clients
# systemctl enable --now sshd
# ssh-keygen -t rsa -f .ssh/id_rsa -N ''
# cat .ssh/id_rsa.pub >> .ssh/authorized_keys
```
3. Create a test user 'auser' and allow ssh from/to root:
```
# useradd -m auser
# cp -r /root/.ssh /home/auser
# chown -R auser: /home/auser/.ssh
```
4. Set up a persistent network interface to obtain the network address and
hostname through DHCP:
```
# echo 'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", '\
'ATTR{address}=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="1",'\
' KERNEL=="?*", NAME="lan0"' >> /etc/udev/rules.d/70-persistent-net.rules
# cat > /etc/sysconfig/network/ifcfg-lan0 <<EOF
BOOTPROTO='dhcp'
MTU=''
REMOTE_IPADDR=''
STARTMODE='onboot'
EOF
# sed -i 's/DHCLIENT_SET_HOSTNAME="no"/DHCLIENT_SET_HOSTNAME="yes"/' \
/etc/sysconfig/network/dhcp
```
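
The network settings from step 4 can be sanity-checked before the image is cloned. The following is a minimal sketch and not part of the package: the function name `check_base_network` is hypothetical, and the default paths are simply the ones used in the commands above (both can be overridden through arguments for a dry run).

```shell
#!/bin/sh
# Hypothetical helper, not shipped with slurm-testsuite.
# Usage: check_base_network [RULES_FILE] [DHCP_FILE]
# Succeeds only if the udev rules file names an interface lan0 and the
# DHCP client is allowed to set the hostname.
check_base_network() {
    rules=${1:-/etc/udev/rules.d/70-persistent-net.rules}
    dhcp=${2:-/etc/sysconfig/network/dhcp}
    rc=0
    if ! grep -q 'NAME="lan0"' "$rules" 2>/dev/null; then
        echo "missing lan0 rule in $rules"
        rc=1
    fi
    if ! grep -q '^DHCLIENT_SET_HOSTNAME="yes"' "$dhcp" 2>/dev/null; then
        echo "DHCLIENT_SET_HOSTNAME is not \"yes\" in $dhcp"
        rc=1
    fi
    return "$rc"
}
```

Sourcing the snippet and calling `check_base_network` without arguments checks the live files; a non-zero return code together with one message per failed check indicates which setting to revisit.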
## Install and set up the Slurm specific Environment
1. Install package slurm-testsuite.
2. Set up, enable & start mariadb, add slurm accounting database:
```
# sed -i -e "/^bind-address/s@\(^.*$\)@# \1@" /etc/my.cnf
# systemctl enable --now mariadb
# mysql -uroot -e "create user 'slurm'@'node01' identified by 'linux';"
# mysql -uroot -e "create database slurm_acct_db;"
# mysql -uroot -e "grant all on slurm_acct_db.* TO 'slurm'@'node01';"
```
3. Set up the shared home, test suite and Slurm config directories, then
install and enable the NFS kernel server:
```
# mkdir -p /srv/home
# mv /home/auser /srv/home
# cat >> /etc/exports <<EOF
/srv/home *(rw,no_subtree_check,sync,no_root_squash)
/srv/slurm-testsuite *(rw,no_subtree_check,sync,no_root_squash)
/srv/slurm-testsuite/shared *(rw,no_subtree_check,sync,no_root_squash)
/srv/slurm-testsuite/config *(rw,no_subtree_check,sync,no_root_squash)
EOF
# cat >> /etc/fstab <<EOF
node01:/srv/home /home nfs sync,hard,rw 0 0
node01:/srv/slurm-testsuite/config /etc/slurm nfs sync,hard,rw 0 0
node01:/srv/slurm-testsuite/shared /var/lib/slurm/shared nfs sync,hard,rw 0 0
node01:/srv/slurm-testsuite /home/slurm-testsuite nfs sync,hard,rw 0 0
EOF
# zypper install nfs-kernel-server
# systemctl enable nfs-server
```
4. Enable munge and slurmd:
```
# systemctl enable munge
# systemctl enable slurmd
```
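
Because every cloned node relies on the four NFS entries added to `/etc/fstab` in step 3, it may be worth confirming that they are all present before halting the system. This is a sketch under the assumption that the entries look exactly like the ones above (exported from `node01`, type `nfs`); the function name `check_fstab_mounts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with slurm-testsuite.
# Usage: check_fstab_mounts [FSTAB]
# Prints every expected NFS mount point that has no entry in FSTAB and
# returns non-zero if at least one is missing.
check_fstab_mounts() {
    fstab=${1:-/etc/fstab}
    missing=0
    for mp in /home /etc/slurm /var/lib/slurm/shared /home/slurm-testsuite; do
        # Field 1 must be an export of node01, field 2 the mount point,
        # field 3 the file system type nfs.
        if ! awk -v mp="$mp" '
                $1 ~ /^node01:/ && $2 == mp && $3 == "nfs" { found = 1 }
                END { exit !found }' "$fstab" 2>/dev/null; then
            echo "no NFS entry for $mp in $fstab"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_fstab_mounts` with no argument inspects `/etc/fstab`; passing another path allows checking a copy of the file.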
## Clone Nodes and bring up Test System
1. Now halt the system and duplicate it 3 times.
2. Set up the DHCP server and make sure the nodes receive the
hostnames ```node01```,..., ```node04```.
3. Boot all 4 nodes (start with ```node01```).
4. On ```node01```, log in as ```root``` and run ```setup-testsuite.sh```:
```
# ./setup-testsuite.sh
```
5. Load the environment and run the tests as user 'slurm':
```
# sudo -s -u slurm
$ module load gnu openmpi
$ cd /home/slurm-testsuite/testsuite/expect
$ ./regression.py
```
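
Before starting a regression run, it can help to confirm that all four nodes came up with the expected hostnames and accept key-based logins. The sketch below is not part of the package: `check_nodes` and the `SSH_CMD` override variable are hypothetical names, and the default command assumes the passwordless ssh setup described earlier.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with slurm-testsuite.
# Usage: check_nodes [NODE...]
# Tries a non-interactive login on each node (default: node01..node04)
# and prints one status line per node. Returns non-zero if any node fails.
# SSH_CMD may be set to replace the ssh invocation, e.g. for a dry run.
check_nodes() {
    [ "$#" -gt 0 ] || set -- node01 node02 node03 node04
    failed=0
    for n in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5} "$n" true \
                </dev/null 2>/dev/null; then
            echo "$n: ok"
        else
            echo "$n: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Running `check_nodes` from ```node01``` as the user who will launch the tests prints one line per node; any `FAILED` line points at a node that is down, misnamed by DHCP, or missing the ssh key.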
There are a number of tests which require a different configuration
and thus will be skipped.
For a number of these, the alternatives are documented in the config
file shipped with this package.
A small number of tests fail for yet unknown reasons.
Also, when run sequentially, some tests may fail intermittently as the
test suite is not race free. Often the reason for this is that tests
try to determine the availability of resources and may behave incorrectly
if an insufficient number is marked 'idle'. This problem may be less
pronounced when more resources (nodes) are available. Usually, these
issues will not show when tests are run manually. Therefore, it is important
to re-check failed tests manually.
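
A manual re-check could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes that each test in the `expect` directory is directly executable and exits with a non-zero status on failure. The first helper counts idle nodes from `sinfo` output so that a re-run can be postponed while resources are still busy; the second repeats the named tests and reports how often each one passed.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with slurm-testsuite.

# Reads "NODE STATE" lines on stdin (as printed by
# `sinfo -h -N -o '%N %T'`) and prints the number of distinct nodes
# whose state is exactly "idle".
count_idle_nodes() {
    awk '$2 == "idle" { seen[$1] = 1 }
         END { n = 0; for (k in seen) n++; print n }'
}

# Usage: recheck_tests DIR RUNS TEST...
# Runs each named test script in DIR exactly RUNS times and prints
# "<test>: <passes>/<RUNS> passed" for each of them.
recheck_tests() {
    dir=$1
    runs=$2
    shift 2
    for t in "$@"; do
        passes=0
        i=0
        while [ "$i" -lt "$runs" ]; do
            # Run in a subshell so the caller's working directory is kept.
            if (cd "$dir" && "./$t") >/dev/null 2>&1; then
                passes=$((passes + 1))
            fi
            i=$((i + 1))
        done
        echo "$t: $passes/$runs passed"
    done
}
```

For instance, `sinfo -h -N -o '%N %T' | count_idle_nodes` shows how many nodes are currently idle, and `recheck_tests /home/slurm-testsuite/testsuite/expect 3 test1.1` repeats a single test three times (the test name here is only an example). A test that passes in every manual run but failed in the sequential run is a likely candidate for the race conditions described above.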