Set up a new node
There are a few steps to follow carefully.
Set up the connection
If you've installed a new OS from scratch, you'll need to set up a static IP. To do so:

- Connect to the iDRAC interface. To do so, you must be connected to the private VPN.
- Identify the number of the node (e.g. `node102`, `node125`, etc.). Let's suppose you want to set up `node125`.
- Create a new file in `/etc/netplan` called however you like. Let's take `00-installer-config.yaml` as an example:

  `sudo touch /etc/netplan/00-installer-config.yaml`

- Edit the file by adding the following:

  ```yaml
  network:
    ethernets:
      eno4:
        addresses:
          - 192.168.0.125/24
        gateway4: 192.168.0.52
        nameservers:
          addresses:
            - 151.100.4.2
            - 151.100.4.13
          search:
            - di.uniroma1.it
      ibp129s0:
        addresses:
          - 10.0.0.25/24
    version: 2
  ```

  Pay attention to the `192.168.0.125/24` and `10.0.0.25/24` lines, as those are the only ones you have to tune based on the number of your new node (in this case 125 and 25).

- Reboot the system.

If everything went right, you should be able to connect to the new node using `ssh 192.168.0.125`.

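As an alternative to (or a check before) rebooting, you can apply and verify the configuration in place; a minimal sketch, assuming the interface names used above:

```bash
# validate the new configuration (netplan rolls it back automatically
# if you don't confirm within the timeout), then apply it
sudo netplan try
sudo netplan apply

# check that the addresses ended up on the expected interfaces
ip addr show eno4
ip addr show ibp129s0
```
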
Set up the storage
Setting up the storage should be the very first thing to do when setting up a new node, as the shared storage contains useful scripts that will help you set up every other component.
The storage nodes are currently `node127` and `node128`, whose shared directories are `data1` and `data2` respectively.
These directories are exported using `nfs`.
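To double-check what the storage nodes actually export (assuming the `showmount` utility from `nfs-common` is installed and the hostnames resolve), you can run:

```bash
showmount -e node127
showmount -e node128
```
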
To install `nfs` on a new node, you can do the following:
- Install `nfs` if it isn't already installed:
- Create the new folders to mount:
- Edit the `/etc/fstab` file by adding the following lines:
- Reload `/etc/fstab` with the following command:

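A rough sketch of these four steps, assuming the exports are named after the shared directories and mounted under `/data1` and `/data2` (the real export paths, mount points, and fstab options may differ):

```bash
# 1. install the NFS client utilities
sudo apt install nfs-common

# 2. create the mount points
sudo mkdir -p /data1 /data2

# 3. add the mounts to /etc/fstab (assumed export paths and options;
#    the shared /home mount follows the same pattern and is not shown here)
echo "node127:/data1  /data1  nfs  defaults  0  0" | sudo tee -a /etc/fstab
echo "node128:/data2  /data2  nfs  defaults  0  0" | sudo tee -a /etc/fstab

# 4. reload /etc/fstab and mount everything
sudo systemctl daemon-reload
sudo mount -a
```
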
If everything went right, at the next login you should land in the shared `/home/<user>`.
Update the OS
Currently, every node runs on an Ubuntu LTS distro.
It is pretty straightforward to update Ubuntu by following online tutorials, but you can also do it by running the `update_os.sh` script.
Script to automatically update the OS
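If you prefer to update by hand instead, on a standard Ubuntu LTS installation this boils down to something like the following (the script may do more than this):

```bash
# refresh the package lists and upgrade every installed package
sudo apt update
sudo apt full-upgrade -y

# optionally, move to the next LTS release
sudo do-release-upgrade
```
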
Set up Munge
Munge is needed by Slurm to validate credentials. While Munge is already set up on the controller node, you can install it on another node by using a script.
- Copy the Munge key from `/etc/munge/munge.key` on the controller node to the node you want to set up (an example `scp` invocation is sketched below).
- Run the `install_munge_on_other_node.sh` script on the new node to install Munge and copy the key.

  Script to automatically install Munge on a new node

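A minimal sketch of the key copy and of a verification round trip, assuming SSH access to the new node (the destination path expected by the script is an assumption):

```bash
# on the controller: copy the key to the new node
sudo scp /etc/munge/munge.key <new-node>:/tmp/munge.key

# on the new node, after the install script has run: test Munge locally...
munge -n | unmunge

# ...and against another node of the cluster
munge -n | ssh <other-node> unmunge
```
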
If everything went right, you should see a success message similar to the following:
```
STATUS: Success (0)
ENCODE_HOST: storageserver1 (127.0.1.1)
ENCODE_TIME: 2025-06-11 08:56:11 +0000 (1749632171)
DECODE_TIME: 2025-06-11 08:56:14 +0000 (1749632174)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
```
Set up Slurm
Install Slurm
To install Slurm on a new node you can use the `install_slurm.sh` script by running `sudo bash install_slurm.sh <role>`, where `<role>` is one of `controller`, `compute`, `login`, or `db`.
Usually it will be `compute`.
Script to automatically install Slurm on a new node
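For a typical compute node this is simply:

```bash
sudo bash install_slurm.sh compute
```
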
Update `slurm.conf`
Now it is time to update `/etc/slurm/slurm.conf` on each node to let Slurm know about the new node in the cluster.
To make things easier, it is possible to update just the file on the controller node and then use Ansible to propagate it to all the other nodes.
- Update the Ansible inventory located in `~/ansible/inventory.ini` with the new node.

  The current Ansible inventory

- Update the Slurm configuration file located in `/etc/slurm/slurm.conf` on the controller (an example entry for a new node is sketched after these steps).

  The current `slurm.conf`

  ```
  # slurm.conf file generated by configurator easy.html.
  # Put this file on all nodes of your cluster.
  # See the slurm.conf man page for more information.
  #
  ClusterName=di_hpc_salaria
  SlurmctldHost=node115(10.0.0.15)
  SlurmctldHost=node104(10.0.0.4)
  SlurmctldParameters=enable_configless
  #
  #MailProg=/bin/mail
  #MpiDefault=
  #MpiParams=ports=#-#
  ProctrackType=proctrack/cgroup
  ReturnToService=1
  SlurmctldPidFile=/var/run/slurmctld.pid
  SlurmctldPort=6817
  SlurmdPidFile=/var/run/slurmd.pid
  #SlurmdPort=6818
  SlurmdSpoolDir=/var/slurmproc/slurmd
  SlurmUser=slurm
  #SlurmdUser=root
  StateSaveLocation=/var/slurmproc/slurmctld
  #SwitchType=
  TaskPlugin=task/affinity,task/cgroup
  AuthType=auth/munge
  #
  #
  # TIMERS
  #KillWait=30
  #MinJobAge=300
  #SlurmctldTimeout=120
  #SlurmdTimeout=300
  #
  #
  #
  #
  # LOGGING AND ACCOUNTING
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=10.0.0.5
  AccountingStorageUser=slurm
  AccountingStoragePort=6819
  #JobAcctGatherFrequency=30
  JobAcctGatherType=jobacct_gather/cgroup
  #SlurmctldDebug=info
  SlurmctldLogFile=/var/log/slurmctld.log
  #SlurmdDebug=info
  SlurmdLogFile=/var/log/slurmd.log
  #
  # PRIORITIES and FAIRSHARE
  PriorityType=priority/multifactor
  PriorityWeightFairshare=100000
  PriorityWeightAge=1000
  PriorityWeightQOS=50000
  PriorityWeightJobSize=500
  PriorityWeightPartition=10000
  PriorityDecayHalfLife=7-0
  PriorityCalcPeriod=5:00
  SchedulerType=sched/backfill
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core
  # COMPUTE NODES
  GresTypes=gpu
  NodeName=node103 NodeAddr=10.0.0.3 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node106 NodeAddr=10.0.0.6 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node107 NodeAddr=10.0.0.7 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node108 NodeAddr=10.0.0.8 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node109 NodeAddr=10.0.0.9 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node110 NodeAddr=10.0.0.10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
  NodeName=node111 NodeAddr=10.0.0.11 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node112 NodeAddr=10.0.0.12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node113 NodeAddr=10.0.0.13 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node114 NodeAddr=10.0.0.14 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027676
  NodeName=node116 NodeAddr=10.0.0.16 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node118 NodeAddr=10.0.0.18 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node119 NodeAddr=10.0.0.19 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node120 NodeAddr=10.0.0.20 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node121 NodeAddr=10.0.0.21 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node122 NodeAddr=10.0.0.22 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
  NodeName=node123 NodeAddr=10.0.0.23 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node124 NodeAddr=10.0.0.24 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node125 NodeAddr=192.168.0.125 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node126 NodeAddr=10.0.0.26 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node130 NodeAddr=10.0.0.30 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node132 NodeAddr=10.0.0.32 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node133 NodeAddr=10.0.0.33 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node134 NodeAddr=10.0.0.34 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node135 NodeAddr=10.0.0.35 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node137 NodeAddr=10.0.0.37 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node138 NodeAddr=10.0.0.38 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node139 NodeAddr=10.0.0.39 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node140 NodeAddr=10.0.0.40 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node141 NodeAddr=10.0.0.41 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node142 NodeAddr=10.0.0.42 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node143 NodeAddr=10.0.0.43 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node145 NodeAddr=10.0.0.45 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node146 NodeAddr=10.0.0.46 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node147 NodeAddr=10.0.0.47 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node148 NodeAddr=10.0.0.48 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node149 NodeAddr=10.0.0.49 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node151 NodeAddr=10.0.0.51 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  PartitionName=admin Nodes="" MaxTime=INFINITE State=UP AllowGroups=sudo DefMemPerNode=8192
  PartitionName=department_only Nodes=ALL MaxTime=3-0 State=UP AllowGroups=sudo,department,group_leaders DefMemPerNode=8192
  PartitionName=multicore Nodes=node[106-107] State=UP MaxTime=0-6 DefMemPerNode=1024 MaxNodes=1
  PartitionName=fpga Nodes=node[119] State=UP MaxTime=1-0 MaxNodes=1 ExclusiveUser=YES OverSubscribe=EXCLUSIVE DefMemPerNode=257565
  PartitionName=students Nodes=node[106-111,118,122,145-149,151] State=UP Default=TRUE MaxTime=1-0 DefMemPerNode=1024 MaxMemPerNode=32768 MaxCPUsPerNode=8 MaxNodes=1
  ```

- Propagate `slurm.conf` to every other node in the cluster using the `bash run_ansible_playbook.sh propagate_slurm_conf.yaml` command (a minimal sketch of such a playbook is shown after these steps).

  The current `propagate_slurm_conf.yaml` playbook

- Run `sudo scontrol reconfigure` on the controller node to refresh the configuration on each node.

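For the `slurm.conf` update, a hypothetical entry for a new node could look like the line below; take the actual CPU, socket, and memory values from `slurmd -C` on the new node instead of copying these placeholders, and remember to add the node to any partition that lists its nodes explicitly:

```
# hypothetical example: a new node150 reachable at 10.0.0.50
NodeName=node150 NodeAddr=10.0.0.50 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
```

The real `propagate_slurm_conf.yaml` lives in `~/ansible` and may differ, but a minimal sketch of such a playbook, run from the controller, is essentially a single copy task:

```yaml
# minimal sketch: push the controller's slurm.conf to every host in the inventory
- name: Propagate slurm.conf
  hosts: all
  become: true
  tasks:
    - name: Copy slurm.conf from the controller
      ansible.builtin.copy:
        src: /etc/slurm/slurm.conf
        dest: /etc/slurm/slurm.conf
        mode: "0644"
```
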
Problems with Slurm?
It may happen that Slurm complains about something.
You can debug almost everything by monitoring the status of the `slurmd` daemon with `systemctl status slurmd`.
One thing that may happen is that you've set wrong values for the node's hardware in `slurm.conf`.
Make sure all values are correct.
Another thing that may happen is that the ID of the `slurm` user is invalid.
In any case, restarting the daemon with `systemctl restart slurmd` often solves this kind of problem.

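A few commands that cover most of this debugging (all standard Slurm and systemd tools):

```bash
# daemon status and recent log messages
systemctl status slurmd
journalctl -u slurmd -e

# print the hardware values the node actually detects, to compare with slurm.conf
slurmd -C

# check the numeric ID of the slurm user
id slurm

# restart the daemon
sudo systemctl restart slurmd
```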