Set up a new node
There are a few steps to follow carefully.
Set up the connection
If you've installed a new OS from scratch, you'll need to set up a static IP. To do so:

- Connect to the iDRAC interface. To do so, you must be connected to the private VPN.
- Identify the number of the node (e.g. `node102`, `node125`, etc.). Let's suppose you want to set up `node125`.
- Create a new file in `/etc/netplan` called however you like. Let's take `00-installer-config.yaml` as an example:

  `sudo touch /etc/netplan/00-installer-config.yaml`

- Edit the file by adding the following:

  ```yaml
  network:
    ethernets:
      eno4:
        addresses:
          - 192.168.0.125/24
        gateway4: 192.168.0.52
        nameservers:
          addresses:
            - 151.100.4.2
            - 151.100.4.13
          search:
            - di.uniroma1.it
      ibp129s0:
        addresses:
          - 10.0.0.25/24
    version: 2
  ```

  Pay attention to the `192.168.0.125/24` and `10.0.0.25/24` lines, as those are the only ones you have to tune based on the number of your new node (in this case 125 and 25).

- Reboot the system.

If everything went right, you should be able to connect to the new node using `ssh 192.168.0.125`.

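As an alternative to (or a check before) rebooting, you can apply and verify the configuration in place; a minimal sketch, assuming the interface names used above:

```bash
# validate the new configuration (netplan rolls it back automatically
# if you don't confirm within the timeout), then apply it
sudo netplan try
sudo netplan apply

# check that the addresses ended up on the expected interfaces
ip addr show eno4
ip addr show ibp129s0
```
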
Set up the storage
Setting up the storage should be the very first thing to do when setting up a new node, as the shared storage contains useful scripts that will help you set up every other component.
The storage nodes are currently `node127` and `node128`, whose shared directories are `data1` and `data2` respectively.
These directories are exported using `nfs`.
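To double-check what the storage nodes actually export (assuming the `showmount` utility from `nfs-common` is installed and the hostnames resolve), you can run:

```bash
showmount -e node127
showmount -e node128
```
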
To install `nfs` on a new node, you can do the following:
- Install `nfs` if it isn't already installed:
- Create the new folders to mount:
- Edit the `/etc/fstab` file by adding the following lines:
- Reload `/etc/fstab` with the following command:

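A rough sketch of these four steps, assuming the exports are named after the shared directories and mounted under `/data1` and `/data2` (the real export paths, mount points, and fstab options may differ):

```bash
# 1. install the NFS client utilities
sudo apt install nfs-common

# 2. create the mount points
sudo mkdir -p /data1 /data2

# 3. add the mounts to /etc/fstab (assumed export paths and options;
#    the shared /home mount follows the same pattern and is not shown here)
echo "node127:/data1  /data1  nfs  defaults  0  0" | sudo tee -a /etc/fstab
echo "node128:/data2  /data2  nfs  defaults  0  0" | sudo tee -a /etc/fstab

# 4. reload /etc/fstab and mount everything
sudo systemctl daemon-reload
sudo mount -a
```
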
If everything went right, at the next login you should land in the shared `/home/<user>`.
Update the OS
Currently, every node runs on an Ubuntu LTS distro.
It is pretty straightforward to update Ubuntu by following online tutorials, but you can also do it by running the `update_os.sh` script.
Script to automatically update the OS
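If you prefer to update by hand instead, on a standard Ubuntu LTS installation this boils down to something like the following (the script may do more than this):

```bash
# refresh the package lists and upgrade every installed package
sudo apt update
sudo apt full-upgrade -y

# optionally, move to the next LTS release
sudo do-release-upgrade
```
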
Set up Munge
Munge is needed by Slurm to validate credentials. While Munge is already set up on the controller node, you can install it on another node by using a script.
- Copy the Munge key from `/etc/munge/munge.key` on the controller node to the node you want to set up (an example `scp` invocation is sketched below).
- Run the `install_munge_on_other_node.sh` script on the new node to install Munge and copy the key.

  Script to automatically install Munge on a new node

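A minimal sketch of the key copy and of a verification round trip, assuming SSH access to the new node (the destination path expected by the script is an assumption):

```bash
# on the controller: copy the key to the new node
sudo scp /etc/munge/munge.key <new-node>:/tmp/munge.key

# on the new node, after the install script has run: test Munge locally...
munge -n | unmunge

# ...and against another node of the cluster
munge -n | ssh <other-node> unmunge
```
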
If everything went right, you should see a success message similar to the following:
```
STATUS: Success (0)
ENCODE_HOST: storageserver1 (127.0.1.1)
ENCODE_TIME: 2025-06-11 08:56:11 +0000 (1749632171)
DECODE_TIME: 2025-06-11 08:56:14 +0000 (1749632174)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
```
Set up Slurm
Install Slurm
To install Slurm on a new node you can use the `install_slurm.sh` script by running `sudo bash install_slurm.sh <role>`, where `<role>` is one of `controller`, `compute`, `login`, or `db`.
Usually it will be `compute`.
Script to automatically install Slurm on a new node
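For a typical compute node this is simply:

```bash
sudo bash install_slurm.sh compute
```
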
Update `slurm.conf`
Now it is time to update `/etc/slurm/slurm.conf` on each node to let Slurm know about the new node in the cluster.
To make things easier, it is possible to update just the file on the controller node and then use Ansible to propagate it to all the other nodes.
- Update the Ansible inventory located in `~/ansible/inventory.ini` with the new node.

  The current Ansible inventory

- Update the Slurm configuration file located in `/etc/slurm/slurm.conf` on the controller (an example entry for a new node is sketched after these steps).

  The current `slurm.conf`

  ```
  # slurm.conf file generated by configurator easy.html.
  # Put this file on all nodes of your cluster.
  # See the slurm.conf man page for more information.
  #
  ClusterName=di_hpc_salaria
  SlurmctldHost=node115(10.0.0.15)
  SlurmctldHost=node104(10.0.0.4)
  SlurmctldParameters=enable_configless
  #
  #MailProg=/bin/mail
  #MpiDefault=
  #MpiParams=ports=#-#
  ProctrackType=proctrack/cgroup
  ReturnToService=1
  SlurmctldPidFile=/var/run/slurmctld.pid
  SlurmctldPort=6817
  SlurmdPidFile=/var/run/slurmd.pid
  #SlurmdPort=6818
  SlurmdSpoolDir=/var/slurmproc/slurmd
  SlurmUser=slurm
  #SlurmdUser=root
  StateSaveLocation=/var/slurmproc/slurmctld
  #SwitchType=
  TaskPlugin=task/affinity,task/cgroup
  AuthType=auth/munge
  #
  #
  # TIMERS
  #KillWait=30
  #MinJobAge=300
  #SlurmctldTimeout=120
  #SlurmdTimeout=300
  #
  #
  #
  #
  # LOGGING AND ACCOUNTING
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=10.0.0.5
  AccountingStorageUser=slurm
  AccountingStoragePort=6819
  #JobAcctGatherFrequency=30
  JobAcctGatherType=jobacct_gather/cgroup
  #SlurmctldDebug=info
  SlurmctldLogFile=/var/log/slurmctld.log
  #SlurmdDebug=info
  SlurmdLogFile=/var/log/slurmd.log
  #
  # PRIORITIES and FAIRSHARE
  PriorityType=priority/multifactor
  PriorityWeightFairshare=100000
  PriorityWeightAge=1000
  PriorityWeightQOS=50000
  PriorityWeightJobSize=500
  PriorityWeightPartition=10000
  PriorityDecayHalfLife=7-0
  PriorityCalcPeriod=5:00
  SchedulerType=sched/backfill
  SelectType=select/cons_tres
  SelectTypeParameters=CR_Core
  # COMPUTE NODES
  GresTypes=gpu
  NodeName=node103 NodeAddr=10.0.0.3 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node106 NodeAddr=10.0.0.6 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node107 NodeAddr=10.0.0.7 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node108 NodeAddr=10.0.0.8 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node109 NodeAddr=10.0.0.9 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node110 NodeAddr=10.0.0.10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
  NodeName=node111 NodeAddr=10.0.0.11 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node112 NodeAddr=10.0.0.12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node113 NodeAddr=10.0.0.13 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node114 NodeAddr=10.0.0.14 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027676
  NodeName=node116 NodeAddr=10.0.0.16 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node118 NodeAddr=10.0.0.18 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node119 NodeAddr=10.0.0.19 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node120 NodeAddr=10.0.0.20 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node121 NodeAddr=10.0.0.21 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node122 NodeAddr=10.0.0.22 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
  NodeName=node123 NodeAddr=10.0.0.23 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node124 NodeAddr=10.0.0.24 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node125 NodeAddr=192.168.0.125 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node126 NodeAddr=10.0.0.26 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
  NodeName=node130 NodeAddr=10.0.0.30 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node132 NodeAddr=10.0.0.32 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node133 NodeAddr=10.0.0.33 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node134 NodeAddr=10.0.0.34 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node135 NodeAddr=10.0.0.35 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node137 NodeAddr=10.0.0.37 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node138 NodeAddr=10.0.0.38 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node139 NodeAddr=10.0.0.39 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node140 NodeAddr=10.0.0.40 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node141 NodeAddr=10.0.0.41 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node142 NodeAddr=10.0.0.42 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node143 NodeAddr=10.0.0.43 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node145 NodeAddr=10.0.0.45 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
  NodeName=node146 NodeAddr=10.0.0.46 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
  NodeName=node147 NodeAddr=10.0.0.47 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node148 NodeAddr=10.0.0.48 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node149 NodeAddr=10.0.0.49 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  NodeName=node151 NodeAddr=10.0.0.51 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
  PartitionName=admin Nodes="" MaxTime=INFINITE State=UP AllowGroups=sudo DefMemPerNode=8192
  PartitionName=department_only Nodes=ALL MaxTime=3-0 State=UP AllowGroups=sudo,department,group_leaders DefMemPerNode=8192
  PartitionName=multicore Nodes=node[106-107] State=UP MaxTime=0-6 DefMemPerNode=1024 MaxNodes=1
  PartitionName=fpga Nodes=node[119] State=UP MaxTime=1-0 MaxNodes=1 ExclusiveUser=YES OverSubscribe=EXCLUSIVE DefMemPerNode=257565
  PartitionName=students Nodes=node[106-111,118,122,145-149,151] State=UP Default=TRUE MaxTime=1-0 DefMemPerNode=1024 MaxMemPerNode=32768 MaxCPUsPerNode=8 MaxNodes=1
  ```

- Propagate `slurm.conf` to every other node in the cluster using the `bash run_ansible_playbook.sh propagate_slurm_conf.yaml` command (a minimal sketch of such a playbook is shown after these steps).

  The current `propagate_slurm_conf.yaml` playbook

- Run `sudo scontrol reconfigure` on the controller node to refresh the configuration on each node.

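For the `slurm.conf` update, a hypothetical entry for a new node could look like the line below; take the actual CPU, socket, and memory values from `slurmd -C` on the new node instead of copying these placeholders, and remember to add the node to any partition that lists its nodes explicitly:

```
# hypothetical example: a new node150 reachable at 10.0.0.50
NodeName=node150 NodeAddr=10.0.0.50 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
```

The real `propagate_slurm_conf.yaml` lives in `~/ansible` and may differ, but a minimal sketch of such a playbook, run from the controller, is essentially a single copy task:

```yaml
# minimal sketch: push the controller's slurm.conf to every host in the inventory
- name: Propagate slurm.conf
  hosts: all
  become: true
  tasks:
    - name: Copy slurm.conf from the controller
      ansible.builtin.copy:
        src: /etc/slurm/slurm.conf
        dest: /etc/slurm/slurm.conf
        mode: "0644"
```
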
Problems with Slurm?
It may happen that Slurm complains about something.
You can debug almost everything by monitoring the status of the `slurmd` daemon with `systemctl status slurmd`.
One thing that may happen is that you've set wrong values for the node's hardware in `slurm.conf`.
Make sure all values are correct.
Another thing that may happen is that the ID of the `slurm` user is invalid.
In any case, restarting the daemon with `systemctl restart slurmd` often solves this kind of problem.

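A few commands that cover most of this debugging (all standard Slurm and systemd tools):

```bash
# daemon status and recent log messages
systemctl status slurmd
journalctl -u slurmd -e

# print the hardware values the node actually detects, to compare with slurm.conf
slurmd -C

# check the numeric ID of the slurm user
id slurm

# restart the daemon
sudo systemctl restart slurmd
```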