Set up a new node
There are a few steps to follow carefully.
Set up the connection¶
If you've installed a new OS from scratch, you'll need to set up a static IP. To do so:
- Connect to the IDRAC interface. To do so, you must be connected to the private VPN.
- Identify the number of the node (e.g. node102, node125, etc.). Let's suppose you want to set up node125.
- Create a new file in /etc/netplan called however you like. Let's take 00-installer-config.yaml as an example:

```bash
sudo touch /etc/netplan/00-installer-config.yaml
```

- Edit the file by adding the following:

```yaml
network:
  ethernets:
    eno4:
      addresses:
        - 192.168.0.125/24
      gateway4: 192.168.0.52
      nameservers:
        addresses:
          - 151.100.4.2
          - 151.100.4.13
        search:
          - di.uniroma1.it
    ibp129s0:
      addresses:
        - 10.0.0.25/24
  version: 2
```

Pay attention to the 192.168.0.125/24 and 10.0.0.25/24 lines, as those are the only ones you have to tune based on the number of your new node, which in this case is 125/25.

- Reboot the system.

If everything went right, you should be able to connect to the new node with ssh 192.168.0.125.
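If you want to double-check the addresses (or apply the configuration without a full reboot), something like the following should work. This is a small sketch assuming netplan manages the interfaces, as in the file above; the addresses are the node125 example:

```bash
# Validate and apply the new netplan configuration (netplan try reverts on failure)
sudo netplan try
sudo netplan apply

# Verify that both interfaces got their addresses
ip addr show eno4      # should list 192.168.0.125/24
ip addr show ibp129s0  # should list 10.0.0.25/24
```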
Set up the storage¶
Setting up the storage should be the very first thing you do on a new node, as the shared storage contains useful scripts that will help you set up every other component.
The storage nodes are currently node127 and node128, whose shared directories are data1 and data2, respectively.
These directories are exported using NFS.
To install NFS on a new node, you can do the following:

- Install nfs if it isn't already installed.
- Create the new folders to mount.
- Edit the /etc/fstab file by adding the required lines (a sketch is shown after this list).
- Reload /etc/fstab (the command is included in the sketch below).
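The exact entries depend on how node127 and node128 export their directories; as a rough sketch (the export paths and mount points here are assumptions, so verify the real ones with showmount before editing anything):

```bash
# Install the NFS client utilities (Ubuntu)
sudo apt update && sudo apt install -y nfs-common

# Create the mount points
sudo mkdir -p /data1 /data2

# Check what the storage nodes actually export
showmount -e node127
showmount -e node128

# Hypothetical /etc/fstab entries -- adjust paths and options to the real exports
# node127:/data1  /data1  nfs  defaults,_netdev  0  0
# node128:/data2  /data2  nfs  defaults,_netdev  0  0

# Mount everything listed in /etc/fstab
sudo mount -a
```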
If everything went right, at the next login you should land in the shared /home/<user>.
Update the OS¶
Currently, every node runs on an Ubuntu LTS distro.
It is pretty straightforward to update Ubuntu by following online tutorials, but you can also simply run the update_os.sh script.
Script to automatically update the OS
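update_os.sh is not reproduced here; as a sketch of the usual Ubuntu update sequence (an assumption about what the script does, not its actual contents):

```bash
#!/usr/bin/env bash
# Sketch of a typical Ubuntu update routine -- the real update_os.sh may differ
set -euo pipefail

sudo apt update                 # refresh package lists
sudo apt -y full-upgrade        # upgrade all installed packages
sudo apt -y autoremove --purge  # remove packages that are no longer needed

# Reboot if a kernel or core library update requires it
if [ -f /var/run/reboot-required ]; then
    sudo reboot
fi
```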
Set up Munge¶
Munge is needed by Slurm to validate credentials. While Munge is already set up on the controller node, you can install it on another node by using a script.

- Copy the Munge key from /etc/munge/munge.key on the controller node to the node you want to set up (a manual sketch follows this list).
- Run the install_munge_on_other_node.sh script on the new node to install Munge and copy the key.

Script to automatically install Munge on a new node
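If you prefer to do the key copy by hand, a sketch (assuming root SSH access to the new node; node125 is the running example, and the ownership and permissions are what munged requires):

```bash
# On the new node: make sure the munge package (and thus the munge user) exists
sudo apt install -y munge

# On the controller node: copy the key to the new node
sudo scp /etc/munge/munge.key root@192.168.0.125:/etc/munge/munge.key

# On the new node: munged refuses to start unless the key has strict ownership/permissions
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl enable --now munge
```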
If everything went right, you should see a success message similar to the following:
```
STATUS: Success (0)
ENCODE_HOST: storageserver1 (127.0.1.1)
ENCODE_TIME: 2025-06-11 08:56:11 +0000 (1749632171)
DECODE_TIME: 2025-06-11 08:56:14 +0000 (1749632174)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
```
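This report is what unmunge prints, so you can repeat the check at any time; decoding on another machine (node115 here stands for the controller from the slurm.conf below) also verifies that the key is shared correctly:

```bash
# Encode a credential locally and decode it on the same node
munge -n | unmunge

# Encode locally, decode on the controller to verify the shared key and clock skew
munge -n | ssh node115 unmunge
```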
Set up Slurm¶
Install Slurm¶
To install Slurm on a new node you can use the install_slurm.sh script by running sudo bash install_slurm.sh <role>, where <role> is one of controller, compute, login, or db.
For a new node this will usually be compute.
Script to automatically install Slurm on a new node
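install_slurm.sh is not reproduced here; as a rough sketch of what the compute role typically boils down to on Ubuntu (package names from the standard repositories, so this is an assumption, not the script's actual contents):

```bash
# Hypothetical sketch of the "compute" role -- the real install_slurm.sh may differ
sudo apt update
sudo apt install -y slurmd slurm-client   # slurmd daemon plus the user commands

# Spool directory expected by the cluster's slurm.conf (SlurmdSpoolDir)
sudo mkdir -p /var/slurmproc/slurmd

# Enable the daemon; start it only after slurm.conf has been propagated (next section)
sudo systemctl enable slurmd
```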
Update slurm.conf¶
Now it is time to update /etc/slurm/slurm.conf on each node to let Slurm know about the new node in the cluster.
To make this easier, you can update just the file on the controller node and then use Ansible to propagate it to all the other nodes.
- Update the Ansible inventory located in ~/ansible/inventory.ini with the new node.

  The current Ansible inventory
- Update the Slurm configuration file located in /etc/slurm/slurm.conf on the controller.

  The current slurm.conf:

```
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=di_hpc_salaria
SlurmctldHost=node115(10.0.0.15)
SlurmctldHost=node104(10.0.0.4)
SlurmctldParameters=enable_configless
#
#MailProg=/bin/mail
#MpiDefault=
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/slurmproc/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/slurmproc/slurmctld
#SwitchType=
TaskPlugin=task/affinity,task/cgroup
AuthType=auth/munge
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=10.0.0.5
AccountingStorageUser=slurm
AccountingStoragePort=6819
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
# PRIORITIES and FAIRSHARE
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightQOS=50000
PriorityWeightJobSize=500
PriorityWeightPartition=10000
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5:00
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# COMPUTE NODES
GresTypes=gpu
NodeName=node103 NodeAddr=10.0.0.3 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node106 NodeAddr=10.0.0.6 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node107 NodeAddr=10.0.0.7 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node108 NodeAddr=10.0.0.8 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node109 NodeAddr=10.0.0.9 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node110 NodeAddr=10.0.0.10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
NodeName=node111 NodeAddr=10.0.0.11 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node112 NodeAddr=10.0.0.12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node113 NodeAddr=10.0.0.13 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node114 NodeAddr=10.0.0.14 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027676
NodeName=node116 NodeAddr=10.0.0.16 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node118 NodeAddr=10.0.0.18 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node119 NodeAddr=10.0.0.19 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node120 NodeAddr=10.0.0.20 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node121 NodeAddr=10.0.0.21 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node122 NodeAddr=10.0.0.22 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
NodeName=node123 NodeAddr=10.0.0.23 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node124 NodeAddr=10.0.0.24 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node125 NodeAddr=192.168.0.125 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node126 NodeAddr=10.0.0.26 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node130 NodeAddr=10.0.0.30 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node132 NodeAddr=10.0.0.32 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node133 NodeAddr=10.0.0.33 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node134 NodeAddr=10.0.0.34 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node135 NodeAddr=10.0.0.35 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node137 NodeAddr=10.0.0.37 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node138 NodeAddr=10.0.0.38 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node139 NodeAddr=10.0.0.39 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node140 NodeAddr=10.0.0.40 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node141 NodeAddr=10.0.0.41 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node142 NodeAddr=10.0.0.42 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node143 NodeAddr=10.0.0.43 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node145 NodeAddr=10.0.0.45 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node146 NodeAddr=10.0.0.46 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node147 NodeAddr=10.0.0.47 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node148 NodeAddr=10.0.0.48 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node149 NodeAddr=10.0.0.49 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node151 NodeAddr=10.0.0.51 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
PartitionName=admin Nodes="" MaxTime=INFINITE State=UP AllowGroups=sudo DefMemPerNode=8192
PartitionName=department_only Nodes=ALL MaxTime=3-0 State=UP AllowGroups=sudo,department,group_leaders DefMemPerNode=8192
PartitionName=multicore Nodes=node[106-107] State=UP MaxTime=0-6 DefMemPerNode=1024 MaxNodes=1
PartitionName=fpga Nodes=node[119] State=UP MaxTime=1-0 MaxNodes=1 ExclusiveUser=YES OverSubscribe=EXCLUSIVE DefMemPerNode=257565
PartitionName=students Nodes=node[106-111,118,122,145-149,151] State=UP Default=TRUE MaxTime=1-0 DefMemPerNode=1024 MaxMemPerNode=32768 MaxCPUsPerNode=8 MaxNodes=1
```
- Propagate slurm.conf to every other node in the cluster using the bash run_ansible_playbook.sh propagate_slurm_conf.yaml command.

  The current propagate_slurm_conf.yaml playbook
- Run a sudo scontrol reconfigure on the controller node to refresh the configuration on each node.
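The playbook itself is not reproduced here; a hedged equivalent using an ad-hoc Ansible command (the compute group name is an assumption, so check ~/ansible/inventory.ini for the real one):

```bash
# Push /etc/slurm/slurm.conf from the controller to every node in the "compute" inventory group
ansible -i ~/ansible/inventory.ini compute -b \
  -m copy -a "src=/etc/slurm/slurm.conf dest=/etc/slurm/slurm.conf owner=root group=root mode=0644"

# Then ask the running daemons to re-read the configuration
sudo scontrol reconfigure
```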
Problems with Slurm?¶
It may happen that Slurm will complain about something.
You can debug almost everything by monitoring the status of the slurmd daemon with systemctl status slurmd.
One common cause is wrong hardware values for the node in slurm.conf (CPUs, sockets, cores, threads, RealMemory).
Make sure they match what the node actually has.
Another possible cause is that the slurm user's ID is invalid, i.e. it does not match the one used on the other nodes.
In any case, restarting the daemon with systemctl restart slurmd often helps.
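A few commands that usually narrow the problem down (node125 is just the running example; run the scontrol and sinfo ones on the controller):

```bash
# Daemon state and recent log lines on the affected node
systemctl status slurmd
sudo tail -n 50 /var/log/slurmd.log

# Print the hardware configuration slurmd detects on this node --
# compare it with the node's NodeName line in slurm.conf
slurmd -C

# How the controller sees the node, and the reason it is down or drained
scontrol show node node125
sinfo -R

# The slurm user's UID/GID should match across all nodes
id slurm
```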