
Set up a new node

There are a few steps to follow carefully.

Set up the connection

If you've installed a new OS from scratch, you'll need to set up a static IP. To do so:

  1. Connect to the iDRAC interface. To do so, you must be connected to the private VPN.

  2. Identify the number of the new node (e.g. node102, node125, etc.). Let's suppose you want to set up node125.

  3. Create a new file in /etc/netplan; the name is up to you. As an example, 00-installer-config.yaml: sudo touch /etc/netplan/00-installer-config.yaml

  4. Edit the file by adding the following:

    network:
        ethernets:
            eno4:
                addresses:
                - 192.168.0.125/24
                gateway4: 192.168.0.52
                nameservers:
                    addresses:
                    - 151.100.4.2
                    - 151.100.4.13
                    search:
                    - di.uniroma1.it
            ibp129s0:
                addresses:
                - 10.0.0.25/24
        version: 2
    

    Pay attention to the 192.168.0.125/24 and 10.0.0.25/24 lines, as those are the only ones you have to tune based on the number of your new node: node125 gets 192.168.0.125 on eno4 and 10.0.0.25 on ibp129s0.

  5. Reboot the system.

If everything went right, you should be able to connect to the new node with ssh 192.168.0.125.
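The addressing convention above can be sketched as a tiny helper (illustrative only; derive_ips is a made-up name, and the node-number-to-address mapping is inferred from the examples in this guide):

```shell
#!/bin/sh
# derive_ips: given a node number, print the eno4 and ibp129s0 addresses
# following the convention used above (192.168.0.<n> and 10.0.0.<n-100>).
derive_ips() {
    node=$1
    echo "192.168.0.${node}/24 10.0.0.$((node - 100))/24"
}

derive_ips 125   # → 192.168.0.125/24 10.0.0.25/24
```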

Set up the storage

Setting up the storage should be the very first thing to do when setting up a new node, as it contains useful scripts that will help you set up every other component.

The storage nodes are currently node127 and node128, whose shared directories are data1 and data2 respectively. These directories are exported via NFS. To mount them on a new node:

  1. Install the NFS client if it isn't already installed:
    sudo apt install nfs-common
    
  2. Create the new folders to mount:
    sudo mkdir /data1 /data2
    
  3. Edit the /etc/fstab file by adding the following lines. Note that sudo echo ... >> /etc/fstab does not work, because the redirection runs with your own privileges; pipe through sudo tee -a instead:
    echo "10.0.0.27:/data1/data1 /data1 nfs rw,sync,hard,intr,_netdev 0 0" | sudo tee -a /etc/fstab
    echo "10.0.0.28:/data2/data2 /data2 nfs rw,sync,hard,intr,_netdev 0 0" | sudo tee -a /etc/fstab
    echo "10.0.0.27:/data1/home /home nfs rw,sync,hard,intr,_netdev 0 0" | sudo tee -a /etc/fstab
    
  4. Reload /etc/fstab with the following command:
    sudo mount -a
    

If everything went right, at the next login you should land in the shared /home/<user>.
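As a quick sanity check after editing fstab, something like the following verifies that all three NFS entries are present (a rough sketch; check_fstab is a made-up helper, and the file path is a parameter so it can be tried on a copy first):

```shell
#!/bin/sh
# check_fstab: report whether an fstab-style file contains an NFS entry
# for each of the expected mount points (/data1, /data2, /home).
check_fstab() {
    fstab=$1
    for d in /data1 /data2 /home; do
        if grep -q " $d nfs " "$fstab"; then
            echo "$d: ok"
        else
            echo "$d: missing"
        fi
    done
}
```

Run it as check_fstab /etc/fstab; any "missing" line means the corresponding entry was not added.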

Update the OS

Currently, every node runs an Ubuntu LTS distro. Updating Ubuntu by following online tutorials is straightforward, but you can simply run the update_os.sh script.

Script to automatically update the OS
update_os.sh
#!/bin/bash
set -e

sudo sed -i 's/^Prompt=.*/Prompt=normal/' /etc/update-manager/release-upgrades
sudo apt update -y && sudo apt upgrade -y && sudo apt dist-upgrade -y
sudo do-release-upgrade
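The sed line switches the upgrade policy in /etc/update-manager/release-upgrades to normal, whatever it was before. Its effect can be tried safely on a throwaway copy of the file:

```shell
#!/bin/sh
# Demonstrate the sed substitution from update_os.sh on a throwaway file
# instead of the real /etc/update-manager/release-upgrades.
f=$(mktemp)
printf '%s\n' '[DEFAULT]' 'Prompt=lts' > "$f"
sed -i 's/^Prompt=.*/Prompt=normal/' "$f"
grep '^Prompt=' "$f"   # → Prompt=normal
rm -f "$f"
```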

Set up Munge

Munge is needed by Slurm to validate credentials. While Munge is already set up on the controller node, you can install it on any other node with a script.

  1. Copy the Munge key from /etc/munge/munge.key on the controller node into your home directory, then transfer it to the node you want to set up (e.g. with scp):

    sudo cp /etc/munge/munge.key ~/munge.key
    
  2. Run the install_munge_on_other_node.sh script on the new node to install Munge and copy the key.

    Script to automatically install Munge on a new node
    install_munge_on_other_node.sh
    #!/bin/bash
    set -e
    
    apt install -y munge libmunge2 libmunge-dev
    cp munge.key /etc/munge/munge.key
    chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
    chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
    chmod 0755 /run/munge/
    chmod 0700 /etc/munge/munge.key
    chown -R munge: /etc/munge/munge.key
    systemctl enable munge
    systemctl restart munge
    munge -n | sudo ssh guest@192.168.0.127 unmunge
    

If everything went right, you should see a success message similar to the following:

STATUS:           Success (0)
ENCODE_HOST:      storageserver1 (127.0.1.1)
ENCODE_TIME:      2025-06-11 08:56:11 +0000 (1749632171)
DECODE_TIME:      2025-06-11 08:56:14 +0000 (1749632174)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
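Since every node must hold a byte-identical copy of munge.key, comparing checksums is a quick extra check (a sketch; same_key is a hypothetical helper, and the arguments are whichever two copies of the key you want to compare):

```shell
#!/bin/sh
# same_key: compare two files by checksum and report whether they match.
# Useful for confirming that the copied munge.key equals the controller's.
same_key() {
    if [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]; then
        echo identical
    else
        echo different
    fi
}
```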

Set up Slurm

Install Slurm

To install Slurm on a new node you can use the install_slurm.sh script by running sudo bash install_slurm.sh <role>, where <role> must be one of controller, compute, login, or db (usually compute).

Script to automatically install Slurm on a new node
install_slurm.sh
#!/bin/bash
set -e

# arguments parsing
if [[ ! $# == 1 ]] ; then
    echo "You have to supply exactly one argument"
    exit 1
fi

NODE_TYPE=$1

case "$NODE_TYPE" in
    controller|compute|login|db)
        echo "Setting up a $NODE_TYPE node"
        ;;
    *)
        echo "Invalid option. Please enter one of: controller, compute, login, db."
        exit 1
        ;;
esac

# install build prerequisites
apt install -y build-essential fakeroot devscripts equivs

# download slurm
cd /tmp
if [ -d slurm-24.11.1 ]; then
  echo "Removing /tmp/slurm-24.11.1 directory to perform a clean installation"
  rm -rf slurm-24.11.1
fi
wget -O slurm-24.11.1.tar.bz2 https://download.schedmd.com/slurm/slurm-24.11.1.tar.bz2
tar -xaf slurm-24.11.1.tar.bz2

# install slurm
cd slurm-24.11.1
mk-build-deps -i debian/control
debuild -b -uc -us
mkdir -p /var/slurmproc

# install the packages based on the type of node
cd ..
dpkg -i slurm-smd_24.11.1-1_amd64.deb
case "$NODE_TYPE" in
    controller)
        sudo dpkg -i slurm-smd-client_24.11.1-1_amd64.deb slurm-smd-slurmctld_24.11.1-1_amd64.deb
        ;;
    compute)
        sudo dpkg -i slurm-smd-client_24.11.1-1_amd64.deb slurm-smd-slurmd_24.11.1-1_amd64.deb
        ;;
    login)
        sudo dpkg -i slurm-smd-client_24.11.1-1_amd64.deb
        ;;
    db)
        sudo dpkg -i slurm-smd-client_24.11.1-1_amd64.deb slurm-smd-slurmdbd_24.11.1-1_amd64.deb
        ;;
esac
echo "SLURM has been installed successfully"

# clean up build artifacts
rm -rf /tmp/slurm*

Update slurm.conf

Now it is time to update /etc/slurm/slurm.conf on each node to let Slurm know about the new node in the cluster.

To simplify things, you can update the file only on the controller node and then use Ansible to propagate it to all other nodes.

  1. Update the Ansible inventory located in ~/ansible/inventory.ini with the new node.

    The current Ansible inventory
    inventory.ini
    [all:vars]
    ansible_connection=ssh
    ansible_ssh_common_args='-o StrictHostKeyChecking=no'
    ansible_user=guest
    
    [cluster]
    192.168.0.[103:116] ib_address=10.0.0.[3:16]
    192.168.0.[119:124] ib_address=10.0.0.[19:24]
    192.168.0.126 ib_address=10.0.0.26
    192.168.0.130 ib_address=10.0.0.30
    192.168.0.[132:135] ib_address=10.0.0.[32:35]
    192.168.0.[137:143] ib_address=10.0.0.[37:43]
    192.168.0.[145:149] ib_address=10.0.0.[45:49]
    192.168.0.151 ib_address=10.0.0.51
    
    [login_nodes]
    192.168.0.102 ib_address=10.0.0.2
    
    [controller_nodes]
    192.168.0.104 ib_address=10.0.0.4
    192.168.0.115 ib_address=10.0.0.15
    
    [db_nodes]
    192.168.0.105 ib_address=10.0.0.5
    
    [compute_nodes]
    192.168.0.103 ib_address=10.0.0.3
    192.168.0.[106:114] ib_address=10.0.0.[6:14]
    192.168.0.116 ib_address=10.0.0.16
    192.168.0.[119:124] ib_address=10.0.0.[19:24]
    192.168.0.126 ib_address=10.0.0.26
    192.168.0.130 ib_address=10.0.0.30
    192.168.0.[132:135] ib_address=10.0.0.[32:35]
    192.168.0.[137:143] ib_address=10.0.0.[37:43]
    192.168.0.[145:149] ib_address=10.0.0.[45:49]
    192.168.0.151 ib_address=10.0.0.51
    
    [nodes_without_gpu]
    192.168.0.[103:109] ib_address=10.0.0.[3:9]
    192.168.0.[114:116] ib_address=10.0.0.[14:16]
    192.168.0.121 ib_address=10.0.0.21
    192.168.0.130 ib_address=10.0.0.30
    192.168.0.[132:135] ib_address=10.0.0.[32:35]
    192.168.0.[137:143] ib_address=10.0.0.[37:43]
    192.168.0.[145:149] ib_address=10.0.0.[45:49]
    192.168.0.151 ib_address=10.0.0.51
    
    [nodes_with_1_gpu]
    192.168.0.110 ib_address=10.0.0.10
    192.168.0.122 ib_address=10.0.0.22
    
    [nodes_with_2_gpus]
    192.168.0.[111:113] ib_address=10.0.0.[11:13]
    192.168.0.120 ib_address=10.0.0.20
    192.168.0.[123:124] ib_address=10.0.0.[23:24]
    192.168.0.126 ib_address=10.0.0.26
    
    [nodes_with_fpga]
    192.168.0.119 ib_address=10.0.0.19
    
    [admin_partition]
    192.168.0.121 ib_address=10.0.0.21
    
    [department_only_partition]
    192.168.0.103 ib_address=10.0.0.3
    192.168.0.[112:114] ib_address=10.0.0.[12:14]
    192.168.0.116 ib_address=10.0.0.16
    192.168.0.120 ib_address=10.0.0.20
    192.168.0.[123:124] ib_address=10.0.0.[23:24]
    192.168.0.126 ib_address=10.0.0.26
    192.168.0.130 ib_address=10.0.0.30
    192.168.0.[132:135] ib_address=10.0.0.[32:35]
    192.168.0.[137:143] ib_address=10.0.0.[37:43]
    
    [multicore_partition]
    192.168.0.[106:107] ib_address=10.0.0.[6:7]
    
    [students_partition]
    192.168.0.[108:111] ib_address=10.0.0.[8:11]
    192.168.0.122 ib_address=10.0.0.22
    192.168.0.[145:149] ib_address=10.0.0.[45:49]
    192.168.0.151 ib_address=10.0.0.51
    
  2. Update the Slurm configuration file located in /etc/slurm/slurm.conf on the controller.

    The current slurm.conf
    slurm.conf
    # slurm.conf file generated by configurator easy.html.
    # Put this file on all nodes of your cluster.
    # See the slurm.conf man page for more information.
    #
    ClusterName=di_hpc_salaria
    SlurmctldHost=node115(10.0.0.15)
    SlurmctldHost=node104(10.0.0.4)
    SlurmctldParameters=enable_configless
    #
    #MailProg=/bin/mail
    #MpiDefault=
    #MpiParams=ports=#-#
    ProctrackType=proctrack/cgroup
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurmd.pid
    #SlurmdPort=6818
    SlurmdSpoolDir=/var/slurmproc/slurmd
    SlurmUser=slurm
    #SlurmdUser=root
    StateSaveLocation=/var/slurmproc/slurmctld
    #SwitchType=
    TaskPlugin=task/affinity,task/cgroup
    AuthType=auth/munge
    #
    #
    # TIMERS
    #KillWait=30
    #MinJobAge=300
    #SlurmctldTimeout=120
    #SlurmdTimeout=300
    #
    #
    #
    #
    # LOGGING AND ACCOUNTING
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=10.0.0.5
    AccountingStorageUser=slurm
    AccountingStoragePort=6819
    #JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/cgroup
    #SlurmctldDebug=info
    SlurmctldLogFile=/var/log/slurmctld.log
    #SlurmdDebug=info
    SlurmdLogFile=/var/log/slurmd.log
    #
    # PRIORITIES and FAIRSHARE
    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000       
    PriorityWeightAge=1000             
    PriorityWeightQOS=50000           
    PriorityWeightJobSize=500             
    PriorityWeightPartition=10000           
    PriorityDecayHalfLife=7-0           
    PriorityCalcPeriod=5:00           
    SchedulerType=sched/backfill
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
    
    # COMPUTE NODES
    GresTypes=gpu
    
    NodeName=node103 NodeAddr=10.0.0.3 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node106 NodeAddr=10.0.0.6 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node107 NodeAddr=10.0.0.7 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node108 NodeAddr=10.0.0.8 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node109 NodeAddr=10.0.0.9 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
    NodeName=node110 NodeAddr=10.0.0.10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
    NodeName=node111 NodeAddr=10.0.0.11 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node112 NodeAddr=10.0.0.12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node113 NodeAddr=10.0.0.13 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node114 NodeAddr=10.0.0.14 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027676
    NodeName=node116 NodeAddr=10.0.0.16 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node118 NodeAddr=10.0.0.18 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
    NodeName=node119 NodeAddr=10.0.0.19 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
    NodeName=node120 NodeAddr=10.0.0.20 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node121 NodeAddr=10.0.0.21 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node122 NodeAddr=10.0.0.22 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
    NodeName=node123 NodeAddr=10.0.0.23 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node124 NodeAddr=10.0.0.24 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node125 NodeAddr=10.0.0.25 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node126 NodeAddr=10.0.0.26 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
    NodeName=node130 NodeAddr=10.0.0.30 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node132 NodeAddr=10.0.0.32 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node133 NodeAddr=10.0.0.33 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node134 NodeAddr=10.0.0.34 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node135 NodeAddr=10.0.0.35 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node137 NodeAddr=10.0.0.37 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node138 NodeAddr=10.0.0.38 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node139 NodeAddr=10.0.0.39 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node140 NodeAddr=10.0.0.40 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node141 NodeAddr=10.0.0.41 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
    NodeName=node142 NodeAddr=10.0.0.42 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
    NodeName=node143 NodeAddr=10.0.0.43 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
    NodeName=node145 NodeAddr=10.0.0.45 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
    NodeName=node146 NodeAddr=10.0.0.46 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
    NodeName=node147 NodeAddr=10.0.0.47 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
    NodeName=node148 NodeAddr=10.0.0.48 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
    NodeName=node149 NodeAddr=10.0.0.49 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
    NodeName=node151 NodeAddr=10.0.0.51 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
    
    PartitionName=admin Nodes="" MaxTime=INFINITE State=UP AllowGroups=sudo DefMemPerNode=8192
    PartitionName=department_only Nodes=ALL MaxTime=3-0 State=UP AllowGroups=sudo,department,group_leaders DefMemPerNode=8192
    PartitionName=multicore Nodes=node[106-107] State=UP MaxTime=0-6 DefMemPerNode=1024 MaxNodes=1
    PartitionName=fpga Nodes=node[119] State=UP MaxTime=1-0 MaxNodes=1 ExclusiveUser=YES OverSubscribe=EXCLUSIVE DefMemPerNode=257565
    PartitionName=students Nodes=node[106-111,118,122,145-149,151] State=UP Default=TRUE MaxTime=1-0 DefMemPerNode=1024 MaxMemPerNode=32768 MaxCPUsPerNode=8 MaxNodes=1
    
  3. Propagate slurm.conf to every other node in the cluster using the bash run_ansible_playbook.sh propagate_slurm_conf.yaml command.

    The current propagate_slurm_conf.yaml playbook
    propagate_slurm_conf.yaml
    - name: Propagate slurm.conf
      hosts:
      - controller_nodes
      - login_nodes
      - db_nodes
      - compute_nodes
      become: true
      become_user: root
      tasks:
      - name: Create /etc/slurm if it does not exist
        file:
          path: /etc/slurm
          state: directory
      - name: Copy slurm.conf from central node to other nodes
        copy:
          src: /etc/slurm/slurm.conf
          dest: /etc/slurm/slurm.conf
      - name: Copy gres.conf from central node to other nodes
        copy:
          src: /etc/slurm/gres.conf
          dest: /etc/slurm/gres.conf
      - name: Ensure there is no oci.conf
        file:
          path: /etc/slurm/oci.conf
          state: absent
      - name: Ensure 'slurm' and 'users' group exists
        group:
          name: "{{ item }}"
          state: present
        loop:
        - slurm
        - users
      - name: Set slurm user uid correctly
        shell:
          cmd: sudo usermod -u 1217 slurm
      # - name: Reconfigure Slurm for changes to take effect
      #   shell:
      #     cmd: scontrol reconfigure
      #   delegate_to: "{{ ansible_hostname }}"
    
  4. Run sudo scontrol reconfigure on the controller node to refresh the configuration on each node.
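As an aside, the bracket ranges in inventory.ini (e.g. 192.168.0.[103:116]) are expanded by Ansible itself into individual hosts; the expansion behaves roughly like this sketch (expand_range is an illustrative name, not an Ansible command):

```shell
#!/bin/sh
# expand_range: print prefixN for each N from lo to hi, mimicking how an
# inventory pattern like 192.168.0.[103:116] expands into individual hosts.
expand_range() {
    prefix=$1 lo=$2 hi=$3
    i=$lo
    while [ "$i" -le "$hi" ]; do
        echo "${prefix}${i}"
        i=$((i + 1))
    done
}

expand_range 192.168.0. 103 105
# → 192.168.0.103
#   192.168.0.104
#   192.168.0.105
```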

Problems with Slurm?

It may happen that Slurm complains about something. You can debug almost everything by checking the status of the slurmd daemon with systemctl status slurmd.

One common cause is wrong hardware values for the node in slurm.conf (CPUs, sockets, cores, threads, memory). Make sure they match the actual hardware.
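A quick way to cross-check the hardware values in a slurm.conf node line: CPUs should equal SocketsPerBoard × CoresPerSocket × ThreadsPerCore (and on the node itself, slurmd -C prints the hardware Slurm actually detects). For example, with node103's values:

```shell
#!/bin/sh
# Cross-check a slurm.conf node line: CPUs must equal
# SocketsPerBoard * CoresPerSocket * ThreadsPerCore.
sockets=2 cores=16 threads=2          # node103's values in slurm.conf above
echo $((sockets * cores * threads))   # → 64, matching CPUs=64
```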

Another is an invalid UID for the slurm user (the Ansible playbook above sets it to 1217 on every node).

In any case, restarting the daemon with systemctl restart slurmd often resolves the issue.