
FAQs

General

How do I access the cluster?

You first have to log in to the frontend node and then to the submitter using your account <user>:

ssh -J <user>@151.100.174.45 <user>@submitter

Once in the submitter you can use Slurm.
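
To avoid typing the jump host every time, you can add an entry like the following to the ~/.ssh/config file on your local machine (the di-submitter alias is just an example; replace <user> with your account):

# ~/.ssh/config on your local machine
Host di-submitter
    HostName submitter
    User <user>
    ProxyJump <user>@151.100.174.45

You can then connect with ssh di-submitter.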

For details, please refer to this page.

Why do I get "connection timed out" when connecting to the cluster?

You are probably not connected to the Sapienza network/VPN.

Additionally, remember that if you're connected through the Department's VPN you're already on the frontend network, so you can connect straight to the submitter node with ssh <user>@192.168.0.102.

How do I get an account?

Contact an admin at cluster.di@uniroma1.it stating your role (student, PhD student, researcher, professor, etc.).

Alternatively, if you have the group_leaders role, you can create a new account using the add_user_hpc command from the frontend.

How do I create an account for a student?

If you have the group_leaders role, you can create a new account using the add_user_hpc command from the frontend.

How do I get department or group_leaders role?

Contact an admin at cluster.di@uniroma1.it stating your role (student, PhD student, researcher, professor, etc.).

If you are a PhD student, a researcher, or a professor at our Department, you are eligible for the department role.

If you are a professor and you lead a laboratory, you are eligible for the group_leaders role.

How do I change my password?

Once in the frontend, use the change_password command.

Storage

How do I transfer files from/to the cluster?

Please take a look at this page of the docs.
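
The exact procedure is described there; as a quick sketch based on the access method above, you can copy files with scp or rsync through the frontend (file and directory names are just placeholders):

# Copy a local file to your home directory on the submitter, jumping through the frontend
scp -J <user>@151.100.174.45 local_file.txt <user>@submitter:~/
# Or synchronize a whole directory with rsync
rsync -avz -e "ssh -J <user>@151.100.174.45" local_dir/ <user>@submitter:~/local_dir/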

Jobs

How do I check the resources of each node?

You can do this using the sinfo command with custom formatting.

The basic command is the following:

sinfo

The output is not very detailed, though:

PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST
admin              up   infinite      3   idle node[120-122]
department_only    up 3-00:00:00     20  alloc node[103,112-114,116,123-124,126,130,132-135,137-143]
students*          up 1-00:00:00     12   idle node[106-111,145-149,151]

If you need more specific information about each node, you can use a custom format string for the output of sinfo. The following formatters show node names, partitions, CPUs per node, memory per node (in MB), GPUs per node, and time limit; the numbers specify the width of each column in characters:

sinfo --format="%40N %16P %4c %8m %24G %12l"

For a list of all the available formatters, please refer to the official Slurm documentation. The result will then look like this:

NODELIST                                 PARTITION        CPUS MEMORY   GRES                     TIMELIMIT
node120                                  admin            64   257566   gpu:quadro_rtx_6000:2    infinite
node121                                  admin            64   257566   (null)                   infinite
node122                                  admin            64   257566   gpu:quadro_rtx_6000:1    infinite
node[103,114,116,130,132-135,137-143]    department_only  64   257566+  (null)                   3-00:00:00
node[112-113,123-124,126]                department_only  64   257566   gpu:quadro_rtx_6000:2    3-00:00:00
node[106-109,145-149,151]                students*        32+  257566+  (null)                   1-00:00:00
node110                                  students*        64   257566   gpu:quadro_rtx_6000:1    1-00:00:00
node111                                  students*        64   257566   gpu:quadro_rtx_6000:2    1-00:00:00

What are the fairshare policies of the cluster?

Please refer to the priorities and fairshare sections of the current slurm.conf file below and the official Slurm documentation.
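
To see how these weights translate into your own priority, you can also query the scheduler with the standard sshare and sprio commands (a minimal sketch; see their man pages for the full options):

# Show the fairshare and effective usage of your account
sshare -u $USER
# Show the priority factors of your pending jobs
sprio -u $USER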

slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=di_hpc_salaria
SlurmctldHost=node115(10.0.0.15)
SlurmctldHost=node104(10.0.0.4)
SlurmctldParameters=enable_configless
#
#MailProg=/bin/mail
#MpiDefault=
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/slurmproc/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/slurmproc/slurmctld
#SwitchType=
TaskPlugin=task/affinity,task/cgroup
AuthType=auth/munge
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=10.0.0.5
AccountingStorageUser=slurm
AccountingStoragePort=6819
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
# PRIORITIES and FAIRSHARE
PriorityType=priority/multifactor
PriorityWeightFairshare=100000       
PriorityWeightAge=1000             
PriorityWeightQOS=50000           
PriorityWeightJobSize=500             
PriorityWeightPartition=10000           
PriorityDecayHalfLife=7-0           
PriorityCalcPeriod=5:00           
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# COMPUTE NODES
GresTypes=gpu

NodeName=node103 NodeAddr=10.0.0.3 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node106 NodeAddr=10.0.0.6 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node107 NodeAddr=10.0.0.7 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node108 NodeAddr=10.0.0.8 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node109 NodeAddr=10.0.0.9 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node110 NodeAddr=10.0.0.10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
NodeName=node111 NodeAddr=10.0.0.11 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node112 NodeAddr=10.0.0.12 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node113 NodeAddr=10.0.0.13 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node114 NodeAddr=10.0.0.14 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027676
NodeName=node116 NodeAddr=10.0.0.16 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node118 NodeAddr=10.0.0.18 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node119 NodeAddr=10.0.0.19 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node120 NodeAddr=10.0.0.20 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node121 NodeAddr=10.0.0.21 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node122 NodeAddr=10.0.0.22 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:1
NodeName=node123 NodeAddr=10.0.0.23 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node124 NodeAddr=10.0.0.24 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node125 NodeAddr=192.168.0.125 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node126 NodeAddr=10.0.0.26 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566 Gres=gpu:quadro_rtx_6000:2
NodeName=node130 NodeAddr=10.0.0.30 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node132 NodeAddr=10.0.0.32 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node133 NodeAddr=10.0.0.33 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node134 NodeAddr=10.0.0.34 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node135 NodeAddr=10.0.0.35 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node137 NodeAddr=10.0.0.37 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node138 NodeAddr=10.0.0.38 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node139 NodeAddr=10.0.0.39 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node140 NodeAddr=10.0.0.40 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node141 NodeAddr=10.0.0.41 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node142 NodeAddr=10.0.0.42 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node143 NodeAddr=10.0.0.43 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node145 NodeAddr=10.0.0.45 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1027678
NodeName=node146 NodeAddr=10.0.0.46 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257566
NodeName=node147 NodeAddr=10.0.0.47 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node148 NodeAddr=10.0.0.48 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node149 NodeAddr=10.0.0.49 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566
NodeName=node151 NodeAddr=10.0.0.51 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257566

PartitionName=admin Nodes="" MaxTime=INFINITE State=UP AllowGroups=sudo DefMemPerNode=8192
PartitionName=department_only Nodes=ALL MaxTime=3-0 State=UP AllowGroups=sudo,department,group_leaders DefMemPerNode=8192
PartitionName=multicore Nodes=node[106-107] State=UP MaxTime=0-6 DefMemPerNode=1024 MaxNodes=1
PartitionName=fpga Nodes=node[119] State=UP MaxTime=1-0 MaxNodes=1 ExclusiveUser=YES OverSubscribe=EXCLUSIVE DefMemPerNode=257565
PartitionName=students Nodes=node[106-111,118,122,145-149,151] State=UP Default=TRUE MaxTime=1-0 DefMemPerNode=1024 MaxMemPerNode=32768 MaxCPUsPerNode=8 MaxNodes=1

How do I use NVCC?

To use NVCC (the NVIDIA CUDA Compiler) you need to be on a node with a GPU. Once there, append the following lines to your ~/.bashrc file (note the single quotes, so that $PATH and $LD_LIBRARY_PATH are expanded at login rather than when you run echo):

echo "export PATH=/usr/local/cuda-12.8/bin:$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc

After that, nvcc will be available.
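
As a quick sanity check (file name and code are just an example), you can compile and run a trivial CUDA program directly on the GPU node:

cat > hello.cu <<'EOF'
#include <cstdio>

// Minimal kernel that prints from the GPU
__global__ void hello() { printf("Hello from the GPU\n"); }

int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
EOF
nvcc hello.cu -o hello
./hello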

Check nvcc's version

You can do this with the nvcc --version command:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

How do I use MPI?

Every node has OpenMPI 5.0.7 installed. You can check the details using the ompi_info command.

Please refer to Slurm's official MPI documentation for how to use it.
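
As a minimal sketch (file names and partition are illustrative, and the right --mpi option depends on how OpenMPI and Slurm were built; check ompi_info and the documentation above), compiling and launching a toy MPI program looks like this:

cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

// Each rank prints its ID and the total number of ranks
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello_mpi.c -o hello_mpi
# Request 4 tasks on the students partition; --mpi=pmix is an assumption
srun -n 4 -p students --mpi=pmix ./hello_mpi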

How do I launch a container?

You can launch containers using either Docker or Singularity. You're encouraged to use Singularity, as Docker is known to have issues on HPC systems.

To launch a non-interactive container from the Docker Hub image <image_name>, running command <command>, simply run:

srun [requirements] -p <partition> singularity exec docker://<image_name> <command>
Example

For example, to check the GPU characteristics of a node from inside a container:

srun --gpus=1 -p students singularity exec --nv docker://pytorch/pytorch:2.6.0-cuda11.8-cudnn9-devel nvidia-smi

It will take a while, since the image is big and Singularity has to convert it from the Docker format to its own SIF format:

INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
INFO:    Fetching OCI image...
INFO:    Extracting OCI image...
INFO:    Inserting Singularity configuration...
INFO:    Creating SIF file...
Tue Mar 25 11:04:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 6000                Off |   00000000:41:00.0 Off |                  Off |
| 32%   36C    P0             62W /  260W |       1MiB /  24576MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
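
If you prefer to submit the same workload as a batch job instead of an interactive srun, a minimal sbatch script might look like this (job name, partition and time limit are just examples):

#!/bin/bash
#SBATCH --job-name=singularity-test
#SBATCH --partition=students
#SBATCH --gpus=1
#SBATCH --time=00:10:00

singularity exec --nv docker://pytorch/pytorch:2.6.0-cuda11.8-cudnn9-devel nvidia-smi

You can then submit it with sbatch <script_name>.sh.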

To launch an interactive container from the Docker Hub image <image_name>, simply run:

srun [requirements] -p <partition> --pty singularity shell docker://<image_name> 
Example

For example, to connect to bash on a node with a GPU:

srun --gpus=1 -p students singularity shell --nv docker://pytorch/pytorch:2.6.0-cuda11.8-cudnn9-devel

It will take a while, since the image is big and Singularity has to convert it from the Docker format to its own SIF format (unless a cached SIF image is already available). If you then start a Python interpreter, the output will look like this:

INFO:    Using cached SIF image
Singularity> python
Python 3.11.11 | packaged by conda-forge | (main, Dec  5 2024, 14:17:24) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
1
>>>

How do I launch a Visual Studio Code instance?

You can use this command:

srun [requirements] --pty code tunnel --no-sleep
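
For example, requesting one GPU, 8 GB of memory and four hours on the students partition (resource values are only an illustration):

srun --gpus=1 --mem=8G --time=4:00:00 -p students --pty code tunnel --no-sleep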

For more details, please take a look at this page of the docs.

Why does Visual Studio Code keep disconnecting?

Visual Studio Code is intended for interactive use. Thus, if your local instance disconnects from the network, the remote instance will also be terminated.