Install Determined on Slurm/PBS
This document describes how to deploy Determined on an HPC cluster managed by the Slurm or PBS workload managers.
Tip
Store your installation commands and flags in a shell script for future use, particularly for upgrading.
The Determined master and launcher installation packages are configured for installation on a single login or administrator Slurm/PBS cluster node.
Install Determined Master
After the node has been selected and the Installation Requirements have been fulfilled and configured, install and configure the Determined master:
Install the on-premises Determined master component (not including the Determined agent) as described in Install Determined Using Linux Packages. Perform the installation and configuration steps, but stop before starting the determined-master service, and continue with the steps below.
Install the launcher.
For an RPM-based installation, run:
sudo rpm -ivh hpe-hpc-launcher-<version>.rpm
On Debian distributions, instead run:
sudo apt install ./hpe-hpc-launcher-<version>.deb
The installation configures and enables the systemd launcher service, which provides HPC management capabilities.
If launcher dependencies are not satisfied, warning messages are displayed. Install or update missing dependencies or adjust the path and ld_library_path in the next step to locate the dependencies.
You may verify the installation integrity using the appropriate package manager command. See Package Verification.
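Following the tip at the top of this page, the launcher install step could be kept in a small script. The sketch below is only an illustration: `launcher_install_cmd` is a hypothetical helper, not part of Determined, and it assumes you pass the bare file name of a package in the current directory. It prints the matching install command instead of running it, so the command can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: choose the launcher install command from the package file extension.
# launcher_install_cmd is a hypothetical helper, not part of Determined.
launcher_install_cmd() {
    local pkg="$1"
    case "$pkg" in
        *.rpm) echo "sudo rpm -ivh $pkg" ;;          # RPM-based distributions
        *.deb) echo "sudo apt install ./$pkg" ;;     # Debian distributions
        *)     echo "unsupported package type: $pkg" >&2; return 1 ;;
    esac
}
```

Keeping the printed command in your upgrade notes gives you the exact flags used at install time.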
Configure and Verify Determined Master on HPC Cluster
The launcher automatically adds a prototype resource_manager section for Slurm/PBS if not already present upon startup of the launcher service. Edit the provided resource_manager configuration section for your particular deployment. For Linux package-based installations, the configuration file is typically the /etc/determined/master.yaml file.
In this example, with Determined and the launcher colocated on a node named login, the section might resemble:
port: 8080
...
resource_manager:
  type: slurm
  master_host: login
  master_port: 8080
  host: localhost
  port: 8181
  protocol: http
  container_run_type: singularity
  auth_file: /root/.launcher.token
  job_storage_root:
  path:
  ld_library_path:
  tres_supported: true
  slot_type: cuda
The installer provides default values; however, you should explicitly configure the following cluster options:
Option
    Description
type
    The cluster workload manager (slurm or pbs).
master_host
    The host name of the Determined master. This is the name the compute nodes will utilize to communicate with the Determined master.
port
    Communication port used by the launcher. Update this value if there are conflicts with other services on your cluster.
job_storage_root
    Shared directory where job-related temporary files are stored. The directory must be visible to the launcher and from the compute nodes. If user_name is configured as a user account other than root, then the default value is $HOME/.launcher.
container_run_type
    The container type to be launched on Slurm (apptainer, singularity, enroot, or podman). The default is singularity. Specify singularity when using Apptainer.
apptainer_image_root, singularity_image_root
    Shared directory on all compute nodes where Apptainer/Singularity images are hosted. Unused unless container_run_type is singularity. See Provide a Container Image Cache for details on how this option is used.
user_name and group_name
    By default, the launcher runs from the root account. Create a local account and group and update these values to enable running from another account. This account must have access to the Slurm/PBS command line to discover partitions and summarize cluster usage. See HPC Launcher Security Considerations.
path
    If any of the launcher dependencies are not on the default path, you can override the default by updating this value.
gres_supported
    Indicates that Slurm/PBS identifies available GPUs. The default is true. See Slurm Requirements or PBS Requirements for details.
See the slurm/pbs section of the cluster configuration reference for the full list of configuration options.
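Before restarting any services, it may help to confirm that the options you intended to set actually have values. The sketch below is a rough text check, not a YAML parser. `rm_section` and `unset_options` are hypothetical helpers, the list of keys checked is an assumption drawn from the options above, and the check assumes one indented `key: value` per line under a top-level `resource_manager:` entry.

```shell
#!/usr/bin/env bash
# Sketch: list resource_manager options that have no value in a master.yaml-style file.
# Both functions are hypothetical helpers; this is a text match, not a YAML parser.

# Print the lines that follow a top-level "resource_manager:" line,
# stopping at the next top-level (non-indented, non-comment) line.
rm_section() {
    awk '/^resource_manager:/ {in_rm=1; next}
         in_rm && /^[^ \t#]/  {in_rm=0}
         in_rm' "$1"
}

# Echo each checked key that is missing or has nothing after the colon.
unset_options() {
    local file="$1" key section
    section=$(rm_section "$file")
    for key in type master_host port job_storage_root; do
        if ! grep -Eq "^[[:space:]]+${key}:[[:space:]]*[^[:space:]#]" <<<"$section"; then
            echo "$key"
        fi
    done
}
```

Run against a file containing the example section above, `unset_options /etc/determined/master.yaml` would list job_storage_root, the only one of the four checked keys left empty there.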
After changing values in the resource_manager section of the /etc/determined/master.yaml file, restart the launcher service:
sudo systemctl restart launcher
Verify successful launcher startup using the systemctl status launcher command. If the launcher fails to start, check system log diagnostics, such as /var/log/messages or journalctl --since="10 minutes ago" -u launcher, make the needed changes to the /etc/determined/master.yaml file, and restart the launcher.
If the installer reported incorrect dependencies, verify that they have been resolved by changes to the path and ld_library_path in the previous step:
sudo /etc/launcher/scripts/check-dependencies.sh
Reload the Determined master to get the updated configuration:
sudo systemctl restart determined-master
Verify successful determined-master startup using the systemctl status determined-master command. If the determined-master fails to start, check system log diagnostics, such as /var/log/messages or journalctl --since="10 minutes ago" -u determined-master, make the needed changes to the /etc/determined/master.yaml file, and restart the determined-master.
If the compute nodes of your cluster do not have internet connectivity to download Docker images, see Provide a Container Image Cache.
If internet connectivity requires use of a proxy, make sure the proxy variables are defined as per Proxy Configuration Requirements.
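The restart-and-verify pattern used above for both the launcher and determined-master could be wrapped in one function. This is a minimal sketch: `restart_and_verify` is a hypothetical helper, it assumes the account can use sudo, and it checks the unit state right after the restart, so a service that is slow to start may need a second look with systemctl status.

```shell
#!/usr/bin/env bash
# Sketch: restart a systemd unit and show recent logs if it is not active.
# restart_and_verify is a hypothetical helper, not part of Determined.
restart_and_verify() {
    local unit="$1"
    sudo systemctl restart "$unit"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is active"
    else
        # Same journalctl query the steps above suggest for diagnostics.
        echo "$unit failed to start; recent log entries:" >&2
        journalctl --since="10 minutes ago" -u "$unit" --no-pager >&2
        return 1
    fi
}
```

For example, `restart_and_verify launcher && restart_and_verify determined-master` stops at the first service that does not come up.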
Log in to Determined (see User Accounts). The Determined user must be linked to a user on the HPC cluster. If you are signed in with a Determined administrator account, the following example creates a Determined user account that is linked to the current user’s Linux account:
det user create $USER
det user link-with-agent-user --agent-uid $(id -u) --agent-gid $(id -g) --agent-user $USER --agent-group $(id -gn) $USER
det user login $USER
Note
If an agent user has not been configured for a Determined username, jobs will run as user root. For more details see Run Tasks as Specific Agent Users.
Sanity-check your Determined configuration:
det command run hostname
A successful configuration reports the hostname of the compute node selected by the workload manager to run the job.
Run a simple distributed training job such as the PyTorch MNIST Tutorial to verify that it completes successfully. This validates Determined master and launcher communication, access to the shared filesystem, GPU scheduling, and high-speed interconnect configuration. For more complete validation, ensure that slots_per_trial is at least twice the number of GPUs available on a single node.
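As a worked example of the slots_per_trial guideline, the sketch below computes the smallest value that meets it. The function name `min_slots_per_trial` is hypothetical and the 4-GPU node is an assumed figure; substitute the GPU count of your own nodes. With 4 GPUs per node, the guideline asks for a slots_per_trial of 8 or more, which a single node cannot satisfy, so the trial has to span nodes and exercise the interconnect.

```shell
#!/usr/bin/env bash
# Sketch: smallest slots_per_trial that satisfies the "at least twice" guideline.
# min_slots_per_trial is a hypothetical helper, not part of Determined.
min_slots_per_trial() {
    local gpus_per_node="$1"
    echo $(( 2 * gpus_per_node ))
}

# Assumed example: nodes with 4 GPUs each.
min_slots_per_trial 4
```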