Merged
12 changes: 7 additions & 5 deletions README.md
@@ -38,19 +38,21 @@ It is recommended to use the latest release branch for stable code (linked above

The provisioning system orchestrates the running of all playbooks, and one is needed when instantiating Kubernetes or Slurm clusters. Tested and supported operating systems include:

-- NVIDIA DGX OS 4, 5
-- Ubuntu 18.04 LTS, 20.04, 22.04 LTS
+- NVIDIA DGX OS 4, 5, 6, 7
+- Ubuntu 18.04 LTS, 20.04, 22.04 LTS, 24.04 LTS
- CentOS 7, 8

### Cluster System

The cluster nodes will follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as a provisioning system but it is not required.

-- NVIDIA DGX OS 4, 5
-- Ubuntu 18.04 LTS, 20.04, 22.04 LTS
+- NVIDIA DGX OS 4, 5, 6, 7
+- Ubuntu 18.04 LTS, 20.04, 22.04 LTS, 24.04 LTS
- CentOS 7, 8
+- Red Hat Enterprise Linux / Rocky Linux 8 and 9 for the DGX software stack through the `nvidia-dgx` role

You may also install a supported operating system on all servers via a 3rd-party solution (e.g. [MAAS](https://maas.io/), [Foreman](https://www.theforeman.org/)) or utilize the provided [OS install container](docs/pxe/minimal-pxe-container.md).
+For DGX platform software installation on top of vanilla Ubuntu or Red Hat family operating systems, see the [DGX software stack role guide](docs/deepops/dgx-software-stack.md).

### Kubernetes

@@ -77,7 +79,7 @@ For more information on Slurm in general, refer to the [official Slurm docs](htt
[NVIDIA Bright Cluster Manager](https://www.brightcomputing.com/brightclustermanager) is recommended as an enterprise solution which enables managing multiple workload managers within a single cluster, including Kubernetes, Slurm, Univa Grid Engine, and PBS Pro.

**DeepOps does not test or support a configuration where nodes run heterogeneous operating systems.**
-Additional modifications are needed if you plan to use unsupported operating systems such as RHEL.
+The `nvidia-dgx` role can install NVIDIA DGX platform software on supported DGX systems running Red Hat Enterprise Linux / Rocky Linux 8 or 9; broader Kubernetes or Slurm cluster support on RHEL still requires site-specific validation.

### Virtual

132 changes: 132 additions & 0 deletions docs/deepops/dgx-software-stack.md
@@ -0,0 +1,132 @@
# DGX Software Stack Role

The `nvidia-dgx` role installs NVIDIA DGX platform software on supported DGX
systems after a base operating system has been installed.

This role is intended for DGX hardware only. It checks the system product name
and stops on non-DGX systems.
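
As a quick pre-flight check, the product name the role inspects can be previewed before running it. The playbook below is a minimal sketch and is not shipped with DeepOps; the `all` host pattern is an assumption, so adjust it to your inventory.

```yaml
# Illustrative pre-flight playbook (not part of DeepOps).
- name: Preview the product name checked by the nvidia-dgx role
  hosts: all            # assumption: replace with your DGX host group
  gather_facts: true    # ansible_product_name comes from gathered facts
  tasks:
    - name: Show the product name and whether it would pass the DGX check
      debug:
        msg: "{{ ansible_product_name }} (DGX match: {{ ansible_product_name is search('DGX') }})"
```

Hosts that report `DGX match: False` would be stopped by the role's own check.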

## Supported Paths

The role selects a DGX software path from the base operating system:

| Base OS | DGX software path | Notes |
| ------- | ----------------- | ----- |
| Ubuntu 18.04 | DGX OS 4 legacy packages | Existing legacy role path. |
| Ubuntu 20.04 | DGX OS 5 legacy packages | Existing legacy role path. |
| Ubuntu 22.04 | DGX OS 6 software stack | Uses the official DGX OS 6 repository archive and system-specific packages. |
| Ubuntu 24.04 | DGX OS 7 software stack | Uses the official DGX OS 7 repository archive and unified `nvidia-system-*` packages. |
| Red Hat Enterprise Linux 7 | Legacy DGX EL7 packages | Existing legacy role path. |
| Red Hat Enterprise Linux 8 / Rocky Linux 8 | DGX Software for RHEL 8 | Uses the official NVIDIA repository setup RPM and DGX configuration groups. |
| Red Hat Enterprise Linux 9 / Rocky Linux 9 | DGX Software for RHEL 9 | Uses the official NVIDIA repository setup RPM and DGX configuration groups. |

The EL8 work addresses GitHub issue
[#1120](https://github.com/NVIDIA/deepops/issues/1120).

## Official References

- [Installing DGX Software on Ubuntu](https://docs.nvidia.com/dgx/dgx-os-6-user-guide/installing_on_ubuntu.html)
- [Customizing Ubuntu Installation with DGX Software](https://docs.nvidia.com/dgx/dgx-os-7-user-guide/installing_on_ubuntu.html)
- [DGX Software for Red Hat Enterprise Linux 8 Installation Guide](https://docs.nvidia.com/dgx/dgx-rhel8-install-guide/index.html)
- [DGX Software for Red Hat Enterprise Linux 8 Release Notes](https://docs.nvidia.com/dgx/dgx-rhel8-sw-release-notes/index.html)
- [DGX Software for Red Hat Enterprise Linux 9 User Guide](https://docs.nvidia.com/dgx/dgx-el9-user-guide/index.html)

## Ubuntu 22.04 / DGX OS 6

The role follows the DGX OS 6 guide:

1. Install the DGX repository files from
`https://repo.download.nvidia.com/baseos/ubuntu/jammy/dgx-repo-files.tgz`.
2. Install the system-specific DGX configuration and tools packages.
3. Install `linux-tools-nvidia` and `nvidia-peermem-loader`.
4. Optionally install the NVIDIA driver, Docker/NVIDIA Container Toolkit, NVSM,
serial-over-LAN, logrotate, and additional DGX OS administration/development
packages.

The default driver branch is `550`, matching the DGX OS 6 examples. Override it
when needed:

```yaml
dgx_os6_driver_branch: "580"
```

Disruptive package upgrades are opt-in:

```yaml
dgx_os6_upgrade_packages: true
```

## Red Hat Enterprise Linux 8 and 9

The role follows the official Red Hat DGX software guides:

1. Optionally enable the required Red Hat subscription repositories on RHEL.
This is skipped automatically on Rocky Linux.
2. Install the NVIDIA DGX repository setup RPM for EL8 or EL9.
3. Install the DGX configuration group for the detected DGX platform.
4. Optionally install the NVIDIA driver module and support packages.
5. Optionally install Docker CE and the NVIDIA Container Runtime group.
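
Step 3 depends on a platform lookup list, `dgx_redhat_platforms`, which the role's `redhat-el8-plus.yml` tasks iterate over. The entry below is a hedged sketch of what one item might look like. The field names are the ones those tasks reference; every value (the match pattern, group name, and flags) is an assumption for illustration and should be checked against the role defaults and the official guides.

```yaml
# Hypothetical entry: field names come from the role's tasks, values are assumptions.
dgx_redhat_platforms:
  - match: "DGXA100"                      # regex searched for in ansible_product_name
    supported_major_versions: ["8", "9"]  # compared with ansible_distribution_major_version (a string)
    configuration_group: "DGX A100 Configuration"  # installed as the DNF group "@<name>"
    nvswitch: true    # selects the 'fm' driver profile plus Fabric Manager packages and services
    nvlink5: false    # when true, the NVLink 5 package list is added
    station: false    # when true, the DGX Station package list is added
```

If no entry matches both the product name and the major version, the role fails with an "Unsupported DGX model" message rather than guessing a package set.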

The default driver stream uses DKMS on EL8 so current EL8 minor kernels can
build a matching NVIDIA kernel module: `525-dkms` on most EL8 systems,
`535-dkms` on EL8 DGX H100, and `580` on EL9. EL9 NVSwitch systems install the
open-kernel-module stream by default. Override the branch when a validated DGX
release note calls for another stream:

```yaml
dgx_redhat_driver_branch: "580"
```
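
To see which DNF module spec a given branch produces, the composition used by the role's tasks (`nvidia-driver:<stream>/<profile>`, with `-open` appended for open kernel modules and the `fm` profile on NVSwitch systems) can be previewed locally. This is a simplified sketch: `platform_has_nvswitch` is a hypothetical stand-in for the detected platform's `nvswitch` flag, and the extra `src` profile that the role adds on EL8 for non-DKMS, non-open streams is not modelled.

```yaml
# Illustrative only: previews the module spec without touching any packages.
- name: Preview the DNF module spec for a driver branch
  hosts: localhost
  gather_facts: false
  vars:
    dgx_redhat_driver_branch: "535-dkms"
    dgx_redhat_use_open_kernel_modules: false
    platform_has_nvswitch: true   # hypothetical stand-in for dgx_redhat_platform.nvswitch
  tasks:
    - name: Show the composed module spec
      debug:
        msg: >-
          nvidia-driver:{{ dgx_redhat_driver_branch }}{{ '-open' if dgx_redhat_use_open_kernel_modules else '' }}/{{ 'fm' if platform_has_nvswitch else 'default' }}
```

With the values shown, the message is `nvidia-driver:535-dkms/fm`.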

RHEL subscription repository management is enabled by default only when
`ansible_distribution == 'RedHat'`. Disable it if subscriptions are managed
outside DeepOps:

```yaml
dgx_redhat_manage_subscription_repos: false
```

Disruptive `dnf update --nobest` behavior is opt-in:

```yaml
dgx_redhat_upgrade_packages: true
```
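
The toggles above can be collected in one variables file. The sketch below assumes a hypothetical `group_vars/dgx_rhel.yml` file and host group; the variable names are the ones the role's Red Hat tasks reference, while the values are example choices rather than documented defaults.

```yaml
# group_vars/dgx_rhel.yml (hypothetical file and group name)
dgx_redhat_manage_subscription_repos: false  # subscriptions handled outside DeepOps
dgx_redhat_upgrade_packages: false           # keep the disruptive full upgrade off
dgx_redhat_install_driver: true              # install the NVIDIA driver module and support packages
dgx_redhat_install_container_runtime: true   # install the NVIDIA Container Runtime group
dgx_redhat_install_docker_ce: true           # also install docker-ce (only used with the line above)
dgx_redhat_install_cachefilesd: false        # skip the optional nvidia-conf-cachefilesd package
```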

## Ubuntu 24.04 / DGX OS 7

The role follows the DGX OS 7 guide:

1. Install the architecture-specific DGX OS 7 repository archive from
`https://repo.download.nvidia.com/baseos/ubuntu/noble/`.
2. Install the unified DGX OS 7 metapackages: `nvidia-system-core`,
`nvidia-system-utils`, and `nvidia-system-extra`.
3. Install `nvidia-system-station` for DGX Station and DGX Spark systems.
4. Install kernel tools and `nvidia-peermem-loader`.
5. Optionally install the Release 580 open GPU kernel module driver packages,
including Fabric Manager, NVLSM/NVSDM, or IMEX packages for the DGX platform
that requires them.

Disruptive package upgrades are opt-in:

```yaml
dgx_os7_upgrade_packages: true
```

## Validation

Full validation requires real DGX hardware and access to NVIDIA/OS package
repositories. At minimum, run syntax validation before opening a PR:

```bash
ansible-playbook --syntax-check playbooks/nvidia-dgx/nvidia-dgx.yml
```

On hardware, validate the role with the target OS and DGX model, reboot if the
driver was installed, then verify:

```bash
nvidia-smi
sudo docker run --gpus=all --rm nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```

Use the RHEL UBI CUDA image from the official guide when validating the RHEL
path.
4 changes: 2 additions & 2 deletions docs/deepops/testing.md
@@ -53,8 +53,8 @@ A short description of the nightly testing is outlined below. The full suit of t
| Ubuntu 20.04 | | x | x | |
| CentOS 7 | | x | x | |
| CentOS | | | x | |
-| DGX OS | | | | No automated testing support |
-| RHEL | | | | No testing support |
+| DGX OS | | | | Syntax-checked only; full validation requires DGX hardware |
+| RHEL | | | | DGX software-stack role syntax-checked only; full validation requires DGX hardware and subscriptions |
| 1 mgmt node | x | x | | |
| 3 mgmt nodes | | | x | |
| 1 gpu node | x | x | | |
4 changes: 2 additions & 2 deletions roles/nvidia-dgx/tasks/main.yml
@@ -4,12 +4,12 @@
     msg: "Role supports DGX systems only"
   when: ansible_product_name is not search("DGX")
 
-- name: Ubuntu tasks for DGX OS 4/5
+- name: Ubuntu tasks for DGX Software Stack
   include_tasks: ubuntu.yml
   when:
     - ansible_distribution == 'Ubuntu'
 
-- name: redhat family tasks
+- name: Red Hat family tasks for DGX Software Stack
   include_tasks: redhat.yml
   when: ansible_os_family == 'RedHat'
 
147 changes: 147 additions & 0 deletions roles/nvidia-dgx/tasks/redhat-el8-plus.yml
@@ -0,0 +1,147 @@
---
- name: Determine Red Hat DGX platform package set
  set_fact:
    dgx_redhat_platform: "{{ item }}"
  loop: "{{ dgx_redhat_platforms }}"
  when:
    - ansible_product_name is search(item.match)
    - ansible_distribution_major_version in item.supported_major_versions

- name: Fail if Red Hat DGX platform package set is unknown
  fail:
    msg: "Unsupported DGX model for EL{{ ansible_distribution_major_version }} DGX role path: {{ ansible_product_name }}"
  when: dgx_redhat_platform is not defined

- name: Enable Red Hat subscription repositories for DGX Software Stack
  command: "subscription-manager repos --enable={{ item }}"
  loop: "{{ dgx_redhat_subscription_repos[ansible_distribution_major_version] }}"
  changed_when: false
  when:
    - dgx_redhat_manage_subscription_repos
    - ansible_distribution == 'RedHat'

- name: Enable Rocky Linux CRB repository for DGX Software Stack
  command: dnf config-manager --set-enabled crb
  changed_when: false
  when:
    - ansible_distribution == 'Rocky'
    - ansible_distribution_major_version == '9'

- name: Install NVIDIA DGX repository setup package
  dnf:
    name: "{{ dgx_redhat_repo_setup_rpms[ansible_distribution_major_version] }}"
    state: present
    disable_gpg_check: yes

- name: Upgrade Red Hat DGX Software Stack packages
  dnf:
    name: "*"
    state: latest
    nobest: yes
  when: dgx_redhat_upgrade_packages
  tags: skip_ansible_lint

- name: Install kernel development packages for DGX driver builds
  dnf:
    name:
      - "kernel-devel-{{ ansible_kernel }}"
      - "kernel-headers-{{ ansible_kernel }}"
    state: present
  when: ansible_distribution_major_version == '9' or 'dkms' in dgx_redhat_driver_branch

- name: Install Red Hat DGX configuration group
  dnf:
    name: "@{{ dgx_redhat_platform.configuration_group }}"
    state: present

- name: Configure Red Hat DGX driver facts
  set_fact:
    dgx_redhat_driver_profile: "{{ 'fm' if dgx_redhat_platform.nvswitch else 'default' }}"
    dgx_redhat_driver_stream: "{{ dgx_redhat_driver_branch }}{{ '-open' if dgx_redhat_use_open_kernel_modules else '' }}"

- name: Configure Red Hat DGX driver profile list
  set_fact:
    dgx_redhat_driver_profiles: >-
      {{
        [dgx_redhat_driver_profile, 'src']
        if ansible_distribution_major_version == '8'
        and not dgx_redhat_use_open_kernel_modules
        and 'dkms' not in dgx_redhat_driver_stream
        else [dgx_redhat_driver_profile]
      }}

- name: Configure Red Hat DGX driver module specifications
  set_fact:
    dgx_redhat_driver_module_specs: "{{ dgx_redhat_driver_profiles | map('regex_replace', '^(.*)$', 'nvidia-driver:' ~ dgx_redhat_driver_stream ~ '/\\1') | list }}"

- name: Remove precompiled NVIDIA kmod headers before DKMS driver install
  dnf:
    name: nvidia-kmod-headers
    state: absent
  when:
    - dgx_redhat_install_driver
    - "'dkms' in dgx_redhat_driver_stream"

- name: Install Red Hat DGX NVIDIA driver module
  command: "dnf module install --nobest -y {{ item }}"
  loop: "{{ dgx_redhat_driver_module_specs }}"
  register: dgx_redhat_driver_module_install
  changed_when: "'Nothing to do' not in dgx_redhat_driver_module_install.stdout"
  notify: reboot after driver install
  when: dgx_redhat_install_driver

- name: Mark Red Hat DGX driver install state for reboot handler
  set_fact:
    install_driver:
      changed: "{{ dgx_redhat_driver_module_install.changed | default(false) }}"
  when: dgx_redhat_install_driver

- name: Install Red Hat DGX driver support packages
  dnf:
    name: "{{ dgx_redhat_driver_support_packages + (dgx_redhat_fabricmanager_packages if dgx_redhat_platform.nvswitch else dgx_redhat_non_nvswitch_packages) + (dgx_redhat_nvlink5_packages if dgx_redhat_platform.nvlink5 else []) }}"
    state: present
  when: dgx_redhat_install_driver

- name: Install Red Hat DGX Station extra driver packages
  dnf:
    name: "{{ dgx_redhat_station_packages }}"
    state: present
  when:
    - dgx_redhat_install_driver
    - dgx_redhat_platform.station

- name: Install Red Hat DGX Docker CE
  dnf:
    name: docker-ce
    state: present
    allowerasing: yes
  when:
    - dgx_redhat_install_container_runtime
    - dgx_redhat_install_docker_ce

- name: Install Red Hat DGX NVIDIA Container Runtime group
  dnf:
    name: "@NVIDIA Container Runtime"
    state: present
    allowerasing: yes
  notify:
    - restart docker
  when: dgx_redhat_install_container_runtime

- name: Install Red Hat DGX optional cachefilesd configuration
  dnf:
    name: nvidia-conf-cachefilesd
    state: present
  when: dgx_redhat_install_cachefilesd

- name: Populate Red Hat DGX service facts
  service_facts:

- name: Enable available Red Hat DGX services
  systemd:
    name: "{{ item }}"
    state: started
    enabled: yes
    daemon_reload: yes
  loop: "{{ dgx_redhat_services + (dgx_redhat_fabricmanager_services if dgx_redhat_platform.nvswitch else []) }}"
  when: item in ansible_facts.services