Merged
108 changes: 58 additions & 50 deletions docs/airgap/mirror-apt-repos.md
@@ -15,7 +15,7 @@ Set up offline repositoriy mirrors for Aptitude
- [NVIDIA CUDA repository](#nvidia-cuda-repository)
- [APT Configuration](#apt-configuration-1)
- [GPG Key Validation](#gpg-key-validation-1)
- [nvidia-docker](#nvidia-docker)
- [NVIDIA Container Toolkit](#nvidia-container-toolkit)
- [APT Configuration](#apt-configuration-2)
- [GPG Key Validation](#gpg-key-validation-2)
- [Additional DEB packages](#additional-deb-packages)
@@ -57,7 +57,7 @@ deb http://archive.ubuntu.com/ubuntu/ <release-name>-updates main multiverse uni
```

where `<release-name>` is the name of the Ubuntu release you want to mirror.
This is `bionic` for Ubuntu 18.04, and `focal` for Ubuntu 20.04.
This is `bionic` for Ubuntu 18.04, `focal` for Ubuntu 20.04, `jammy` for Ubuntu 22.04, and `noble` for Ubuntu 24.04.
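
As a rough illustration of the mapping above, a small shell helper could translate a version number into the codename used in the `deb` lines. The function name `ubuntu_codename` is hypothetical and not part of DeepOps; the codenames are the ones listed above.

```shell
# Hypothetical helper (not part of DeepOps): map an Ubuntu version to the
# release codename used in the APT configuration lines.
ubuntu_codename() {
  case "$1" in
    18.04) echo "bionic" ;;
    20.04) echo "focal" ;;
    22.04) echo "jammy" ;;
    24.04) echo "noble" ;;
    *)
      echo "unsupported Ubuntu version: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `ubuntu_codename 24.04` prints `noble`, which can then be substituted for `<release-name>`.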

### Docker repository

@@ -74,7 +74,7 @@ https://download.docker.com/linux/ubuntu/gpg
```

where `<release-name>` is the name of the Ubuntu release you want to mirror.
This is `bionic` for Ubuntu 18.04, and `focal` for Ubuntu 20.04.
This is `bionic` for Ubuntu 18.04, `focal` for Ubuntu 20.04, `jammy` for Ubuntu 22.04, and `noble` for Ubuntu 24.04.

### NVIDIA CUDA repository

@@ -92,6 +92,18 @@ deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804
deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004
```

**Ubuntu 22.04**

```bash
deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 /
```

**Ubuntu 24.04**

```bash
deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 /
```

#### GPG Key Validation

**Ubuntu 18.04**
@@ -106,30 +118,30 @@ https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2a
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
```

### nvidia-docker
**Ubuntu 22.04**

#### APT Configuration
```bash
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
```

**Ubuntu 18.04**
**Ubuntu 24.04**

```bash
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH)
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/3bf863cc.pub
```
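
The CUDA repository lines and key URLs follow a regular pattern, so they could be generated rather than typed by hand. The sketch below assumes the pattern and key file names shown in this document; the function names are hypothetical, and Ubuntu 18.04 is left out of the key helper because its key file name is truncated in this view. Verify the key file names against the repository before relying on them.

```shell
# Hypothetical helpers (not part of DeepOps). Values are taken from the
# lists above and are assumptions, not checked against the live repository.
cuda_base="https://developer.download.nvidia.com/compute/cuda/repos"

# Print the APT line for a version, e.g. "24.04" -> .../ubuntu2404/x86_64 /
cuda_repo_line() {
  tag="ubuntu$(printf '%s' "$1" | tr -d '.')"
  echo "deb ${cuda_base}/${tag}/x86_64 /"
}

# Print the signing-key URL for a version, using the key names listed above.
cuda_key_url() {
  tag="ubuntu$(printf '%s' "$1" | tr -d '.')"
  case "$1" in
    20.04) key="7fa2af80.pub" ;;
    22.04|24.04) key="3bf863cc.pub" ;;
    *)
      echo "no key listed for Ubuntu $1" >&2
      return 1
      ;;
  esac
  echo "${cuda_base}/${tag}/x86_64/${key}"
}
```

The output of `cuda_repo_line 24.04` matches the Ubuntu 24.04 line in the APT Configuration list.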

**Ubuntu 20.04**
### NVIDIA Container Toolkit

#### APT Configuration

```bash
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH)
deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
```

#### GPG Key Validation

```bash
https://nvidia.github.io/nvidia-docker/gpgkey
https://nvidia.github.io/libnvidia-container/gpgkey
```
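
The `$(ARCH)` placeholder stands for the Debian architecture name (for example `amd64`). A minimal sketch of filling it in is shown here; the helper name `toolkit_repo_line` is hypothetical, and falling back to `dpkg --print-architecture` assumes the command runs on a Debian-based host with the same architecture as the target machines.

```shell
# Hypothetical helper (not part of DeepOps): print the NVIDIA Container
# Toolkit APT line for an architecture. With no argument, fall back to the
# architecture reported by dpkg on the current host.
toolkit_repo_line() {
  arch="${1:-$(dpkg --print-architecture)}"
  echo "deb https://nvidia.github.io/libnvidia-container/stable/deb/${arch} /"
}
```

Calling `toolkit_repo_line amd64` yields the same line used in the `mirror.list` examples in this document.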

### Additional DEB packages
@@ -161,7 +173,7 @@ After installing `apt-mirror`, edit the `/etc/apt/mirror.list` file make the fol
- Set the `base_path` to the desired download path for your mirror (here, `/var/repos`)
- Add a list of APT configuration lines for each repo you wish to mirror

For example, if we just want to mirror the Docker and NVIDIA Docker repositories, this configuration would work:
For example, if we just want to mirror the Docker and NVIDIA Container Toolkit repositories, this configuration would work:

```
############# config ##################
@@ -172,11 +184,11 @@ set _tilde 0
#
############# end config ##############

deb https://download.docker.com/linux/ubuntu bionic stable
deb https://nvidia.github.io/nvidia-docker/ubuntu20.04/amd64 /
deb https://download.docker.com/linux/ubuntu noble stable
deb https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /
```

The full mirror.list file for Deepops:
The full mirror.list file for DeepOps:

```
############# config ##################
@@ -195,29 +207,27 @@ set _tilde 0
#
############# end config ##############

deb http://archive.ubuntu.com/ubuntu focal main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-security main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-proposed main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu focal-backports main restricted universe multiverse
deb http://ppa.launchpad.net/maas/2.9/ubuntu focal main
deb http://archive.canonical.com/ubuntu focal partner
deb http://archive.ubuntu.com/ubuntu noble main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-security main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-updates main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-proposed main restricted universe multiverse
deb http://archive.ubuntu.com/ubuntu noble-backports main restricted universe multiverse
deb http://ppa.launchpad.net/maas/3.5/ubuntu noble main
deb http://archive.canonical.com/ubuntu noble partner

deb-src http://archive.ubuntu.com/ubuntu focal main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-security main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-updates main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-proposed main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu focal-backports main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu noble main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu noble-security main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu noble-updates main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu noble-proposed main restricted universe multiverse
deb-src http://archive.ubuntu.com/ubuntu noble-backports main restricted universe multiverse

deb https://download.docker.com/linux/ubuntu focal stable
deb https://nvidia.github.io/nvidia-docker/ubuntu20.04/amd64 /
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu20.04/amd64 /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu20.04/amd64 /
deb https://download.docker.com/linux/ubuntu noble stable
deb https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /

deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 /
deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 /

deb http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal common dgx
deb http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal-updates common dgx
deb http://repo.download.nvidia.com/baseos/ubuntu/noble/x86_64/ noble common dgx
deb http://repo.download.nvidia.com/baseos/ubuntu/noble/x86_64/ noble-updates common dgx

clean http://archive.ubuntu.com/ubuntu
clean https://download.docker.com
@@ -266,10 +276,10 @@ sudo mkdir /var/www/html/repos

Then, from the extracted mirror directory,
copy the directories for each repository into the web root.
For example, assuming the extracted mirror directory is `/var/repos` and the repository is `nvidia-docker`:
For example, assuming the extracted mirror directory is `/var/repos` and the repository is `libnvidia-container`:

```bash
sudo cp -r /var/repos/mirror/nvidia.github.com/nvidia-docker/ /var/www/html/repos/nvidia-docker/
sudo cp -r /var/repos/mirror/nvidia.github.io/libnvidia-container/ /var/www/html/repos/libnvidia-container/
```
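
When several repositories need to be published, the copy step can be wrapped in a small function. This is only a sketch: `publish_repo` is a hypothetical name, the layout `<base_path>/mirror/<host>/<path>` is assumed from the example above, and the function must run with enough privileges to write to the web root (for example from a root shell).

```shell
# Hypothetical helper (not part of DeepOps): copy one mirrored repository
# into the web root.
#   $1 = apt-mirror base path, e.g. /var/repos
#   $2 = mirrored host and path, e.g. nvidia.github.io/libnvidia-container
#   $3 = web root for repositories, e.g. /var/www/html/repos
publish_repo() {
  src="$1/mirror/$2"
  dest="$3/$(basename "$2")"
  if [ ! -d "$src" ]; then
    echo "mirror directory not found: $src" >&2
    return 1
  fi
  mkdir -p "$dest"
  # Copying "$src/." copies the directory contents, so a rerun refreshes
  # the destination instead of nesting a second copy inside it.
  cp -r "$src/." "$dest/"
}
```

With the paths from the example above, the call would be `publish_repo /var/repos nvidia.github.io/libnvidia-container /var/www/html/repos`.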

At this point, the downloaded package repositories should be available on your offline network via the package server.
@@ -278,29 +288,27 @@ You can then add these downloaded repos to the `/etc/apt/sources.list` configura
Lines added to `/etc/apt/sources.list`:

```
deb http://repo-server/ubuntu focal main restricted universe multiverse
deb http://repo-server/ubuntu focal-updates main restricted universe multiverse
deb http://repo-server/ubuntu focal-backports main restricted universe multiverse
deb http://repo-server/ubuntu focal-security main restricted universe multiverse
deb http://repo-server/ubuntu noble main restricted universe multiverse
deb http://repo-server/ubuntu noble-updates main restricted universe multiverse
deb http://repo-server/ubuntu noble-backports main restricted universe multiverse
deb http://repo-server/ubuntu noble-security main restricted universe multiverse
```
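
The four lines above differ only in the suite name, so they could be generated for any release. The following sketch assumes the mirror is served at `http://<server>/ubuntu` as in the example; `offline_ubuntu_sources` is a hypothetical name.

```shell
# Hypothetical helper (not part of DeepOps): print the client-side APT
# lines for an offline Ubuntu mirror.
#   $1 = host name of the package server, e.g. repo-server
#   $2 = release codename, e.g. noble
offline_ubuntu_sources() {
  server="$1"
  release="$2"
  for suite in "$release" "$release-updates" "$release-backports" "$release-security"; do
    echo "deb http://${server}/ubuntu ${suite} main restricted universe multiverse"
  done
}
```

Running `offline_ubuntu_sources repo-server noble` prints the same four lines, which can be reviewed before being written to `/etc/apt/sources.list`.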

Lines added to `/etc/apt/sources.list.d/dgx.list`:

```
deb http://repo-server/baseos/ubuntu/focal/x86_64/ focal common dgx
deb http://repo-server/baseos/ubuntu/focal/x86_64/ focal-updates common dgx
deb http://repo-server/baseos/ubuntu/noble/x86_64/ noble common dgx
deb http://repo-server/baseos/ubuntu/noble/x86_64/ noble-updates common dgx
```

Lines added to `/etc/apt/sources.list.d/cuda-compute-repo.list`:

```
deb http://repo-server/cuda/repos/ubuntu2004/x86_64/ /
deb http://repo-server/cuda/repos/ubuntu2404/x86_64/ /
```

Lines add to `/etc/apt/sources.list.d/nvidia-docker.list`:
Lines added to `/etc/apt/sources.list.d/nvidia-container-toolkit.list`:

```
deb [trusted=yes] http://repo-server/libnvidia-container/stable/ubuntu20.04/amd64 /
deb [trusted=yes] http://repo-server/nvidia-container-runtime/stable/ubuntu20.04/amd64 /
deb [trusted=yes] http://repo-server/nvidia-docker/ubuntu20.04/amd64 /
deb [trusted=yes] http://repo-server/libnvidia-container/stable/deb/amd64 /
```
6 changes: 3 additions & 3 deletions docs/airgap/ngc-ready.md
@@ -24,7 +24,7 @@ The following Apt repositories will need to be mirrored in the offline environme

- Ubuntu distribution repositories
- Docker CE repository
- nvidia-docker repositories
- NVIDIA container runtime repositories

For instructions on mirroring these repositories, see the [doc on Apt mirrors](./mirror-apt-repos.md).

@@ -49,7 +49,7 @@ The following RPM repositories will need to be mirrored in the offline environme

- Enterprise Linux distribution repositories (RHEL or CentOS, depending on your distro)
- Docker CE repository
- nvidia-docker repositories
- NVIDIA container runtime repositories

For instructions on mirroring these repositories, see the [doc on RPM mirrors](./mirror-rpm-repos.md).

@@ -123,7 +123,7 @@ In all cases, you should edit the URLs appropriately to ensure they can download

### Configure DeepOps to use your mirrors for non-distribution package repositories

The NGC-Ready playbook depends on the Docker CE and nvidia-docker package repositories.
The NGC-Ready playbook depends on the Docker CE and NVIDIA container runtime package repositories.
DeepOps sets up these repositories automatically during the installation.

To configure alternate URLs for these repositories, set the following variables in your DeepOps configuration:
11 changes: 10 additions & 1 deletion docs/deepops/testing.md
@@ -51,6 +51,8 @@ A short description of the nightly testing is outlined below. The full suit of t
| --------------------------------------------------- | ----------------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------------------- | ------------------------------------ |
| Ubuntu 18.04 | x | x | x | |
| Ubuntu 20.04 | | x | x | |
| Ubuntu 22.04 | | | | setup.sh and Molecule GitHub Actions |
| Ubuntu 24.04 | | | | setup.sh and Molecule GitHub Actions |
| CentOS 7 | | x | x | |
| CentOS | | | x | |
| DGX OS | | | | Syntax-checked only; full validation requires DGX hardware |
@@ -118,7 +120,8 @@ molecule init scenario -r <your-role> --driver-name docker
```

4. In the file `molecule/default/molecule.yml`, define the list of platforms to be tested.
DeepOps currently supports operating systems based on Ubuntu 18.04, Ubuntu 20.04, EL7, and EL8.
DeepOps currently supports operating systems based on Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04, EL7, and EL8.
The DGX software stack role also supports Red Hat Enterprise Linux / Rocky Linux 8 and 9 for DGX platform software installation.
To test these stacks, the following `platforms` stanza can be used.

```yaml
@@ -129,6 +132,12 @@ platforms:
  - name: ubuntu-2004
    image: geerlingguy/docker-ubuntu2004-ansible
    pre_build_image: true
  - name: ubuntu-2204
    image: geerlingguy/docker-ubuntu2204-ansible
    pre_build_image: true
  - name: ubuntu-2404
    image: geerlingguy/docker-ubuntu2404-ansible
    pre_build_image: true
  - name: centos-7
    image: geerlingguy/docker-centos7-ansible
    pre_build_image: true
2 changes: 2 additions & 0 deletions docs/ngc-ready/README.md
@@ -15,6 +15,8 @@ These instructions assume the following:
- You have a NGC-Ready server. To determine if your server is NGC-Ready, please review the list of validated servers at the NGC-Ready Server documentation page - https://docs.nvidia.com/certification-programs/ngc-ready-systems/index.html
- Your NGC-Ready Server has a compatible Linux distribution installed:
- Ubuntu Server 20.04 LTS
- Ubuntu Server 22.04 LTS
- Ubuntu Server 24.04 LTS
- CentOS 7

## Setup
2 changes: 1 addition & 1 deletion docs/slurm-cluster/slurm-single-node.md
@@ -15,7 +15,7 @@ Single Node Slurm Deployment Guide

## Introduction

The general requirements and procedure for Slurm setup via deepops is documented in the [README.md](README.md) for the slurm-cluster. The instructions below outline the steps to deviate from the general setup to enable single node DeepOps Slurm setup. The machine on which Slurm is being deployed should be up to date in a stable state with GPU drivers already installed and functional. The supported Operating Systems are Ubuntu (version 18 and 20), CentOS and RHEL (version 7 and 8 albeit version 8 is preferred).
The general requirements and procedure for Slurm setup via DeepOps are documented in the [README.md](README.md) for the slurm-cluster. The instructions below outline where to deviate from the general setup to enable a single-node DeepOps Slurm setup. The machine on which Slurm is being deployed should be up to date and in a stable state, with GPU drivers already installed and functional. The supported operating systems are Ubuntu 18.04, 20.04, 22.04, and 24.04; CentOS 7 and 8; and RHEL 7 and 8, with RHEL 8 preferred among the RHEL paths.

## Deployment Procedure

9 changes: 9 additions & 0 deletions playbooks/container/nvidia-docker.yml
@@ -16,10 +16,19 @@
        state: absent
      when: docker_install | default('yes')

    - name: install NVIDIA Container Toolkit on Ubuntu 24.04 and newer
      include_role:
        name: nvidia_container_toolkit
      when:
        - ansible_local['gpus']['count'] and ansible_distribution == "Ubuntu"
        - ansible_distribution_version is version('24.04', '>=')
        - docker_install | default('yes')

    - name: install nvidia-docker
      include_role:
        name: nvidia.nvidia_docker
      when:
        - ansible_local['gpus']['count'] and (ansible_distribution == "Ubuntu" or ansible_os_family == "RedHat")
        - not (ansible_distribution == "Ubuntu" and ansible_distribution_version is version('24.04', '>='))
        - docker_install | default('yes')
      environment: "{{ proxy_env if proxy_env is defined else {} }}"
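
To make the split between the two tasks easier to review, the distribution and version part of the `when` conditions can be modelled as a small shell function. This is a simplified sketch, not the playbook logic itself: the GPU-count and `docker_install` checks are left out, the function name is hypothetical, and the integer comparison assumes Ubuntu versions in exactly `YY.MM` form.

```shell
# Simplified model of the role selection (hypothetical, for review only).
#   $1 = distribution, e.g. Ubuntu
#   $2 = distribution version, e.g. 24.04
#   $3 = OS family, e.g. Debian or RedHat
# Prints the role that would be included, or returns 1 if neither applies.
container_role_for() {
  if [ "$1" = "Ubuntu" ]; then
    # Assumes a YY.MM version, so "24.04" becomes the integer 2404.
    num=$(printf '%s' "$2" | tr -d '.')
    if [ "$num" -ge 2404 ]; then
      echo "nvidia_container_toolkit"
    else
      echo "nvidia.nvidia_docker"
    fi
  elif [ "$3" = "RedHat" ]; then
    echo "nvidia.nvidia_docker"
  else
    return 1
  fi
}
```

Under these assumptions, Ubuntu 24.04 and newer select `nvidia_container_toolkit`, while older Ubuntu releases and RedHat-family hosts keep `nvidia.nvidia_docker`, so exactly one of the two roles applies to any supported host.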
10 changes: 10 additions & 0 deletions roles/nvidia_container_toolkit/defaults/main.yml
@@ -0,0 +1,10 @@
---
nvidia_container_toolkit_repo_base_url: "https://nvidia.github.io/libnvidia-container"
nvidia_container_toolkit_repo_gpg_url: "{{ nvidia_container_toolkit_repo_base_url }}/gpgkey"
nvidia_container_toolkit_keyring_ascii_path: "/usr/share/keyrings/nvidia-container-toolkit-keyring.asc"
nvidia_container_toolkit_keyring_path: "/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg"
nvidia_container_toolkit_apt_source_path: "/etc/apt/sources.list.d/nvidia-container-toolkit.list"
nvidia_container_toolkit_package: "nvidia-container-toolkit"
nvidia_container_toolkit_configure_docker: true
nvidia_container_toolkit_set_as_default_runtime: true
nvidia_container_toolkit_restart_docker: true
6 changes: 6 additions & 0 deletions roles/nvidia_container_toolkit/handlers/main.yml
@@ -0,0 +1,6 @@
---
- name: restart docker
  ansible.builtin.service:
    name: docker
    state: restarted
  when: nvidia_container_toolkit_restart_docker | bool