> Author: 丁辉

# Deploying the NVIDIA-K8s-Device-Plugin with Helm

## Introduction

**NVIDIA-K8s-Device-Plugin is a plugin that manages and configures NVIDIA GPU devices in a Kubernetes environment.** It lets containerized applications in the cluster access the GPUs, so they can use GPU compute for high-performance workloads.

## Preparing the GPU container environment (required)

[See this document](https://gitee.com/offends/Kubernetes/blob/main/GPU/%E5%AE%B9%E5%99%A8%E4%BD%BF%E7%94%A8GPU.md)

## Deployment

[GitHub repository](https://github.com/NVIDIA/k8s-device-plugin)

1. Add the Helm repository

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
```

2. Label the GPU nodes

```bash
kubectl label nodes ${node} nvidia.com/gpu.present=true
```

3. Deploy the plugin

```bash
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace
```

4. Check that the node now reports NVIDIA GPU resources

```bash
kubectl describe node ${node} | grep nvidia
```

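The `grep` in step 4 prints every line that mentions `nvidia`, including labels. The following is a minimal sketch for extracting only the GPU counts; it assumes the usual `kubectl describe node` layout, where the Capacity and Allocatable sections list the resource as `nvidia.com/gpu:  <count>`. The helper name `gpu_counts` is hypothetical.

```bash
# Print the nvidia.com/gpu counts found in `kubectl describe node` output.
# Only lines whose first field is exactly "nvidia.com/gpu:" are matched,
# i.e. the Capacity and Allocatable entries.
gpu_counts() {
  awk '$1 == "nvidia.com/gpu:" { print $2 }'
}

# Usage:
#   kubectl describe node "${node}" | gpu_counts
# Or query the allocatable value directly from the API:
#   kubectl get node "${node}" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

An empty result suggests that the plugin has not registered any GPUs on that node yet.
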
## Uninstall

Uninstall nvidia-device-plugin

```bash
helm uninstall nvidia-device-plugin -n nvidia-device-plugin
```

## Testing the result

1. Deploy a test container

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
```

2. Check the logs

```bash
kubectl logs gpu-pod
```

> Logs like the following mean the Pod can use GPU resources
>
> ```bash
> [Vector addition of 50000 elements]
> Copy input data from the host memory to the CUDA device
> CUDA kernel launch with 196 blocks of 256 threads
> Copy output data from the CUDA device to the host memory
> Test PASSED
> Done
> ```

3. Clean up the test Pod

```bash
kubectl delete pod gpu-pod
```

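The log check in step 2 can also be scripted instead of read by eye. This is a minimal sketch that assumes the sample image prints the `Test PASSED` line shown above; the helper name `gpu_test_passed` is hypothetical.

```bash
# Succeed only if the vectorAdd sample reported success on stdin.
gpu_test_passed() {
  grep -q 'Test PASSED'
}

# Usage (run before deleting the Pod):
#   kubectl logs gpu-pod | gpu_test_passed && echo "GPU is usable"
```
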
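# Shared access to GPUs

[Official documentation](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#shared-access-to-gpus)

The NVIDIA device plugin allows GPUs to be oversubscribed through a set of extended options in its configuration file. Two sharing modes are available: time-slicing and MPS.

Note: time-slicing and MPS are mutually exclusive.

- With time-slicing, CUDA time-slicing lets workloads that share a GPU interleave with each other. However, nothing special is done to isolate workloads that are granted replicas from the same underlying GPU: each workload has access to the GPU memory and runs in the same fault domain as all the others (meaning that if one workload crashes, they all do).

- With MPS, a control daemon manages access to the shared GPU. In contrast to time-slicing, MPS does space partitioning and allows memory and compute resources to be explicitly partitioned, with these limits enforced per workload.
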
## Using CUDA time-slicing

1. Create the configuration file

```bash
cat << EOF > /tmp/dp-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF
```

> If this configuration is applied to a node with 8 GPUs, the plugin will advertise 80 `nvidia.com/gpu` resources to Kubernetes instead of 8.

2. Update the NVIDIA-K8s-Device-Plugin

```bash
# upgrade --install updates the existing release, or installs it if it is absent
# (a plain `helm install` fails when the release already exists)
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-file config.map.config=/tmp/dp-config.yaml
```

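The number of resources advertised under time-slicing is simply the physical GPU count multiplied by `replicas`. As a small sketch of that arithmetic (the function name `advertised_gpus` is hypothetical):

```bash
# Advertised nvidia.com/gpu resources = physical GPUs x timeSlicing replicas.
advertised_gpus() {
  local physical_gpus="$1" replicas="$2"
  echo $(( physical_gpus * replicas ))
}

advertised_gpus 8 10   # prints 80
```

Comparing this value with the node's allocatable `nvidia.com/gpu` count is a quick way to confirm that the new configuration took effect.
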
## Using CUDA MPS

> Sharing with MPS is currently not supported on devices with MIG enabled.

1. Create the configuration file

```bash
cat << EOF > /tmp/dp-config.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF
```

> If this configuration is applied to a node with 8 GPUs, the plugin will advertise 80 `nvidia.com/gpu` resources to Kubernetes instead of 8. Each `nvidia.com/gpu: 1` request then gets one tenth of a card's resources.

2. Add the node label

```bash
kubectl label nodes ${node} nvidia.com/mps.capable=true
```

3. Update the NVIDIA-K8s-Device-Plugin

```bash
# upgrade --install updates the existing release, or installs it if it is absent
# (a plain `helm install` fails when the release already exists)
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-file config.map.config=/tmp/dp-config.yaml
```

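To get a feel for what one tenth of a card means in practice, the sketch below divides a card's memory evenly across the replicas. The even split follows the note in step 1 and is an assumption here, not a value read from the plugin; the function name `mps_share_mib` and the 80 GiB card size are hypothetical.

```bash
# Per-replica memory share of one GPU, assuming MPS splits the card evenly
# across replicas (assumption based on the note in step 1).
mps_share_mib() {
  local gpu_memory_mib="$1" replicas="$2"
  echo $(( gpu_memory_mib / replicas ))
}

mps_share_mib 81920 10   # prints 8192 (an 80 GiB card split ten ways)
```
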