> Author: 丁辉

# Deploying the NVIDIA K8s Device Plugin with Helm

## Introduction

**The NVIDIA K8s Device Plugin is a plugin for managing and configuring NVIDIA GPU devices in a Kubernetes cluster.** It lets containers in the cluster communicate and interact with GPUs, so workloads can use GPU compute power for high-performance tasks.

## GPU Containerization Prerequisites (Required)

[See this document](https://gitee.com/offends/Kubernetes/blob/main/GPU/%E5%AE%B9%E5%99%A8%E4%BD%BF%E7%94%A8GPU.md)

## Deployment

[GitHub repository](https://github.com/NVIDIA/k8s-device-plugin)

1. Add the Helm repository

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
```
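
> Optional check (not part of the original steps): list the chart versions now available from the newly added repo:
>
> ```bash
> helm search repo nvdp/nvidia-device-plugin --versions
> ```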

2. Label the GPU nodes

```bash
kubectl label nodes ${node} nvidia.com/gpu.present=true
```
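
> Optional check: confirm the label is in place by listing the nodes that carry it:
>
> ```bash
> kubectl get nodes -l nvidia.com/gpu.present=true
> ```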

3. Deploy the plugin

```bash
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace
```
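
> Optional check: the chart deploys a DaemonSet, so a plugin pod should reach Running on the GPU nodes:
>
> ```bash
> kubectl get pods -n nvidia-device-plugin -o wide
> ```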

4. Check whether the node now advertises the NVIDIA GPU resource

```bash
kubectl describe node ${node} | grep nvidia
```
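
> Optional check: query the allocatable GPU count directly with a JSONPath expression (uses the same `${node}` variable as above):
>
> ```bash
> kubectl get node ${node} -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
> ```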

## Uninstall

Uninstall nvidia-device-plugin:

```bash
helm uninstall nvidia-device-plugin -n nvidia-device-plugin
```
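
> Note: `helm uninstall` does not remove the namespace that was created with `--create-namespace`. If you no longer need it, delete it separately:
>
> ```bash
> kubectl delete namespace nvidia-device-plugin
> ```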

## Verifying the Result

1. Deploy a test Pod

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```
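
> The sample Pod runs a short vector-addition job and then exits. As an optional step, you can watch it until it finishes before reading the logs:
>
> ```bash
> kubectl get pod gpu-pod -w   # wait until STATUS shows Completed, then Ctrl+C
> ```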

2. Check the logs

```bash
kubectl logs gpu-pod
```

> Output like the following means the Pod can use GPU resources:
>
> ```bash
> [Vector addition of 50000 elements]
> Copy input data from the host memory to the CUDA device
> CUDA kernel launch with 196 blocks of 256 threads
> Copy output data from the CUDA device to the host memory
> Test PASSED
> Done
> ```

3. Clean up the test Pod

```bash
kubectl delete pod gpu-pod
```

# Shared Access to GPUs

[Official documentation](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#shared-access-to-gpus)

The NVIDIA device plugin allows GPUs to be oversubscribed through a set of extended options in its configuration file. Two sharing modes are available: time-slicing and MPS.

Note: time-slicing and MPS are mutually exclusive.

- With time-slicing, CUDA time-slicing is used to interleave workloads that share a GPU. However, nothing special is done to isolate workloads that receive replicas of the same underlying GPU: each workload has access to the full GPU memory and runs in the same fault domain as all the others (meaning that if one workload crashes, they all do).

- With MPS, a control daemon manages access to the shared GPU. In contrast to time-slicing, MPS partitions the GPU spatially: memory and compute resources can be explicitly partitioned, and those limits are enforced per workload.

## Using CUDA Time-Slicing

1. Create the configuration file

```bash
cat << EOF > /tmp/dp-config.yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
EOF
```

> If this configuration is applied to a node with 8 GPUs, the plugin now advertises 80 `nvidia.com/gpu` resources to Kubernetes instead of 8.

2. Update the NVIDIA K8s Device Plugin

```bash
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-file config.map.config=/tmp/dp-config.yaml
```
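
> Optional check: once the plugin pods have restarted with the new config, the node's advertised GPU count should reflect the replica setting (e.g. 10 × the number of physical GPUs):
>
> ```bash
> kubectl get node ${node} -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
> ```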

## Using CUDA MPS

> Sharing with MPS is currently not supported on devices with MIG enabled.

1. Create the configuration file

```bash
cat << EOF > /tmp/dp-config.yaml
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
EOF
```

> If this configuration is applied to a node with 8 GPUs, the plugin now advertises 80 `nvidia.com/gpu` resources to Kubernetes instead of 8. Under MPS, each card is divided into tenths, and a container requesting `nvidia.com/gpu: 1` receives one tenth of that card's resources.

2. Add the node label

```bash
kubectl label nodes ${node} nvidia.com/mps.capable=true
```
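
> Optional check: confirm the MPS label is in place:
>
> ```bash
> kubectl get nodes -l nvidia.com/mps.capable=true
> ```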

3. Update the NVIDIA K8s Device Plugin

```bash
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-file config.map.config=/tmp/dp-config.yaml
```
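
> Optional check (assumption: with MPS enabled, the chart also runs an MPS control daemon alongside the device plugin pods): verify that everything in the plugin namespace is Running before scheduling GPU workloads:
>
> ```bash
> kubectl get pods -n nvidia-device-plugin
> ```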