> 本文作者:丁辉 # Kubernetes-NVIDIA之阿里云插件Gpushare-Device-Plugin [Github-gpushare-scheduler-extender](https://github.com/AliyunContainerService/gpushare-scheduler-extender/tree/master) [Github-gpushare-device-plugin](https://github.com/AliyunContainerService/gpushare-device-plugin) ## 部署 1. 拉取 Gpushare-Scheduler-Extender 代码文件 ```bash git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git cd gpushare-scheduler-extender/config/ && vi gpushare-schd-extender.yaml ``` > 因为源文件的调度规则是指运行在 master 节点上,如果像我一样集群内并没有这个标签的话则需要修改 `gpushare-schd-extender.yaml` YAML文件将下面这段 nodeSelector 删除掉或注释。 > > ```bash > #nodeSelector: > #node-role.kubernetes.io/master: "" > ``` 2. 部署扩展器 ```bash kubectl apply -f gpushare-schd-extender.yaml ``` 3. 编写调度器配置文件 ```bash mv scheduler-policy-config.yaml /etc/kubernetes/scheduler-policy-config.yaml vi /etc/kubernetes/scheduler-policy-config.yaml ``` > 根据自己 Scheduler 配置文件位置修改 `kubeconfig` 字段参数 > 4. 添加 Scheduler 启动参数 - 新版 ```bash - --config=/etc/kubernetes/scheduler-policy-config.yaml ``` - 老版本集群 ```bash - --policy-config-file: /etc/kubernetes/scheduler-policy-config.json ``` 5. 给 GPU 节点打上标签 ```bash kubectl label node ${node} gpushare=true ``` 6. 拉取 Gpushare-Device-Plugin 代码文件 ```bash git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git cd gpushare-device-plugin ``` 7. 部署 ```bash kubectl apply -f device-plugin-rbac.yaml kubectl apply -f device-plugin-ds.yaml ``` > 根据自己需求判断是否修改 `device-plugin-ds.yaml` 文件内默认 GPU 资源申请单位, `MiB` 还是 `GiB` > > ```bash > command: > - gpushare-device-plugin-v2 > - -logtostderr > - --v=5 > - --memory-unit=MiB > ``` 8. 安装 kubectl GPU 插件 [插件下载](https://github.com/AliyunContainerService/gpushare-device-plugin/releases/) ```bash wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare ``` ```bash mkdir ~/.kube/plugins -p mv kubectl-inspect-gpushare ~/.kube/plugins/ && chmod 777 ~/.kube/plugins/kubectl-inspect-gpushare ``` 9. 查看 GPU 使用情况 ```bash kubectl inspect gpushare ``` ## 测试 1. 部署容器测试 ```bash kubectl apply -f https://gitee.com/offends/Kubernetes/raw/main/File/Yaml/aliyun-gpu-pod.yaml ``` 2. 测试 ```bash kubectl exec nginx-pod nvidia-smi ``` 3. 删除测试容器 ```bash kubectl delete pod gpu-pod ```