Kubernetes/GPU/Kubernetes-NVIDIA之阿里云插件Gpushare-Device-Plugin.md
offends 7a2f41e7d6
All checks were successful
continuous-integration/drone Build is passing
synchronization
2024-08-07 18:54:39 +08:00

2.8 KiB

本文作者:丁辉

Kubernetes-NVIDIA之阿里云插件Gpushare-Device-Plugin

Github-gpushare-scheduler-extender

Github-gpushare-device-plugin

部署

  1. 拉取 Gpushare-Scheduler-Extender 代码文件

    git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
    cd gpushare-scheduler-extender/config/ && vi gpushare-schd-extender.yaml
    

    因为源文件的调度规则是指运行在 master 节点上,如果像我一样集群内并没有这个标签的话则需要修改 gpushare-schd-extender.yaml YAML文件将下面这段 nodeSelector 删除掉或注释。

    #nodeSelector:
      #node-role.kubernetes.io/master: ""
    
  2. 部署扩展器

    kubectl apply -f gpushare-schd-extender.yaml
    
  3. 编写调度器配置文件

    mv scheduler-policy-config.yaml /etc/kubernetes/scheduler-policy-config.yaml
    vi /etc/kubernetes/scheduler-policy-config.yaml
    

    根据自己 Scheduler 配置文件位置修改 kubeconfig 字段参数

  4. 添加 Scheduler 启动参数

    • 新版

      - --config=/etc/kubernetes/scheduler-policy-config.yaml
      
    • 老版本集群

      - --policy-config-file: /etc/kubernetes/scheduler-policy-config.json
      
  5. 给 GPU 节点打上标签

    kubectl label node ${node} gpushare=true
    
  6. 拉取 Gpushare-Device-Plugin 代码文件

    git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git
    cd gpushare-device-plugin
    
  7. 部署

    kubectl apply -f device-plugin-rbac.yaml
    kubectl apply -f  device-plugin-ds.yaml
    

    根据自己需求判断是否修改 device-plugin-ds.yaml 文件内默认 GPU 资源申请单位, MiB 还是 GiB

    command:
      - gpushare-device-plugin-v2
      - -logtostderr
      - --v=5
      - --memory-unit=MiB
    
  8. 安装 kubectl GPU 插件

    插件下载

    wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
    
    mkdir ~/.kube/plugins -p
    mv kubectl-inspect-gpushare ~/.kube/plugins/ && chmod 777 ~/.kube/plugins/kubectl-inspect-gpushare
    
  9. 查看 GPU 使用情况

    kubectl inspect gpushare
    

测试

  1. 部署容器测试

    kubectl apply -f https://gitee.com/offends/Kubernetes/raw/main/File/Yaml/aliyun-gpu-pod.yaml
    
  2. 测试

    kubectl exec nginx-pod nvidia-smi
    
  3. 删除测试容器

    kubectl delete pod gpu-pod