> Author: 丁辉

# Kubernetes-NVIDIA: Alibaba Cloud Gpushare-Device-Plugin

[Github-gpushare-scheduler-extender](https://github.com/AliyunContainerService/gpushare-scheduler-extender/tree/master)

[Github-gpushare-device-plugin](https://github.com/AliyunContainerService/gpushare-device-plugin)

## Deployment

1. Clone the Gpushare-Scheduler-Extender source

```bash
git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
cd gpushare-scheduler-extender/config/ && vi gpushare-schd-extender.yaml
```

> The upstream manifest schedules the extender onto master nodes. If, as in my cluster, no node carries that label, edit `gpushare-schd-extender.yaml` and delete or comment out the following nodeSelector.
>
> ```yaml
> #nodeSelector:
> #  node-role.kubernetes.io/master: ""
> ```

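
Before editing the manifest, it can help to check whether any node actually carries the label. The sketch below wraps `kubectl get nodes -l` in a small helper. The function name `nodes_with_label` is hypothetical, and the `node-role.kubernetes.io/control-plane` example assumes a newer kubeadm-style cluster that uses it in place of the `master` label.

```bash
# Print nodes that carry the given label key, one per line, as node/<name>.
# nodes_with_label is an illustrative helper, not part of gpushare.
nodes_with_label() {
  local label="$1"
  kubectl get nodes -l "$label" -o name
}

# Example (run against your own cluster):
#   nodes_with_label node-role.kubernetes.io/master
#   nodes_with_label node-role.kubernetes.io/control-plane
# Empty output means no node has the label, so the nodeSelector has to be
# removed or changed before the extender Pod can be scheduled.
```
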
2. Deploy the extender

```bash
kubectl apply -f gpushare-schd-extender.yaml
```

3. Prepare the scheduler configuration file

```bash
mv scheduler-policy-config.yaml /etc/kubernetes/scheduler-policy-config.yaml
vi /etc/kubernetes/scheduler-policy-config.yaml
```

> Set the `kubeconfig` field to the path of the kubeconfig file your scheduler uses.

4. Add the scheduler startup flag

- Newer clusters

```yaml
- --config=/etc/kubernetes/scheduler-policy-config.yaml
```

- Older clusters

```yaml
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
```

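
On a kubeadm-style cluster the flag is typically added to the `command` list of the scheduler's static Pod manifest, and the configuration file also has to be mounted into that Pod. The following sketch only checks that the flag is present. The manifest path `/etc/kubernetes/manifests/kube-scheduler.yaml` and the helper name `has_scheduler_flag` are assumptions, so adjust them to your environment.

```bash
# Return 0 if the manifest passes the given flag to kube-scheduler.
# has_scheduler_flag is an illustrative helper; -F treats the flag as a
# literal string and -- stops grep from parsing it as an option.
has_scheduler_flag() {
  local manifest="$1" flag="$2"
  grep -qF -- "$flag" "$manifest"
}

# Example (assumed kubeadm manifest path):
#   has_scheduler_flag /etc/kubernetes/manifests/kube-scheduler.yaml \
#     '--config=/etc/kubernetes/scheduler-policy-config.yaml' && echo "flag present"
```
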
5. Label the GPU nodes

```bash
kubectl label node ${node} gpushare=true
```

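
When several GPU nodes need the label, the command can be wrapped in a loop. This is a minimal sketch: the helper name `label_gpu_nodes` and the node names in the example are hypothetical.

```bash
# Apply gpushare=true to every node name passed as an argument.
# --overwrite keeps the command from failing if the label already exists.
label_gpu_nodes() {
  local node
  for node in "$@"; do
    kubectl label node "$node" gpushare=true --overwrite
  done
}

# Example:
#   label_gpu_nodes gpu-node-1 gpu-node-2
#   kubectl get nodes -l gpushare=true   # list the labeled nodes
```
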
6. Clone the Gpushare-Device-Plugin source

```bash
git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git
cd gpushare-device-plugin
```

7. Deploy the device plugin

```bash
kubectl apply -f device-plugin-rbac.yaml
kubectl apply -f device-plugin-ds.yaml
```

> Depending on your needs, decide whether to change the default unit for GPU memory requests in `device-plugin-ds.yaml`: `MiB` or `GiB`.
>
> ```yaml
> command:
> - gpushare-device-plugin-v2
> - -logtostderr
> - --v=5
> - --memory-unit=MiB
> ```

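
The unit chosen here determines how the numbers in a Pod's GPU memory request are read. Assuming the request uses the `aliyun.com/gpu-mem` resource from the upstream project, the same 3 GiB request would be written as `3` with `--memory-unit=GiB` and as `3072` with `--memory-unit=MiB`. A minimal conversion helper (the name `gib_to_mib` is hypothetical):

```bash
# Convert a whole number of GiB to MiB (1 GiB = 1024 MiB).
gib_to_mib() {
  echo $(( $1 * 1024 ))
}

gib_to_mib 3   # prints 3072
```
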
8. Install the kubectl GPU plugin

[Plugin download](https://github.com/AliyunContainerService/gpushare-device-plugin/releases/)

```bash
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
```

```bash
mkdir -p ~/.kube/plugins
mv kubectl-inspect-gpushare ~/.kube/plugins/ && chmod 755 ~/.kube/plugins/kubectl-inspect-gpushare
```

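
Recent kubectl versions discover plugins by searching `PATH` for executables named `kubectl-*`, so `~/.kube/plugins` may need to be added to `PATH` before the plugin is found. A sketch of one way to do that, with the helper name `path_with_dir` being hypothetical:

```bash
# Print a PATH value that includes the given directory without duplicating it.
path_with_dir() {
  local dir="$1" path="$2"
  case ":$path:" in
    *":$dir:"*) echo "$path" ;;        # already present: leave PATH unchanged
    *)          echo "$dir:$path" ;;   # otherwise prepend the directory
  esac
}

# Example:
#   export PATH="$(path_with_dir "$HOME/.kube/plugins" "$PATH")"
#   kubectl plugin list   # lists the kubectl-* executables found on PATH
```
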
9. Check GPU usage

```bash
kubectl inspect gpushare
```

## Testing

1. Deploy a test Pod

```bash
kubectl apply -f https://gitee.com/offends/Kubernetes/raw/main/File/Yaml/aliyun-gpu-pod.yaml
```

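
The manifest above is fetched from a remote URL. For reference, a Pod that requests shared GPU memory might look like the sketch below, which prints a manifest to stdout. The `aliyun.com/gpu-mem` resource name follows the upstream project. The helper name `gpu_pod_manifest`, the image placeholder, and the request size are assumptions and may differ from the referenced file.

```bash
# Print a Pod manifest that requests <mem> units of shared GPU memory.
# The unit (MiB or GiB) is whatever --memory-unit the device plugin uses.
# gpu_pod_manifest is an illustrative helper, not part of gpushare.
gpu_pod_manifest() {
  local name="$1" image="$2" mem="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: ${name}
      image: ${image}
      command: ["sleep", "infinity"]
      resources:
        limits:
          aliyun.com/gpu-mem: ${mem}
EOF
}

# Example (the image name is a placeholder for any CUDA-capable image):
#   gpu_pod_manifest gpu-pod your-registry/cuda-image:tag 1024 | kubectl apply -f -
```
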
2. Run `nvidia-smi` inside the Pod

```bash
# Replace gpu-pod if the manifest defines a different Pod name
kubectl exec gpu-pod -- nvidia-smi
```

3. Delete the test Pod

```bash
kubectl delete pod gpu-pod
```