synchronization
122
GPU/Kubernetes-NVIDIA之阿里云插件Gpushare-Device-Plugin.md
Normal file
@@ -0,0 +1,122 @@
> Author: 丁辉

# Kubernetes-NVIDIA: Alibaba Cloud Gpushare-Device-Plugin

[Github-gpushare-scheduler-extender](https://github.com/AliyunContainerService/gpushare-scheduler-extender/tree/master)

[Github-gpushare-device-plugin](https://github.com/AliyunContainerService/gpushare-device-plugin)

## Deployment

1. Fetch the Gpushare-Scheduler-Extender source

```bash
git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
cd gpushare-scheduler-extender/config/ && vi gpushare-schd-extender.yaml
```

> The upstream manifest schedules the extender onto master nodes. If, as in my cluster, no node carries that label, edit `gpushare-schd-extender.yaml` and delete or comment out the `nodeSelector` block shown here.
>
> ```bash
> #nodeSelector:
> #node-role.kubernetes.io/master: ""
> ```

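
The manual edit in the note above can also be scripted. The following is a minimal sketch, not part of the upstream project: the function name `comment_out_master_selector` is hypothetical, and it assumes the selector appears in exactly the two-line form quoted in the note. A `.bak` copy of the file is kept next to the original.

```bash
# Hypothetical helper: comment out the master nodeSelector in a manifest.
# Assumes the selector is written exactly as the two lines quoted above.
comment_out_master_selector() {
  # $1: path to the manifest; edited in place, original saved as $1.bak
  sed -i.bak -E \
    -e 's/^([[:space:]]*)(nodeSelector:)[[:space:]]*$/\1#\2/' \
    -e 's/^([[:space:]]*)(node-role\.kubernetes\.io\/master: "")[[:space:]]*$/\1#\2/' \
    "$1"
}
```

Calling `comment_out_master_selector gpushare-schd-extender.yaml` from the `config/` directory should leave every other line of the manifest untouched; lines that are already commented out do not match and are left alone.
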
2. Deploy the extender

```bash
kubectl apply -f gpushare-schd-extender.yaml
```

3. Prepare the scheduler configuration file

```bash
mv scheduler-policy-config.yaml /etc/kubernetes/scheduler-policy-config.yaml
vi /etc/kubernetes/scheduler-policy-config.yaml
```

> Adjust the `kubeconfig` field to match the location of your own Scheduler's configuration file.

4. Add the Scheduler startup flag

- Newer clusters

```bash
- --config=/etc/kubernetes/scheduler-policy-config.yaml
```

- Older clusters

```bash
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
```

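
Before editing the scheduler's arguments, it can help to see which of the two flags is already present. The sketch below is an illustration only: the function name `scheduler_config_flag` is hypothetical, and the default path is the kubeadm static pod location, which is an assumption and may differ in your cluster.

```bash
# Hypothetical helper: report which scheduler flag a manifest already sets.
# Prints "config", "policy-config-file", or "none".
scheduler_config_flag() {
  # The default path assumes a kubeadm-style static pod manifest.
  local manifest="${1:-/etc/kubernetes/manifests/kube-scheduler.yaml}"
  if grep -q -e '--config=' "$manifest"; then
    echo "config"
  elif grep -q -e '--policy-config-file=' "$manifest"; then
    echo "policy-config-file"
  else
    echo "none"
  fi
}
```

If the function prints `none`, add whichever flag from step 4 matches your cluster version; otherwise update the existing flag's value rather than adding a second one.
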
5. Label the GPU nodes

```bash
kubectl label node ${node} gpushare=true
```

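
When several GPU nodes need the label, the command in step 5 can be wrapped in a loop. This is a minimal sketch: the function name `label_gpushare_nodes` and the `KUBECTL` override variable are hypothetical conveniences, and `--overwrite` is added so that re-running the function on an already labelled node does not fail.

```bash
# KUBECTL can be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: apply gpushare=true to every node name passed in.
label_gpushare_nodes() {
  local node
  for node in "$@"; do
    "$KUBECTL" label node "$node" gpushare=true --overwrite
  done
}
```

For example, `label_gpushare_nodes gpu-node-1 gpu-node-2` labels two nodes, where both node names are placeholders for your own.
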
6. Fetch the Gpushare-Device-Plugin source

```bash
git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git
cd gpushare-device-plugin
```

7. Deploy the device plugin

```bash
kubectl apply -f device-plugin-rbac.yaml
kubectl apply -f device-plugin-ds.yaml
```

> Decide whether the default unit for GPU memory requests in `device-plugin-ds.yaml` suits your needs: `MiB` or `GiB`.
>
> ```bash
> command:
> - gpushare-device-plugin-v2
> - -logtostderr
> - --v=5
> - --memory-unit=MiB
> ```

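
Switching the unit described in the note above comes down to changing one argument in `device-plugin-ds.yaml`. The sketch below shows one way to do that with `sed`; the function name `set_memory_unit` is hypothetical, and it assumes the manifest contains a single `--memory-unit=` argument in the form shown in the note.

```bash
# Hypothetical helper: set --memory-unit in a manifest to MiB or GiB.
set_memory_unit() {
  # $1: manifest path, $2: desired unit
  case "$2" in
    MiB|GiB) ;;
    *) echo "unit must be MiB or GiB" >&2; return 1 ;;
  esac
  # Edit in place; the original is saved as $1.bak
  sed -i.bak -E "s/(--memory-unit=)(MiB|GiB)/\1$2/" "$1"
}
```

After `set_memory_unit device-plugin-ds.yaml GiB`, re-run `kubectl apply -f device-plugin-ds.yaml` so the DaemonSet picks up the change.
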
8. Install the kubectl GPU plugin

[Plugin downloads](https://github.com/AliyunContainerService/gpushare-device-plugin/releases/)

```bash
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
```

```bash
mkdir -p ~/.kube/plugins
mv kubectl-inspect-gpushare ~/.kube/plugins/ && chmod 755 ~/.kube/plugins/kubectl-inspect-gpushare
# kubectl discovers plugins by searching PATH, so make sure this directory is on it
export PATH="$PATH:$HOME/.kube/plugins"
```

9. Check GPU usage

```bash
kubectl inspect gpushare
```

## Testing

1. Deploy a test pod

```bash
kubectl apply -f https://gitee.com/offends/Kubernetes/raw/main/File/Yaml/aliyun-gpu-pod.yaml
```

2. Run the test

```bash
kubectl exec gpu-pod -- nvidia-smi
```

3. Delete the test pod

```bash
kubectl delete pod gpu-pod
```

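
The test pod in step 1 comes from a remote manifest whose contents are not reproduced here. As a rough sketch of what a pod requesting shared GPU memory can look like, the function below prints a manifest that asks for the `aliyun.com/gpu-mem` extended resource advertised by the gpushare device plugin. The function name `gpushare_pod_manifest`, the pod name, the image, and the amount are all hypothetical placeholders, and the amount is interpreted in whichever unit `--memory-unit` is set to.

```bash
# Hypothetical helper: print a Pod manifest that requests shared GPU memory.
# $1: pod name, $2: container image, $3: amount of aliyun.com/gpu-mem
gpushare_pod_manifest() {
  local name="$1" image="$2" gpu_mem="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: ${name}
      image: ${image}
      resources:
        limits:
          aliyun.com/gpu-mem: ${gpu_mem}
EOF
}
```

The output can be piped straight into kubectl, for example `gpushare_pod_manifest gpu-pod <your-cuda-image> 3 | kubectl apply -f -`, where the image is one you supply that includes `nvidia-smi`.
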
10
GPU/README.md
Normal file
@@ -0,0 +1,10 @@
> Author: 丁辉

# Using GPUs

> Related documentation

- [Downloading and installing GPU drivers on Linux](https://gitee.com/offends/Linux/blob/main/Docs/Linux%E4%B8%8B%E8%BD%BD%E5%B9%B6%E5%AE%89%E8%A3%85GPU%E9%A9%B1%E5%8A%A8.md)
- [Preparing the base environment for GPU containers](https://gitee.com/offends/Kubernetes/blob/main/Docker/Docs/Docker%E4%BD%BF%E7%94%A8GPU.md)
- [Deploying the NVIDIA-K8s-Device-Plugin with Helm](https://gitee.com/offends/Kubernetes/blob/main/Helm/Helm%E9%83%A8%E7%BD%B2NVIDIA-K8s-Device-Plugin.md)
- [Kubernetes-NVIDIA: Alibaba Cloud Gpushare-Device-Plugin](https://gitee.com/offends/Kubernetes/blob/main/GPU/Kubernetes-NVIDIA%E4%B9%8B%E9%98%BF%E9%87%8C%E4%BA%91%E6%8F%92%E4%BB%B6Gpushare-Device-Plugin.md)