synchronization

2025-08-25 17:53:08 +08:00
commit c201eb5ef9
318 changed files with 23092 additions and 0 deletions


@@ -0,0 +1,122 @@
> Author: 丁辉
# Kubernetes-NVIDIA: Alibaba Cloud Gpushare-Device-Plugin
[Github-gpushare-scheduler-extender](https://github.com/AliyunContainerService/gpushare-scheduler-extender/tree/master)
[Github-gpushare-device-plugin](https://github.com/AliyunContainerService/gpushare-device-plugin)
## Deployment
1. Clone the Gpushare-Scheduler-Extender source
```bash
git clone https://github.com/AliyunContainerService/gpushare-scheduler-extender.git
cd gpushare-scheduler-extender/config/ && vi gpushare-schd-extender.yaml
```
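> The stock manifest pins the extender to master nodes through a nodeSelector. If, as in my cluster, no node carries that label, edit `gpushare-schd-extender.yaml` and delete or comment out the nodeSelector shown below.
>
> ```yaml
> #nodeSelector:
> #  node-role.kubernetes.io/master: ""
> ```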
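
The edit described above can also be scripted. The following is a minimal sketch with `sed`, assuming `nodeSelector:` sits alone on its line and is followed by the `node-role.kubernetes.io/master` entry; the function name `comment_out_master_selector` is hypothetical, and a `.bak` copy of the manifest is kept.

```shell
# Hypothetical helper: comment out the master nodeSelector in a manifest.
# Assumes "nodeSelector:" is alone on its line and that its only entry is
# node-role.kubernetes.io/master. Writes a backup to <manifest>.bak.
comment_out_master_selector() {
  sed -E -i.bak \
    -e '/^[[:space:]]*nodeSelector:[[:space:]]*$/ s/^([[:space:]]*)/\1#/' \
    -e '/^[[:space:]]*node-role\.kubernetes\.io\/master:/ s/^([[:space:]]*)/\1#/' \
    "$1"
}
```

A call such as `comment_out_master_selector gpushare-schd-extender.yaml` keeps the original indentation and only prefixes the two matching lines with `#`.
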
2. Deploy the extender
```bash
kubectl apply -f gpushare-schd-extender.yaml
```
3. Set up the scheduler configuration file
```bash
mv scheduler-policy-config.yaml /etc/kubernetes/scheduler-policy-config.yaml
vi /etc/kubernetes/scheduler-policy-config.yaml
```
> Change the `kubeconfig` field to match the location of your own scheduler's kubeconfig file.
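
Before restarting the scheduler it can help to confirm that the `kubeconfig` path in the file really exists on the host. The sketch below assumes the field appears as a plain `kubeconfig: <path>` line; both function names are hypothetical.

```shell
# Hypothetical helper: print the first "kubeconfig:" value found in a file,
# with optional surrounding double quotes removed.
kubeconfig_path() {
  sed -n -E 's/^[[:space:]]*kubeconfig:[[:space:]]*"?([^"[:space:]]+)"?[[:space:]]*$/\1/p' "$1" | head -n 1
}

# Succeeds only when the referenced kubeconfig exists on this host.
check_kubeconfig() {
  path="$(kubeconfig_path "$1")"
  if [ -z "$path" ]; then
    echo "no kubeconfig field found in $1" >&2
    return 1
  fi
  if [ ! -f "$path" ]; then
    echo "kubeconfig not found: $path" >&2
    return 1
  fi
  echo "kubeconfig OK: $path"
}
```

For the file from this step, the call would be `check_kubeconfig /etc/kubernetes/scheduler-policy-config.yaml`.
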
4. Add the scheduler startup flag
- Newer clusters
```bash
- --config=/etc/kubernetes/scheduler-policy-config.yaml
```
- Older clusters
```bash
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
```
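
To verify which of the two flags a scheduler manifest currently sets, a small check like the one below may be useful. It is a sketch only: the function name is hypothetical, and the kubeadm-style path `/etc/kubernetes/manifests/kube-scheduler.yaml` mentioned afterwards is an assumption about where your scheduler manifest lives.

```shell
# Hypothetical helper: report which scheduler config flag a manifest sets.
# Prints "config", "policy-config-file", or "none" (and fails for "none").
scheduler_flag_in() {
  manifest="$1"
  if grep -q -e '--config=' "$manifest"; then
    echo "config"
  elif grep -q -e '--policy-config-file=' "$manifest"; then
    echo "policy-config-file"
  else
    echo "none"
    return 1
  fi
}
```

On a kubeadm-style control plane this might be run as `scheduler_flag_in /etc/kubernetes/manifests/kube-scheduler.yaml`. The pattern `--config=` does not match `--kubeconfig=`, so an unmodified manifest reports `none`.
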
5. Label the GPU nodes
```bash
kubectl label node ${node} gpushare=true
```
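
When several GPU nodes need the label, the command above can be wrapped in a loop. This is a minimal sketch with hypothetical function names; `--overwrite` is added so that re-running it on an already labelled node does not fail.

```shell
# Hypothetical helper: apply gpushare=true to every node named on the
# command line, stopping at the first failure.
label_gpu_nodes() {
  for node in "$@"; do
    kubectl label node "$node" gpushare=true --overwrite || return 1
  done
}

# List the nodes that currently carry the label.
list_gpu_nodes() {
  kubectl get nodes -l gpushare=true -o name
}
```

For example, `label_gpu_nodes gpu-a gpu-b && list_gpu_nodes`, where `gpu-a` and `gpu-b` are placeholder node names.
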
6. Clone the Gpushare-Device-Plugin source
```bash
git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git
cd gpushare-device-plugin
```
7. Deploy the device plugin
```bash
kubectl apply -f device-plugin-rbac.yaml
kubectl apply -f device-plugin-ds.yaml
```
> Depending on your needs, decide whether to change the default GPU memory request unit in `device-plugin-ds.yaml`: `MiB` or `GiB`.
>
> ```yaml
> command:
>   - gpushare-device-plugin-v2
>   - -logtostderr
>   - --v=5
>   - --memory-unit=MiB
> ```
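
The unit chosen here changes the numbers pods must request. As a rough illustration, the sketch below converts an amount given in GiB into the number of units to ask for under each setting; the function name is hypothetical, and the resource name `aliyun.com/gpu-mem` mentioned afterwards is taken from the upstream examples and is an assumption here.

```shell
# Hypothetical helper: how many units to request for <gib> GiB of GPU memory.
# With --memory-unit=MiB one unit is 1 MiB; with GiB one unit is 1 GiB.
gpu_mem_units() {
  gib="$1"
  unit="$2"
  case "$unit" in
    MiB) echo $(( gib * 1024 )) ;;
    GiB) echo "$gib" ;;
    *) echo "unknown unit: $unit" >&2; return 1 ;;
  esac
}
```

With `--memory-unit=MiB`, a pod that needs 3 GiB would request `gpu_mem_units 3 MiB`, that is 3072, for `aliyun.com/gpu-mem`; with `GiB` the same pod would request 3.
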
8. Install the kubectl GPU inspection plugin
[Plugin downloads](https://github.com/AliyunContainerService/gpushare-device-plugin/releases/)
```bash
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
```
```bash
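mkdir -p ~/.kube/plugins
mv kubectl-inspect-gpushare ~/.kube/plugins/ && chmod 755 ~/.kube/plugins/kubectl-inspect-gpushare
# Recent kubectl versions discover plugins by searching PATH for kubectl-* executables
export PATH="$PATH:$HOME/.kube/plugins"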
```
9. Check GPU usage
```bash
kubectl inspect gpushare
```
## Testing
1. Deploy a test pod
```bash
kubectl apply -f https://gitee.com/offends/Kubernetes/raw/main/File/Yaml/aliyun-gpu-pod.yaml
```
2. Run the test
```bash
kubectl exec gpu-pod -- nvidia-smi
```
3. Delete the test pod
```bash
kubectl delete pod gpu-pod
```
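
The three test steps can be chained into one smoke test. The sketch below is hypothetical (including the function name) and assumes the manifest creates a pod named `gpu-pod`, the name used by the delete step; it waits for the pod to become Ready before running `nvidia-smi`, and deletes the pod whether or not the check succeeds.

```shell
# Hypothetical helper chaining apply, exec and delete.
# $1: manifest path or URL; $2: pod name (assumed default: gpu-pod).
gpu_smoke_test() {
  manifest="$1"
  pod="${2:-gpu-pod}"
  kubectl apply -f "$manifest" || return 1
  # Block until the pod is Ready so exec does not race container start-up.
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=120s || return 1
  kubectl exec "$pod" -- nvidia-smi
  status=$?
  # Clean up regardless of the nvidia-smi result, then report that result.
  kubectl delete pod "$pod"
  return "$status"
}
```

The function returns the exit status of `nvidia-smi`, so it can be used directly in a script or CI job.
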

GPU/README.md Normal file

@@ -0,0 +1,10 @@
> Author: 丁辉
# Using GPUs
> Related documentation
- [Downloading and installing the GPU driver on Linux](https://gitee.com/offends/Linux/blob/main/Docs/Linux%E4%B8%8B%E8%BD%BD%E5%B9%B6%E5%AE%89%E8%A3%85GPU%E9%A9%B1%E5%8A%A8.md)
- [Preparing the base environment for GPU containers](https://gitee.com/offends/Kubernetes/blob/main/Docker/Docs/Docker%E4%BD%BF%E7%94%A8GPU.md)
- [Deploying the NVIDIA-K8s-Device-Plugin with Helm](https://gitee.com/offends/Kubernetes/blob/main/Helm/Helm%E9%83%A8%E7%BD%B2NVIDIA-K8s-Device-Plugin.md)
- [Kubernetes-NVIDIA: Alibaba Cloud Gpushare-Device-Plugin](https://gitee.com/offends/Kubernetes/blob/main/GPU/Kubernetes-NVIDIA%E4%B9%8B%E9%98%BF%E9%87%8C%E4%BA%91%E6%8F%92%E4%BB%B6Gpushare-Device-Plugin.md)