synchronization

commit 33f9b3ce46
2025-08-25 16:04:00 +08:00
1951 changed files with 854396 additions and 0 deletions

@@ -0,0 +1,30 @@
# Adopters Of GPUShare Scheduler
Below are the adopters of the GPUShare Scheduler project. If you are using GPUShare to improve GPU utilization in Kubernetes, feel free to add yourself to the list below via a pull request. Adoption typically goes through the following phases:
* **Evaluation:** Aware of the GPUShare Scheduler and evaluating its features and scope
* **Testing:** Considering the GPUShare Scheduler as a candidate and testing it in a Kubernetes cluster
* **Staging:** Decided to use the GPUShare Scheduler and validating it in a pre-production environment
* **Production:** Running the GPUShare Scheduler in a production environment
| Organization | Contact | Phases | Description of Use |
| ------------ | ------- | ----------- | ------------------ |
| [JianPei](http://www.jianpeicn.com/) | [@lisongtao716](https://github.com/lisongtao716) | Testing | Medical image analysis |
| [Unisound](https://www.unisound.com/) | [@xieydd](https://github.com/xieydd) | Testing | Unisound ATLAS AI Training Platform |
| [Bytedance](https://www.bytedance.com) | [@ryzzn](https://github.com/ryzzn) | Testing | Render Platform |
| [TIANCHI](https://tianchi.aliyun.com) | [@gaoxiaos](https://github.com/gaoxiaos) | Staging | AI Competition Platform|
| [TAL AI](https://ai.100tal.com) | [@asas12350](https://github.com/asas12350) | **Production** | AI Inference Service Platform|
| [HuyaTech](https://www.huya.com) | [@BobLiu20](https://github.com/BobLiu20) | **Production** | HUYA AI Platform |
| [QTT BigData](http://www.qutoutiao.net/) | [@OopsOutOfMemory](https://github.com/OopsOutOfMemory) | **Production** | QTT AI Platform |
| [Taobao](http://www.taobao.com) | [@zxthunter](https://github.com/zxthunter) | **Production** | NU Algorithm Deployment Platform |
| [Heuritech](http://www.heuritech.com) | [@heuritech](https://github.com/heuritech) | **Production** | AI Inference for Fashion |
| [AliyunIoT](https://iot.aliyun.com/) | [@falltodis](https://github.com/falltodis) | **Production** | IoT Edge AI Platform |
| [Jiangsu Telecom](https://wapjs.189.cn/) | [@yangyuliufeng](https://github.com/yangyuliufeng) | **Production** | AI Platform on k8s |
| [Aliyun Industry Brain](https://et.aliyun.com/brain/industry) | [@xlk23](https://github.com/xlk23) | **Production** | EPIC Platform |
| [Weibo](https://www.weibo.com) | [@phoenixwu0229](https://github.com/phoenixwu0229) | **Production** | Weibo ML Platform |
| [Zuo Ye Bang](http://www.zuoyebang.com) | [@xkos](https://github.com/xkos) | **Production** | AI Platform on k8s |
| [Hellobike](https://www.helloglobal.com) | [@gwl-wolf](https://github.com/gwl-wolf) | **Production** | AIBrain Platform |
| [Gomo](https://www.gomo.com) | [@cxxx](https://github.com/cxxx) | **Production** | Image conversion |
| [Qihoo 360](https://www.360.cn) | [@70data](https://github.com/70data) | **Production** | Private Cloud Platform on K8s |
| [DIDI](https://www.didiglobal.com/) | [@tongchao199](https://github.com/tongchao199) | **Production** | AI Experimental Environment Service <br> AI Inference Service |
| [Mango TV](https://www.mgtv.com) | [@ftx0day](https://github.com/ftx0day) | **Production** | Mango CloudNative AI Platform |

@@ -0,0 +1,105 @@
# GPU Sharing in Kubernetes
## Background
The Kubernetes infrastructure enforces exclusive GPU assignment: a GPU cannot be shared across pods. This is limiting for users who want to exploit the sharing capabilities of NVIDIA GPUs to increase GPU utilization in a cluster.
Exclusive assignment provides better isolation and ensures that the GPU usage of one application is not affected by others, which is well suited to deep learning training scenarios; for model development and model inference, however, it is usually wasteful. In general, when we talk about shared GPU support at the cluster level, we think about two concepts:
1. Isolation: the basis for sharing a device, covering fault isolation, memory isolation, and real parallelism among the containers sharing the resource at runtime. It is inherently determined by the hardware and by the software controlling that device on the node, such as MPS (Multi-Process Service). Kubernetes can help very little here.
2. Scheduling: Kubernetes should help users express how devices should be shared and should guarantee, at the scheduling level, that devices are not oversubscribed according to that specification. However, Kubernetes cannot in any measure enforce this at the runtime level.
For fine-grained GPU device scheduling there is currently no good solution. Extended resources such as GPU and RDMA in Kubernetes are restricted to whole numbers and cannot describe fractional allocations; for example, a user cannot request 0.5 GPU in a Kubernetes cluster. The essential problem is that multi-device GPU sharing is a vector resource problem, while extended resources describe scalar resources.
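To illustrate the whole-number restriction, a standard extended-resource request must be an integer count of whole devices (a minimal sketch using the conventional `nvidia.com/gpu` resource name; a fractional value such as 0.5 would be rejected by the API server):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-only
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1   # must be an integer; "0.5" is not accepted
```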
## User Story
- As a cluster administrator, I want to increase the GPU utilization of the cluster; during development, multiple users can share the same model development environment on the same GPU.
- As an application operator, I want to be able to run multiple inference tasks on the same GPU at the same time.
## Goals
- Allow users to express requests for sharing a resource, and guarantee that the GPU cannot be oversubscribed at the scheduling level.
## Non Goals
- Isolation of this shared resource
- Oversubscription
## Design Principles
- Although there are two ways to measure GPU capability (CUDA cores and GPU memory), in inference scenarios we can assume that the number of CUDA cores and the amount of GPU memory are proportional.
- Leverage Extended Resources to express device-sharing requests by changing the unit of measure from "number of GPUs" to "amount of GPU memory in MiB". If the node has a single GPU with 16 GiB of memory, it is expressed as 16276 MiB.
- Shared GPUs are requested for model development and inference scenarios. In these cases the GPU resource requested by the user never exceeds one GPU; that is, the resource limit of the application is a single GPU.
- Do not change any Kubernetes core code; only leverage the extended resource, scheduler extender, and device plugin mechanisms.
## Design
Define two new Extended Resources: the first is gpu-mem, which corresponds to GPU memory; the second is gpu-count, which corresponds to the number of GPU devices.
The diagram below describes the architecture:
![](arch.jpg)
### Core components
- **GPU Share Scheduler Extender**: built on the Kubernetes scheduler extender mechanism. During the global scheduler's Filter and Bind phases it determines whether a single GPU device on the node can provide enough GPU memory, and at Bind time it records the GPU allocation result in the pod annotations for later use.
- **GPU Share Device Plugin**: built on the Device Plugin mechanism. It allocates the GPU device according to the decision of the GPU Share Scheduler Extender recorded in the pod annotations.
### Process
#### 1\. Device Resource Report
The GPU Share Device Plugin uses the nvml library to query the number of GPU devices and the memory of each device. It reports the total GPU memory of the node (device count * per-device memory) to the Kubelet via `ListAndWatch()`, and the Kubelet reports it to the Kubernetes API Server.
If the node has 2 GPUs and each GPU has 16276 MiB, the GPU memory of the node is 16276 * 2 = 32552 MiB. In addition, the number of GPU devices on the node is reported as another Extended Resource.
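For illustration, on such a node the reported capacity would look roughly as follows (output abridged and hypothetical; `aliyun.com/gpu-mem` matches the resource name used in the User Guide, and `aliyun.com/gpu-count` is assumed here for the device-count resource):
```bash
# kubectl describe node <gpu-node> | grep aliyun.com/gpu
  aliyun.com/gpu-count:  2
  aliyun.com/gpu-mem:    32552
```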
#### 2\. Schedule
The GPU Share Scheduler Extender records the allocation information in annotations and, based on this information, determines whether each individual GPU has enough gpu-mem while the scheduler is filtering nodes.
2.1. After the Kubernetes scheduler has run all of its default filters, it calls the filter method of the GPU Share Scheduler Extender over HTTP. This is necessary because the default scheduler only checks whether the total amount of a node's extended resource can cover the request; it cannot determine whether the request fits on a single device. The GPU Share Scheduler Extender therefore checks whether a single device has enough available resources.
The following figure shows an example. In a Kubernetes cluster with 3 nodes, each with 2 GPU devices, a user requests `gpu-mem=8138`. The default scheduler scans all nodes and finds that the remaining resources of N1 (16276 * 2 - 16276 - 12207 = 4069) do not meet the requirement, so node N1 is filtered out.
The remaining resources of nodes N2 and N3 are both 8138 MiB, so both pass the default scheduler's check. At this point, the default scheduler delegates to the GPU Share Scheduler Extender for secondary filtering.
During the secondary filtering, the GPU Share Scheduler Extender must determine whether a single GPU device meets the resource requirement. When checking node N2, it finds that although the node has 8138 MiB available, the memory is spread across two devices: GPU0 and GPU1 each have only 4069 MiB available, which cannot satisfy the single-device requirement of 8138 MiB.
Node N3 also has 8138 MiB available in total, but all of it belongs to GPU0, which satisfies the single-device requirement. Accurate scheduling is thus achieved through the GPU Share Scheduler Extender's filtering.
![](filter.jpg)
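The per-device check in the example above can be sketched as follows (a minimal illustration, not the project's actual code, assuming the free memory of each GPU on a node is already known in MiB):
```python
# A node passes the extender's filter only if at least one single GPU
# has enough free gpu-mem for the request.

def node_fits(per_gpu_free_mib, req_mib):
    """Return True if any single GPU can hold the requested memory."""
    return any(free >= req_mib for free in per_gpu_free_mib)

# N2: 8138 MiB free in total, but split across two GPUs -> rejected.
print(node_fits([4069, 4069], 8138))  # False
# N3: one GPU alone has 8138 MiB free -> accepted.
print(node_fits([8138, 0], 8138))     # True
```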
2.2. When the scheduler finds a node that satisfies the requirement, it delegates to the GPU Share Scheduler Extender to bind the pod to the node. Here, the extender needs to do two things:
- Find a GPU device on the node according to the binpack rule, record the chosen GPU device ID in the pod annotation `ALIYUN_GPU_ID`, and also record the requested GPU memory and the assumed timestamp in the pod annotations `ALIYUN_COM_GPU_MEM_POD` and `ALIYUN_COM_GPU_MEM_ASSUME_TIME`. If no suitable GPU is found at bind time, no binding is performed; the default scheduler will reschedule the pod after the expiration timeout.
> Notice: there is also a pod annotation named `ALIYUN_COM_GPU_MEM_ASSIGNED`, initialized to `false`. It indicates that a GPU device was assumed for the pod during scheduling but has not yet been assigned at runtime.
- Bind the pod to the node through the Kubernetes API.
For example, a user requests a pod with gpu-mem:8138 and node N1 is selected. The available resources of the GPUs are examined first: GPU0 (12207), GPU1 (8138), GPU2 (4069) and GPU3 (16276). GPU2's remaining 4069 MiB does not satisfy the request, so that device is discarded; among the three GPUs that do satisfy it, GPU1 (8138), which has the least remaining memory, is selected.
![](bind.jpg)
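The binpack selection described above can be sketched as follows (again an illustration, not the project's actual code):
```python
# Among the GPUs whose free memory satisfies the request, pick the one with
# the least free memory; return None if no single GPU fits.

def pick_gpu(per_gpu_free_mib, req_mib):
    candidates = [(free, gpu_id) for gpu_id, free in enumerate(per_gpu_free_mib)
                  if free >= req_mib]
    return min(candidates)[1] if candidates else None

# Free memory of GPU0..GPU3 from the example above; the request is 8138 MiB.
print(pick_gpu([12207, 8138, 4069, 16276], 8138))  # 1, i.e. GPU1
```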
#### 3\. Run the deployment on the node
The Kubelet calls the `Allocate` function of the GPU Share Device Plugin before creating the container (the parameter of `Allocate` is the requested amount of GPU memory):
3.1 Get from the Kubernetes API Server all pending GPU-share pods on this node that have a GPU memory request, ordered by assumed timestamp.
3.2 Choose the pod whose GPU memory request matches the amount passed to `Allocate`. If several pods have the same request, the one with the earliest assumed timestamp is chosen.
3.3 Mark the chosen pod's annotation `ALIYUN_COM_GPU_MEM_ASSIGNED` as `true`, indicating that the GPU device has been assigned to the container at runtime.
![](sequence.jpg)
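Steps 3.1 and 3.2 can be sketched as follows (a minimal illustration, not the project's actual code, assuming the pending pods and their gpu-mem requests have already been fetched from the API server):
```python
# Among pending GPU-share pods on this node whose gpu-mem request equals the
# requested amount, pick the one with the earliest assumed timestamp.

def choose_pod(pending_pods, req_mib):
    """pending_pods: list of dicts with 'name', 'gpu_mem', 'assumed_time'."""
    matching = [p for p in pending_pods if p["gpu_mem"] == req_mib]
    return min(matching, key=lambda p: p["assumed_time"]) if matching else None

pods = [
    {"name": "binpack-1-0", "gpu_mem": 8138, "assumed_time": 1724572000},
    {"name": "binpack-1-1", "gpu_mem": 8138, "assumed_time": 1724571900},  # assumed earlier
]
print(choose_pod(pods, 8138)["name"])  # binpack-1-1
```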

@@ -0,0 +1,138 @@
# Installation guide
## 0\. Prepare GPU Node
This guide assumes that the NVIDIA drivers and nvidia-docker2 have been installed.
Enable the NVIDIA runtime as the default runtime on your node. To do this, edit the Docker daemon config file, which is usually located at `/etc/docker/daemon.json`:
```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```
> *If `runtimes` is not already present, head to the installation page of [nvidia-docker](https://github.com/NVIDIA/nvidia-docker).*
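After editing `daemon.json`, restart the Docker daemon so the default-runtime change takes effect (assuming a systemd-managed node):
```bash
sudo systemctl restart docker
```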
## 1\. Deploy GPU share scheduler extender in control plane
```bash
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/gpushare-schd-extender.yaml
```
## 2\. Modify scheduler configuration
The goal is to include the scheduler policy/config file (`scheduler-policy-config.json`, or `scheduler-policy-config.yaml` for v1.23+) in the scheduler configuration (`/etc/kubernetes/manifests/kube-scheduler.yaml`).
> Notice: if your Kubernetes default scheduler is deployed as a static pod, don't edit the YAML file inside `/etc/kubernetes/manifests` directly. Edit a copy of it outside that directory, then copy the edited file back into `/etc/kubernetes/manifests/`; Kubernetes will update the default scheduler static pod automatically.
### 2.1 Kubernetes v1.23+
Since Kubernetes v1.23, [scheduling policies are no longer supported](https://kubernetes.io/docs/reference/scheduling/policies/); [scheduler configurations](https://kubernetes.io/docs/reference/scheduling/config/) should be used instead.
That means `scheduler-policy-config.yaml` needs to be included in the scheduler config (`/etc/kubernetes/manifests/kube-scheduler.yaml`).
Here is a sample of the final, modified [kube-scheduler.yaml](../config/kube-scheduler-v1.23+.yaml).
#### 2.1.1 Copy scheduler config file into /etc/kubernetes
```bash
cd /etc/kubernetes
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.yaml
```
#### 2.1.2 Add Policy config file parameter in scheduler arguments
```yaml
- --config=/etc/kubernetes/scheduler-policy-config.yaml
```
#### 2.1.3 Add volume mount into Pod Spec
```yaml
- mountPath: /etc/kubernetes/scheduler-policy-config.yaml
  name: scheduler-policy-config
  readOnly: true
```
```yaml
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.yaml
    type: FileOrCreate
  name: scheduler-policy-config
```
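For reference, `scheduler-policy-config.yaml` is a `KubeSchedulerConfiguration` that registers the GPU share extender. A simplified sketch is shown below; the exact `apiVersion`, extender address, and option set are assumptions, and the downloaded file is authoritative:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
- urlPrefix: "http://<gpushare-schd-extender-address>/gpushare-scheduler"
  filterVerb: filter
  bindVerb: bind
  nodeCacheCapable: true
  managedResources:
  - name: aliyun.com/gpu-mem
    ignoredByScheduler: false
```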
### 2.2 Before Kubernetes v1.23
Here is a sample of the final, modified [kube-scheduler.yaml](../config/kube-scheduler.yaml).
#### 2.2.1 Copy scheduler config file into /etc/kubernetes
```bash
cd /etc/kubernetes
curl -O https://raw.githubusercontent.com/AliyunContainerService/gpushare-scheduler-extender/master/config/scheduler-policy-config.json
```
#### 2.2.2 Add Policy config file parameter in scheduler arguments
```yaml
- --policy-config-file=/etc/kubernetes/scheduler-policy-config.json
```
#### 2.2.3 Add volume mount into Pod Spec
```yaml
- mountPath: /etc/kubernetes/scheduler-policy-config.json
  name: scheduler-policy-config
  readOnly: true
```
```yaml
- hostPath:
    path: /etc/kubernetes/scheduler-policy-config.json
    type: FileOrCreate
  name: scheduler-policy-config
```
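Similarly, `scheduler-policy-config.json` is a legacy scheduler `Policy` that registers the extender. A simplified sketch follows; the extender address is a placeholder and the downloaded file is authoritative:
```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://<gpushare-schd-extender-address>/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        {
          "name": "aliyun.com/gpu-mem",
          "ignoredByScheduler": false
        }
      ]
    }
  ]
}
```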
## 3\. Deploy Device Plugin
```bash
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-rbac.yaml
kubectl create -f https://raw.githubusercontent.com/AliyunContainerService/gpushare-device-plugin/master/device-plugin-ds.yaml
```
> Notice: please remove the default GPU device plugin first. For example, if you are using the [nvidia-device-plugin](https://github.com/NVIDIA/k8s-device-plugin/blob/v1.11/nvidia-device-plugin.yml), you can run `kubectl delete ds -n kube-system nvidia-device-plugin-daemonset` to delete it.
## 4\. Add gpushare node labels to the nodes requiring GPU sharing
You need to add the label `gpushare=true` to every node on which you want to run the device plugin, because the device plugin is deployed as a DaemonSet.
```bash
kubectl label node <target_node> gpushare=true
```
For example:
```bash
kubectl label node mynode gpushare=true
```
## 5\. Install Kubectl extension
### 5.1 Install kubectl 1.12 or above
You can download and install `kubectl` for Linux:
```bash
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.12.1/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/bin/kubectl
```
### 5.2 Download and install the kubectl extension
```bash
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
```
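To verify that `kubectl` discovers the extension, run the new subcommand described in the User Guide:
```bash
kubectl inspect gpushare
```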

@@ -0,0 +1,7 @@
## Problem Determination
1. If you cannot find any gpushare node through `kubectl inspect gpushare`:
1.1 Check that the device plugin pod is running on the node: `kubectl get po -n kube-system -o=wide | grep gpushare-device`
1.2 Check its logs: `kubectl logs -n kube-system <pod_name>`
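If the device plugin pods look healthy, the scheduler extender logs can also help; the deployment name below assumes the default `gpushare-schd-extender` manifest from step 1 of the installation guide:
```bash
kubectl get po -n kube-system -o wide | grep gpushare-schd-extender
kubectl logs -n kube-system <gpushare_schd_extender_pod_name>
```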

@@ -0,0 +1,79 @@
# User Guide
> Notice: the GPU sharing capability described here is only a scheduling mechanism. It guarantees that devices cannot be "oversubscribed" at the scheduling level, but it cannot in any measure enforce that at the runtime level. For now, you have to take care of isolation yourself.
1. Query the allocation status of the shared GPU
```bash
# kubectl inspect gpushare
NAME                                IPADDRESS     GPU0(Allocated/Total)  GPU Memory(GiB)
cn-shanghai.i-uf61h64dz1tmlob9hmtb  192.168.0.71  6/15                   6/15
cn-shanghai.i-uf61h64dz1tmlob9hmtc  192.168.0.70  3/15                   3/15
------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
9/30 (30%)
```
> For more details, please run `kubectl inspect gpushare -d`
2. To request GPU sharing, you just need to specify `aliyun.com/gpu-mem`
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 3
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 3
```
> Notice: in this example each GPU has 15 GiB of memory, so `aliyun.com/gpu-mem: 3` requests one fifth of a GPU.
3\. Using the following environment variables, the application can limit its GPU usage through the CUDA API or a framework API such as TensorFlow:
```bash
# The total amount of GPU memory on the current device (GiB)
ALIYUN_COM_GPU_MEM_DEV=15
# The GPU Memory of the container (GiB)
ALIYUN_COM_GPU_MEM_CONTAINER=3
```
Limit GPU memory by setting a memory fraction through the TensorFlow API:
```python
import tensorflow as tf

# 3 and 15 correspond to ALIYUN_COM_GPU_MEM_CONTAINER and ALIYUN_COM_GPU_MEM_DEV above;
# the 0.7 factor leaves headroom because TensorFlow's memory accounting is not exact.
fraction = round(3 * 0.7 / 15, 1)
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = fraction
sess = tf.Session(config=config)
c = tf.constant("hello gpushare")  # any op
# Runs the op.
while True:
    sess.run(c)
```
> The 0.7 factor is used because TensorFlow's control over GPU memory is not exact; multiplying by 0.7 helps ensure that the upper limit is not exceeded.