在之前文章中,kube-schedule原理,当中我们说到了k8s原始的调度,有一些不合理性,当时也介绍了一些优先级调度以及自定义调度,下面主要说下这个开源的二次调度工具Descheduler。
该策略确保只有一个Pod与同一节点上运行的副本集(RS),Replication Controller(RC),deployment或者job关联。
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemoveDuplicates":
enabled: true
params:
removeDuplicates:
excludeOwnerKinds:
- "ReplicaSet"
该策略发现未充分利用的节点,并且如果可能的话,从其他节点驱逐pod,希望在这些未充分利用的节点上安排被驱逐的pod的重新创建。此策略的参数配置在 nodeResourceUtilizationThresholds
。
节点的利用率低是由可配置的阈值决定的 thresholds
。thresholds
可以按百分比为cpu,内存和pod数量配置阈值 。如果节点的使用率低于所有(cpu,内存和pod数)的阈值,则该节点被视为未充分利用。目前,pods的请求资源需求被考虑用于计算节点资源利用率。
还有另一个可配置的阈值,targetThresholds
用于计算可以驱逐pod的潜在节点。任何节点,所述阈值之间,thresholds
并且 targetThresholds
被视为适当地利用,并且不考虑驱逐。阈值 targetThresholds
也可以按百分比配置为cpu,内存和pod数量。
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"cpu" : 20
"memory": 20
"pods": 20
targetThresholds:
"cpu" : 50
"memory": 50
"pods": 50
该策略可确保从节点中删除违反Interpod反亲和关系的pod。例如,如果某个节点上有podA
,并且podB
和podC
(在同一节点上运行)具有禁止它们在同一节点上运行的反亲和规则,则podA
将被从该节点逐出,以便podB
和podC
正常运行。当 podB
和 podC
已经运行在节点上后,反亲和性规则被创建就会发送这样的问题。目前,没有与该策略关联的参数。要禁用此策略,策略应如下所示:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingInterPodAntiAffinity":
enabled: false
启用后,该策略requiredDuringSchedulingRequiredDuringExecution
将用作kubelet 的临时实现并逐出该kubelet,不再考虑节点亲和力。
例如,在nodeA上调度了podA,该podA满足了调度时的节点亲缘性规则requiredDuringSchedulingIgnoredDuringExecution
。随着时间的流逝,nodeA停止满足该规则。当执行该策略并且有另一个可用的节点满足该节点相似性规则时,podA被从nodeA中逐出。
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingNodeAffinity":
enabled: true
params:
nodeAffinityType:
- "requiredDuringSchedulingIgnoredDuringExecution"
该策略可以确保从节点中删除违反 NoSchedule
污点的 Pod
。例如,有一个名为 podA
的 Pod
,通过配置容忍 key=value:NoSchedule
允许被调度到有该污点配置的节点上,如果节点的污点随后被更新或者删除了,则污点将不再被 Pod
的容忍满足,然后将被驱逐
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsViolatingNodeTaints":
enabled: true
此策略确保从节点中删除重启次数过多的Pod 。
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemovePodsHavingTooManyRestarts":
enabled: true
params:
podsHavingTooManyRestarts:
podRestartThreshold: 100
includingInitContainers: true
此策略逐出比.strategies.PodLifeTime.params.maxPodLifeTimeSeconds
该策略文件更旧的pod。
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"PodLifeTime":
enabled: true
params:
maxPodLifeTimeSeconds: 86400
Descheduler | Supported Kubernetes Version |
---|---|
v0.18 | v1.18 |
v0.10 | v1.17 |
v0.4-v0.9 | v1.9+ |
v0.1-v0.3 | v1.7-v1.8 |
「注意」:由于生产集群一般都是1.17以前的版本,故本实例是Descheduler0.9版本。
创建角色与账户
[root@master01 kubernetes]# cat rbac.yaml
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: descheduler-cluster-role
namespace: kube-system
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "watch", "list"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
resources: ["pods/eviction"]
verbs: ["create"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: descheduler-sa
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: descehduler-cluster-role-binding
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: descheduler-cluster-role
subjects:
- name: descheduler-sa
kind: ServiceAccount
namespace: kube-system
创建configmap
[root@master01 kubernetes]# cat configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: descheduler-policy-configmap
namespace: kube-system
data:
policy.yaml: |
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemoveDuplicates":
enabled: true
"RemovePodsViolatingInterPodAntiAffinity":
enabled: true
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"pods": 5
targetThresholds:
"pods": 10
任务
apiVersion: batch/v1
kind: Job
metadata:
name: descheduler-job
namespace: kube-system
spec:
parallelism: 1
completions: 1
template:
metadata:
name: descheduler-pod
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ""
spec:
containers:
- name: descheduler
image: aveshagarwal/descheduler:0.9.0
volumeMounts:
- mountPath: /policy-dir
name: policy-volume
command:
- "/bin/descheduler"
args:
- "--policy-config-file"
- "/policy-dir/policy.yaml"
- "--v"
- "3"
restartPolicy: "Never"
serviceAccountName: descheduler-sa
volumes:
- name: policy-volume
configMap:
name: descheduler-policy-configmap
https://github.com/kubernetes-sigs/descheduler