Kubernetes derives a Pod's QoS class from the resource request and limit values of the Pod's containers.
For each resource, containers fall into one of three QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority.
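If limit and request are equal and non-zero for every resource of every container in the Pod, the Pod's QoS class is Guaranteed. Note that if a container specifies only a limit and no request, the request is taken to be equal to the limit.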
Examples:
containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi

containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 100m
        memory: 100Mi
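If no container in the Pod specifies a request or limit for any resource, the Pod's QoS class is Best-Effort.
Examples:
containers:
  - name: foo
    resources:
  - name: bar
    resources: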
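A Pod that is neither Guaranteed nor Best-Effort is Burstable: at least one container sets a request or limit, but requests and limits are not set and equal for every resource of every container. When a limit is not specified, its effective value is the capacity of the corresponding resource on the node.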
Examples:
Container bar specifies no resources.
containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
Containers foo and bar specify limits for different resources.
containers:
  - name: foo
    resources:
      limits:
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
Container foo specifies no limit; container bar specifies neither request nor limit.
containers:
  - name: foo
    resources:
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
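The classification rules above can be sketched in a few lines of Go. This is a simplified illustration, not the kubelet's implementation: the types `Resources` and `Container` and the function `classify` are hypothetical, quantities are plain integers (millicores and bytes) instead of Kubernetes quantity types, and only cpu and memory are considered. Unlike the real code, it compares requests and limits per container instead of summing them across the Pod.

```go
package main

import "fmt"

// Resources maps a resource name ("cpu", "memory") to a quantity.
// Hypothetical simplification: cpu in millicores, memory in bytes.
type Resources map[string]int64

// Container holds only the fields relevant to QoS classification.
type Container struct {
	Name     string
	Requests Resources
	Limits   Resources
}

// qosResources lists the resources this sketch takes into account.
var qosResources = []string{"cpu", "memory"}

// classify returns "Guaranteed", "Burstable" or "BestEffort" for a Pod
// described by its containers.
func classify(containers []Container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		for _, name := range qosResources {
			limit := c.Limits[name] // 0 when the limit is not set
			request, hasRequest := c.Requests[name]
			if !hasRequest {
				// A missing request is taken to be equal to the limit.
				request = limit
			}
			if limit > 0 || request > 0 {
				anySet = true
			}
			if limit == 0 || request != limit {
				guaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	const Mi, Gi = int64(1) << 20, int64(1) << 30
	limitsOnly := []Container{
		{Name: "foo", Limits: Resources{"cpu": 10, "memory": 1 * Gi}},
		{Name: "bar", Limits: Resources{"cpu": 100, "memory": 100 * Mi}},
	}
	empty := []Container{{Name: "foo"}, {Name: "bar"}}
	mixed := []Container{
		{Name: "foo", Limits: Resources{"memory": 1 * Gi}},
		{Name: "bar", Limits: Resources{"cpu": 100}},
	}
	fmt.Println(classify(limitsOnly)) // Guaranteed
	fmt.Println(classify(empty))      // BestEffort
	fmt.Println(classify(mixed))      // Burstable
}
```

The three inputs in `main` correspond to the limits-only, empty-resources, and different-resources examples shown earlier.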
When scheduling, kube-scheduler selects a node based on the Pod's request values. A Pod and its containers are not allowed to consume more than the effective limit, if one is specified.
How the request and limit are enforced depends on whether the resource is compressible or incompressible.
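As a rough illustration of placement by request, the sketch below checks whether a Pod fits on a node by comparing the requests already placed there, plus the new Pod's requests, against the node's capacity. Limits play no part in the check. The names `Request`, `Node` and `fits` and the integer units are hypothetical; the real scheduler predicate considers more resources and conditions.

```go
package main

import "fmt"

// Request is a Pod's total resource request (hypothetical simplified type).
type Request struct {
	MilliCPU int64 // cpu in millicores
	Memory   int64 // memory in bytes
}

// Node tracks its capacity and the requests already scheduled onto it.
type Node struct {
	Name      string
	Capacity  Request
	Requested Request
}

// fits reports whether the Pod's requests fit into the node's remaining
// capacity. Limits are deliberately ignored.
func fits(n Node, pod Request) bool {
	return n.Requested.MilliCPU+pod.MilliCPU <= n.Capacity.MilliCPU &&
		n.Requested.Memory+pod.Memory <= n.Capacity.Memory
}

func main() {
	const Gi = int64(1) << 30
	node := Node{
		Name:      "node-1",
		Capacity:  Request{MilliCPU: 4000, Memory: 8 * Gi},
		Requested: Request{MilliCPU: 3500, Memory: 2 * Gi},
	}
	small := Request{MilliCPU: 500, Memory: 1 * Gi}
	large := Request{MilliCPU: 600, Memory: 1 * Gi}
	fmt.Println(fits(node, small)) // true
	fmt.Println(fits(node, large)) // false
}
```

Because only requests are summed, a node can hold Pods whose limits together exceed its capacity, which is where the QoS classes come into play.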
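Pod OOM score configuration
Best-effort
  OOM_SCORE_ADJ: 1000
Guaranteed
  OOM_SCORE_ADJ: -998
Burstable
  OOM_SCORE_ADJ is set to 1000 - 10 * (% of memory requested), with a lower bound of 2 so that it stays above the Guaranteed value.
  If the memory request is 0, OOM_SCORE_ADJ is set to 999.
  If a process in a Burstable container uses more memory than the container requested, its OOM_SCORE will be 1000; if not, its OOM_SCORE will be < 1000.
Pod infra containers or Special Pod init process
  OOM_SCORE_ADJ: -998
Kubelet, Docker
  OOM_SCORE_ADJ: -999 (won't be OOM killed)
The QoS source code lives in pkg/kubelet/qos. It is very simple and consists mainly of two files: pkg/kubelet/qos/policy.go and pkg/kubelet/qos/qos.go.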
The OOM_SCORE_ADJ values for the QoS classes discussed above are defined in:
pkg/kubelet/qos/policy.go:21
const (
	PodInfraOOMAdj        int = -998
	KubeletOOMScoreAdj    int = -999
	DockerOOMScoreAdj     int = -999
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)
The calculation of a container's OOM_SCORE_ADJ is defined in:
pkg/kubelet/qos/policy.go:40
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	switch GetPodQOS(pod) {
	case Guaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case BestEffort:
		return besteffortOOMScoreAdj
	}

	// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
	// we want to protect Burstable containers that consume less memory than requested.
	// The formula below is a heuristic. A container requesting for 10% of a system's
	// memory will have an OOM score adjust of 900. If a process in container Y
	// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
	// which use more than their request will have an OOM score of 1000 and will be prime
	// targets for OOM kills.
	// Note that this is a heuristic, it won't work if a container has many small processes.
	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
	// that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}
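To make the Burstable arithmetic concrete, the sketch below re-implements only the Burstable branch of the function above with plain integers, so it runs without any Kubernetes packages. The function name `burstableOOMScoreAdj` and the 8 GiB node capacity are assumptions chosen for illustration.

```go
package main

import "fmt"

const (
	guaranteedOOMScoreAdj = -998
	besteffortOOMScoreAdj = 1000
)

// burstableOOMScoreAdj mirrors the Burstable branch of
// GetContainerOOMScoreAdjust, taking byte counts instead of API types.
func burstableOOMScoreAdj(memoryRequest, memoryCapacity int64) int {
	adj := 1000 - (1000*memoryRequest)/memoryCapacity
	// Stay above the score a Guaranteed container can reach.
	if int(adj) < 1000+guaranteedOOMScoreAdj {
		return 1000 + guaranteedOOMScoreAdj
	}
	// Stay below the Best-Effort value.
	if int(adj) == besteffortOOMScoreAdj {
		return int(adj - 1)
	}
	return int(adj)
}

func main() {
	const Gi = int64(1) << 30
	capacity := 8 * Gi // assumed node memory capacity
	fmt.Println(burstableOOMScoreAdj(1*Gi, capacity)) // 875
	fmt.Println(burstableOOMScoreAdj(0, capacity))    // 999
	fmt.Println(burstableOOMScoreAdj(8*Gi, capacity)) // 2
}
```

A request of 1 GiB out of 8 GiB is 12.5% of memory, giving 1000 - 125 = 875. A zero request would give 1000 and is lowered to 999, and a request for all of the node's memory would give 0 and is raised to 2.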
The function that determines a Pod's QoS class is:
pkg/kubelet/qos/qos.go:50
// GetPodQOS returns the QoS class of a pod.
// A pod is besteffort if none of its containers have specified any requests or limits.
// A pod is guaranteed only when requests and limits are specified for all the containers and they are equal.
// A pod is burstable if limits and requests do not match across all containers.
func GetPodQOS(pod *v1.Pod) QOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	for _, container := range pod.Spec.Containers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.Copy()
				if _, exists := requests[name]; !exists {
					requests[name] = *delta
				} else {
					delta.Add(requests[name])
					requests[name] = *delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.Copy()
				if _, exists := limits[name]; !exists {
					limits[name] = *delta
				} else {
					delta.Add(limits[name])
					limits[name] = *delta
				}
			}
		}

		if len(qosLimitsFound) != len(supportedQoSComputeResources) {
			isGuaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return BestEffort
	}
	// Check is requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed &&
		len(requests) == len(limits) {
		return Guaranteed
	}
	return Burstable
}
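The Pod QoS class is consulted by eviction_manager and in the scheduler's Predicates phase. In other words, Kubernetes uses it when handling resource overcommitment and during the pre-selection (predicates) stage of scheduling.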