Author: xidianwangtao@gmail.com
Starting with version 1.9, Kubernetes implements a scheduling queue based on Pod priority. It both lets high-priority Pods be scheduled first and mitigates the potential high-priority Pod starvation problem in preemptive scheduling. As of Kubernetes 1.10, the PodPriority Feature Gate is still Alpha. This post digs into the PriorityQueue at the source level: its two internal sub-queues, when and how each of them is operated on, and some problems that may remain in the current implementation.
Kubernetes 1.8 introduced preemptive scheduling based on Pod priority, which I analyzed in depth in my earlier posts on priority-based preemptive scheduling in Kubernetes 1.8 and on the Kubernetes 1.8 Preemption source code. That alone was not enough: the scheduling queue at the time was FIFO only, with no notion of priority. After a high-priority Pod preempted lower-priority Pods it had to re-enter the FIFO queue, and the freed resources were frequently taken by lower-priority Pods queued ahead of it, starving the high-priority Pod. To mitigate this, Kubernetes 1.9 adds a priority-aware scheduling queue, the PriorityQueue, which likewise requires the PodPriority Feature Gate to be enabled.
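For reference, enabling the alpha gate means passing the standard --feature-gates flag to the relevant components (shown for illustration; adjust to your deployment):
kube-apiserver --feature-gates=PodPriority=true ...
kube-scheduler --feature-gates=PodPriority=true ...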
Let's start with the definition of the PriorityQueue struct.
type PriorityQueue struct {
	lock sync.RWMutex
	cond sync.Cond // Pop blocks on cond; Broadcast is called whenever activeQ gains pods.

	// activeQ is a Heap of pods waiting to be scheduled, ordered by priority.
	activeQ *Heap
	// unschedulableQ holds pods that have been tried and failed scheduling.
	unschedulableQ *UnschedulablePodsMap
	// nominatedPods is keyed by node name; the value is the list of pods
	// nominated to run on that node.
	nominatedPods map[string][]*v1.Pod
	// receivedMoveRequest is set to true whenever pods are moved from
	// unschedulableQ to activeQ.
	receivedMoveRequest bool
}
pod.Name + "_" + pod.Namespace
,value为那些已经尝试调度并且调度失败的UnSchedulable的Pod Object。NominatedNodeName
Annotation,表示经过抢占调度的逻辑后,该Pod希望能调度到NominatedNodeName
这个Node上,调度时会考虑这个,防止高优先级的Pods进行抢占调度释放了低优先级Pods到它被再次调度这个时间段内,抢占的资源又被低优先级的Pods占用了。关于scheduler怎么处理Nominated Pods,我后续会单独写篇博客来分析。active是真正实现优先级调度的Heap,我们继续看看这个Heap的实现。
type Heap struct {
data *heapData
}
type heapData struct {
	// items maps an object's key to the object and its index in queue.
	items map[string]*heapItem
	// queue holds the keys of the stored objects in heap order.
	queue []string
	// keyFunc generates the key for an inserted object; it must be deterministic.
	keyFunc KeyFunc
	// lessFunc compares two objects and decides their order in the heap.
	lessFunc LessFunc
}
type heapItem struct {
obj interface{} // The object which is stored in the heap.
index int // The index of the object's key in the Heap.queue.
}
heapData is the structure inside activeQ that actually stores the items: items maps each key to its heapItem, while queue keeps the keys in heap order.
When the scheduler config factory is created, NewSchedulingQueue is registered as the function that creates the podQueue. NewSchedulingQueue checks whether the PodPriority Feature Gate is enabled (disabled by default as of Kubernetes 1.10). If it is, NewPriorityQueue is invoked to create a PriorityQueue that manages unscheduled Pods; otherwise the familiar FIFO queue is used.
func NewSchedulingQueue() SchedulingQueue {
if util.PodPriorityEnabled() {
return NewPriorityQueue()
}
return NewFIFO()
}
NewPriorityQueue initializes the priority queue as follows.
// NewPriorityQueue creates a PriorityQueue object.
func NewPriorityQueue() *PriorityQueue {
pq := &PriorityQueue{
activeQ: newHeap(cache.MetaNamespaceKeyFunc, util.HigherPriorityPod),
unschedulableQ: newUnschedulablePodsMap(),
nominatedPods: map[string][]*v1.Pod{},
}
pq.cond.L = &pq.lock
return pq
}
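Binding pq.cond.L to pq.lock matters because PriorityQueue.Pop blocks on the condition variable until activeQ is non-empty, and every code path that adds to activeQ calls cond.Broadcast(). The 1.10 implementation is roughly the following (quoted from memory, so verify it against your checkout):
// Pop removes the head of activeQ and returns it, blocking until activeQ has
// at least one pod. It also resets receivedMoveRequest.
func (p *PriorityQueue) Pop() (*v1.Pod, error) {
	p.lock.Lock()
	defer p.lock.Unlock()
	for len(p.activeQ.data.queue) == 0 {
		p.cond.Wait()
	}
	obj, err := p.activeQ.Pop()
	if err != nil {
		return nil, err
	}
	p.receivedMoveRequest = false
	return obj.(*v1.Pod), nil
}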
When newHeap builds activeQ it takes two arguments. The first is the keyFunc, MetaNamespaceKeyFunc.
func MetaNamespaceKeyFunc(obj interface{}) (string, error) {
if key, ok := obj.(ExplicitKey); ok {
return string(key), nil
}
meta, err := meta.Accessor(obj)
if err != nil {
return "", fmt.Errorf("object has no meta: %v", err)
}
if len(meta.GetNamespace()) > 0 {
return meta.GetNamespace() + "/" + meta.GetName(), nil
}
return meta.GetName(), nil
}
The second argument passed to newHeap is the lessFunc, HigherPriorityPod.
const (
DefaultPriorityWhenNoDefaultClassExists = 0
)
func HigherPriorityPod(pod1, pod2 interface{}) bool {
return GetPodPriority(pod1.(*v1.Pod)) > GetPodPriority(pod2.(*v1.Pod))
}
func GetPodPriority(pod *v1.Pod) int32 {
if pod.Spec.Priority != nil {
return *pod.Spec.Priority
}
return scheduling.DefaultPriorityWhenNoDefaultClassExists
}
Note: if pod.Spec.Priority is nil (meaning no global default PriorityClass object existed in the cluster when this Pod was created), the scheduler does not backfill the value of whatever global default PriorityClass exists now into pod.Spec.Priority; it simply treats the priority as 0. Personally, I think falling back to the default value is the reasonable choice.
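A minimal illustration with hand-built pod objects (assumes the two functions above are in scope; not from the source):
p := int32(1000)
high := &v1.Pod{Spec: v1.PodSpec{Priority: &p}}
unset := &v1.Pod{} // Spec.Priority is nil, e.g. created before any default PriorityClass existed

GetPodPriority(high)           // 1000
GetPodPriority(unset)          // 0 (DefaultPriorityWhenNoDefaultClassExists)
HigherPriorityPod(high, unset) // true: high sorts ahead of unset in activeQ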
unschedulableQ is built by newUnschedulablePodsMap, which initializes the UnschedulablePodsMap's pods map and registers the keyFunc for that map.
func newUnschedulablePodsMap() *UnschedulablePodsMap {
return &UnschedulablePodsMap{
pods: make(map[string]*v1.Pod),
keyFunc: util.GetPodFullName,
}
}
func GetPodFullName(pod *v1.Pod) string {
return pod.Name + "_" + pod.Namespace
}
Note: the key generated by unschedulableQ's keyFunc has the form pod.Name + "_" + pod.Namespace, which differs from activeQ's keyFunc (whose format is meta.GetNamespace() + "/" + meta.GetName()). I don't see why two different formats are needed; standardizing on activeQ's keyFunc would be cleaner.
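To make the difference concrete, the same pod yields two different keys (illustration only):
pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "nginx", Namespace: "default"}}

GetPodFullName(pod)             // "nginx_default"      (unschedulableQ key)
cache.MetaNamespaceKeyFunc(pod) // "default/nginx", nil (activeQ key)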
Now that we understand PriorityQueue's structure, the next question is how objects are added to the priority heap (activeQ).
func (h *Heap) Add(obj interface{}) error {
	key, err := h.data.keyFunc(obj)
	if err != nil {
		return cache.KeyError{Obj: obj, Err: err}
	}
	if _, exists := h.data.items[key]; exists {
		// Key already present: overwrite the object and re-establish the
		// heap invariant at its current index.
		h.data.items[key].obj = obj
		heap.Fix(h.data, h.data.items[key].index)
	} else {
		// New key: push it and let up() sift it into place.
		heap.Push(h.data, &itemKeyValue{key, obj})
	}
	return nil
}
func Push(h Interface, x interface{}) {
h.Push(x)
up(h, h.Len()-1)
}
func up(h Interface, j int) {
for {
i := (j - 1) / 2 // parent
if i == j || !h.Less(j, i) {
break
}
h.Swap(i, j)
j = i
}
}
func (h *heapData) Less(i, j int) bool {
if i > len(h.queue) || j > len(h.queue) {
return false
}
itemi, ok := h.items[h.queue[i]]
if !ok {
return false
}
itemj, ok := h.items[h.queue[j]]
if !ok {
return false
}
return h.lessFunc(itemi.obj, itemj.obj)
}
When PriorityQueue hands out the next Pod to schedule, Pop removes a Pod from activeQ: the first element of the heap, i.e. the highest-priority Pod.
func (h *Heap) Pop() (interface{}, error) {
obj := heap.Pop(h.data)
if obj != nil {
return obj, nil
}
return nil, fmt.Errorf("object was removed from heap data")
}
func Pop(h Interface) interface{} {
n := h.Len() - 1
h.Swap(0, n)
down(h, 0, n)
return h.Pop()
}
func down(h Interface, i, n int) {
for {
j1 := 2*i + 1
if j1 >= n || j1 < 0 { // j1 < 0 after int overflow
break
}
j := j1 // left child
if j2 := j1 + 1; j2 < n && !h.Less(j1, j2) {
j = j2 // = 2*i + 2 // right child
}
if !h.Less(j, i) {
break
}
h.Swap(i, j)
i = j
}
}
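To see up() and down() in action, here is a minimal, self-contained max-heap sketch that mirrors activeQ's ordering; plain structs stand in for *v1.Pod, so this is illustrative, not scheduler code:
package main

import (
	"container/heap"
	"fmt"
)

type fakePod struct {
	name     string
	priority int32
}

// podHeap implements heap.Interface. Less orders by higher priority first,
// mirroring HigherPriorityPod used as activeQ's lessFunc.
type podHeap []fakePod

func (h podHeap) Len() int            { return len(h) }
func (h podHeap) Less(i, j int) bool  { return h[i].priority > h[j].priority }
func (h podHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(fakePod)) }
func (h *podHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

func main() {
	h := &podHeap{}
	heap.Push(h, fakePod{"low", 0})
	heap.Push(h, fakePod{"high", 1000}) // up() sifts this to the root
	heap.Push(h, fakePod{"mid", 500})
	for h.Len() > 0 {
		fmt.Println(heap.Pop(h).(fakePod).name) // prints: high, mid, low
	}
}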
Having covered the PriorityQueue and how Pods move in and out of the Heap, let's go back to the Scheduler Config Factory and look at how the EventHandlers registered on podInformer, nodeInformer, serviceInformer, pvcInformer, and so on operate on the PriorityQueue.
func NewConfigFactory(...) scheduler.Configurator {
...
// scheduled pod cache
podInformer.Informer().AddEventHandler(
cache.FilteringResourceEventHandler{
FilterFunc: func(obj interface{}) bool {
switch t := obj.(type) {
case *v1.Pod:
return assignedNonTerminatedPod(t)
case cache.DeletedFinalStateUnknown:
if pod, ok := t.Obj.(*v1.Pod); ok {
return assignedNonTerminatedPod(pod)
}
runtime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, c))
return false
default:
runtime.HandleError(fmt.Errorf("unable to handle object in %T: %T", c, obj))
return false
}
},
Handler: cache.ResourceEventHandlerFuncs{
AddFunc: c.addPodToCache,
UpdateFunc: c.updatePodInCache,
DeleteFunc: c.deletePodFromCache,
},
},
)
// unscheduled pod queue
podInformer.Informer().AddEventHandler(
cache.FilteringResourceEventHandler{
FilterFunc: func(obj interface{}) bool {
switch t := obj.(type) {
case *v1.Pod:
return unassignedNonTerminatedPod(t)
case cache.DeletedFinalStateUnknown:
if pod, ok := t.Obj.(*v1.Pod); ok {
return unassignedNonTerminatedPod(pod)
}
runtime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, c))
return false
default:
runtime.HandleError(fmt.Errorf("unable to handle object in %T: %T", c, obj))
return false
}
},
Handler: cache.ResourceEventHandlerFuncs{
AddFunc: c.addPodToSchedulingQueue,
UpdateFunc: c.updatePodInSchedulingQueue,
DeleteFunc: c.deletePodFromSchedulingQueue,
},
},
)
// ScheduledPodLister is something we provide to plug-in functions that
// they may need to call.
c.scheduledPodLister = assignedPodLister{podInformer.Lister()}
nodeInformer.Informer().AddEventHandler(
cache.ResourceEventHandlerFuncs{
AddFunc: c.addNodeToCache,
UpdateFunc: c.updateNodeInCache,
DeleteFunc: c.deleteNodeFromCache,
},
)
c.nodeLister = nodeInformer.Lister()
...
// This is for MaxPDVolumeCountPredicate: add/delete PVC will affect counts of PV when it is bound.
pvcInformer.Informer().AddEventHandler(
cache.ResourceEventHandlerFuncs{
AddFunc: c.onPvcAdd,
UpdateFunc: c.onPvcUpdate,
DeleteFunc: c.onPvcDelete,
},
)
c.pVCLister = pvcInformer.Lister()
// This is for ServiceAffinity: affected by the selector of the service is updated.
// Also, if new service is added, equivalence cache will also become invalid since
// existing pods may be "captured" by this service and change this predicate result.
serviceInformer.Informer().AddEventHandler(
cache.ResourceEventHandlerFuncs{
AddFunc: c.onServiceAdd,
UpdateFunc: c.onServiceUpdate,
DeleteFunc: c.onServiceDelete,
},
)
c.serviceLister = serviceInformer.Lister()
...
}
The assignedNonTerminatedPod FilterFunc selects Pods that are already scheduled and not terminated, and Add/Update/Delete event handlers are registered for them; here we only care about the operations these handlers perform on the PriorityQueue.
// assignedNonTerminatedPod selects pods that are assigned and non-terminal (scheduled and running).
func assignedNonTerminatedPod(pod *v1.Pod) bool {
if len(pod.Spec.NodeName) == 0 {
return false
}
if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
return false
}
return true
}
The Add handler registered for assignedNonTerminatedPod is addPodToCache.
func (c *configFactory) addPodToCache(obj interface{}) {
...
c.podQueue.AssignedPodAdded(pod)
}
// AssignedPodAdded is called when a bound pod is added. Creation of this pod
// may make pending pods with matching affinity terms schedulable.
func (p *PriorityQueue) AssignedPodAdded(pod *v1.Pod) {
p.movePodsToActiveQueue(p.getUnschedulablePodsWithMatchingAffinityTerm(pod))
}
func (p *PriorityQueue) movePodsToActiveQueue(pods []*v1.Pod) {
p.lock.Lock()
defer p.lock.Unlock()
for _, pod := range pods {
if err := p.activeQ.Add(pod); err == nil {
p.unschedulableQ.delete(pod)
} else {
glog.Errorf("Error adding pod %v to the scheduling queue: %v", pod.Name, err)
}
}
p.receivedMoveRequest = true
p.cond.Broadcast()
}
// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*v1.Pod {
p.lock.RLock()
defer p.lock.RUnlock()
var podsToMove []*v1.Pod
for _, up := range p.unschedulableQ.pods {
affinity := up.Spec.Affinity
if affinity != nil && affinity.PodAffinity != nil {
terms := predicates.GetPodAffinityTerms(affinity.PodAffinity)
for _, term := range terms {
namespaces := priorityutil.GetNamespacesFromPodAffinityTerm(up, &term)
selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
if err != nil {
glog.Errorf("Error getting label selectors for pod: %v.", up.Name)
}
if priorityutil.PodMatchesTermsNamespaceAndSelector(pod, namespaces, selector) {
podsToMove = append(podsToMove, up)
break
}
}
}
}
return podsToMove
}
When scheduling a Pod fails, AddUnschedulableIfNotPresent is called. Only when no move request has been received (receivedMoveRequest is false) and the Pod's PodScheduled condition is False with reason Unschedulable is the Pod added/updated into unschedulableQ; otherwise it is added to activeQ.
func (p *PriorityQueue) AddUnschedulableIfNotPresent(pod *v1.Pod) error {
p.lock.Lock()
defer p.lock.Unlock()
if p.unschedulableQ.get(pod) != nil {
return fmt.Errorf("pod is already present in unschedulableQ")
}
if _, exists, _ := p.activeQ.Get(pod); exists {
return fmt.Errorf("pod is already present in the activeQ")
}
if !p.receivedMoveRequest && isPodUnschedulable(pod) {
p.unschedulableQ.addOrUpdate(pod)
p.addNominatedPodIfNeeded(pod)
return nil
}
err := p.activeQ.Add(pod)
if err == nil {
p.addNominatedPodIfNeeded(pod)
p.cond.Broadcast()
}
return err
}
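For reference, the isPodUnschedulable helper used above inspects the PodScheduled condition; in the 1.10 tree it looks roughly like this (quoted from memory, verify against your checkout):
func isPodUnschedulable(pod *v1.Pod) bool {
	_, cond := podutil.GetPodCondition(&pod.Status, v1.PodScheduled)
	return cond != nil && cond.Status == v1.ConditionFalse && cond.Reason == v1.PodReasonUnschedulable
}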
The Update handler registered for assignedNonTerminatedPod is updatePodInCache.
func (c *configFactory) updatePodInCache(oldObj, newObj interface{}) {
...
c.podQueue.AssignedPodUpdated(newPod)
}
// AssignedPodUpdated is called when a bound pod is updated. Change of labels
// may make pending pods with matching affinity terms schedulable.
func (p *PriorityQueue) AssignedPodUpdated(pod *v1.Pod) {
p.movePodsToActiveQueue(p.getUnschedulablePodsWithMatchingAffinityTerm(pod))
}
updatePodInCache operates on the podQueue via AssignedPodUpdated, whose implementation is identical to AssignedPodAdded, so nothing more needs to be said.
The Delete handler registered for assignedNonTerminatedPod is deletePodFromCache.
func (c *configFactory) deletePodFromCache(obj interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
func (p *PriorityQueue) MoveAllToActiveQueue() {
p.lock.Lock()
defer p.lock.Unlock()
for _, pod := range p.unschedulableQ.pods {
if err := p.activeQ.Add(pod); err != nil {
glog.Errorf("Error adding pod %v to the scheduling queue: %v", pod.Name, err)
}
}
p.unschedulableQ.clear()
p.receivedMoveRequest = true
p.cond.Broadcast()
}
If Pods are deleted frequently in the cluster, all Pods in unschedulableQ keep getting moved to activeQ. If unschedulableQ contains a high-priority Pod, that Pod will repeatedly preempt the scheduling opportunities of lower-priority Pods, leaving them starving for long stretches. The community is already considering a back-off mechanism to soften this; a toy sketch of the idea follows.
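Purely as an illustration of that idea (my own sketch, not scheduler code): instead of moving a failed Pod back to activeQ immediately, delay the re-queue by an exponentially growing, capped interval.
package main

import (
	"fmt"
	"time"
)

const (
	initialBackoff = 1 * time.Second
	maxBackoff     = 10 * time.Second
)

// backoffDuration returns the wait before the attempt-th retry:
// 1s, 2s, 4s, ... capped at maxBackoff.
func backoffDuration(attempt int) time.Duration {
	d := initialBackoff
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d -> wait %v\n", attempt, backoffDuration(attempt))
	}
}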
The unassignedNonTerminatedPod FilterFunc selects Pods that have not been scheduled yet and are not terminated; again, we only look at the PriorityQueue operations of the Add/Update/Delete handlers registered for them.
// unassignedNonTerminatedPod selects pods that are unassigned and non-terminal.
func unassignedNonTerminatedPod(pod *v1.Pod) bool {
if len(pod.Spec.NodeName) != 0 {
return false
}
if pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
return false
}
return true
}
The Add handler registered for unassignedNonTerminatedPod is addPodToSchedulingQueue.
func (c *configFactory) addPodToSchedulingQueue(obj interface{}) {
if err := c.podQueue.Add(obj.(*v1.Pod)); err != nil {
runtime.HandleError(fmt.Errorf("unable to queue %T: %v", obj, err))
}
}
func (p *PriorityQueue) Add(pod *v1.Pod) error {
p.lock.Lock()
defer p.lock.Unlock()
err := p.activeQ.Add(pod)
if err != nil {
glog.Errorf("Error adding pod %v to the scheduling queue: %v", pod.Name, err)
} else {
if p.unschedulableQ.get(pod) != nil {
glog.Errorf("Error: pod %v is already in the unschedulable queue.", pod.Name)
p.deleteNominatedPodIfExists(pod)
p.unschedulableQ.delete(pod)
}
p.addNominatedPodIfNeeded(pod)
p.cond.Broadcast()
}
return err
}
The Update handler registered for unassignedNonTerminatedPod is updatePodInSchedulingQueue.
func (c *configFactory) updatePodInSchedulingQueue(oldObj, newObj interface{}) {
pod := newObj.(*v1.Pod)
if c.skipPodUpdate(pod) {
return
}
if err := c.podQueue.Update(oldObj.(*v1.Pod), pod); err != nil {
runtime.HandleError(fmt.Errorf("unable to update %T: %v", newObj, err))
}
}
func (c *configFactory) skipPodUpdate(pod *v1.Pod) bool {
// Non-assumed pods should never be skipped.
isAssumed, err := c.schedulerCache.IsAssumedPod(pod)
if err != nil {
runtime.HandleError(fmt.Errorf("failed to check whether pod %s/%s is assumed: %v", pod.Namespace, pod.Name, err))
return false
}
if !isAssumed {
return false
}
// Gets the assumed pod from the cache.
assumedPod, err := c.schedulerCache.GetPod(pod)
if err != nil {
runtime.HandleError(fmt.Errorf("failed to get assumed pod %s/%s from cache: %v", pod.Namespace, pod.Name, err))
return false
}
// Compares the assumed pod in the cache with the pod update. If they are
// equal (with certain fields excluded), this pod update will be skipped.
f := func(pod *v1.Pod) *v1.Pod {
p := pod.DeepCopy()
// ResourceVersion must be excluded because each object update will
// have a new resource version.
p.ResourceVersion = ""
// Spec.NodeName must be excluded because the pod assumed in the cache
// is expected to have a node assigned while the pod update may nor may
// not have this field set.
p.Spec.NodeName = ""
// Annotations must be excluded for the reasons described in
// https://github.com/kubernetes/kubernetes/issues/52914.
p.Annotations = nil
return p
}
assumedPodCopy, podCopy := f(assumedPod), f(pod)
if !reflect.DeepEqual(assumedPodCopy, podCopy) {
return false
}
glog.V(3).Infof("Skipping pod %s/%s update", pod.Namespace, pod.Name)
return true
}
skipPodUpdate returns true, meaning the Pod update event is ignored, only when both of the following hold: the Pod is already an assumed Pod in the scheduler cache, and, once ResourceVersion, Spec.NodeName, and Annotations are stripped, the updated Pod is deep-equal to the assumed one. When skipPodUpdate returns false, PriorityQueue.Update is invoked:
func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {
p.lock.Lock()
defer p.lock.Unlock()
// If the pod is already in the active queue, just update it there.
if _, exists, _ := p.activeQ.Get(newPod); exists {
p.updateNominatedPod(oldPod, newPod)
err := p.activeQ.Update(newPod)
return err
}
// If the pod is in the unschedulable queue, updating it may make it schedulable.
if usPod := p.unschedulableQ.get(newPod); usPod != nil {
p.updateNominatedPod(oldPod, newPod)
if isPodUpdated(oldPod, newPod) {
p.unschedulableQ.delete(usPod)
err := p.activeQ.Add(newPod)
if err == nil {
p.cond.Broadcast()
}
return err
}
p.unschedulableQ.addOrUpdate(newPod)
return nil
}
// If pod is not in any of the two queue, we put it in the active queue.
err := p.activeQ.Add(newPod)
if err == nil {
p.addNominatedPodIfNeeded(newPod)
p.cond.Broadcast()
}
return err
}
Update first tries activeQ and updates the Pod in place if it is there. If the Pod sits in unschedulableQ and the update actually changed it (per the isPodUpdated helper, shown below), the Pod is moved to activeQ because the change may have made it schedulable; otherwise it is just updated in place in unschedulableQ. A Pod found in neither queue is added to activeQ.
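For completeness, isPodUpdated strips fields that change on every write before comparing; in the 1.10 tree it looks roughly like this (quoted from memory, verify against your checkout):
func isPodUpdated(oldPod, newPod *v1.Pod) bool {
	strip := func(pod *v1.Pod) *v1.Pod {
		p := pod.DeepCopy()
		p.ResourceVersion = ""
		p.Generation = 0
		p.Status = v1.PodStatus{}
		return p
	}
	return !reflect.DeepEqual(strip(oldPod), strip(newPod))
}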
The Delete handler registered for unassignedNonTerminatedPod is deletePodFromSchedulingQueue.
func (c *configFactory) deletePodFromSchedulingQueue(obj interface{}) {
...
if err := c.podQueue.Delete(pod); err != nil {
runtime.HandleError(fmt.Errorf("unable to dequeue %T: %v", obj, err))
}
...
}
func (p *PriorityQueue) Delete(pod *v1.Pod) error {
p.lock.Lock()
defer p.lock.Unlock()
p.deleteNominatedPodIfExists(pod)
err := p.activeQ.Delete(pod)
if err != nil { // The item was probably not found in the activeQ.
p.unschedulableQ.delete(pod)
}
return nil
}
NodeInformer registers Add/Update/Delete event handlers for Nodes; here we only look at what these handlers do to the PriorityQueue.
func (c *configFactory) addNodeToCache(obj interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
func (c *configFactory) updateNodeInCache(oldObj, newObj interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
As with the PodInformer EventHandler for scheduled Pods above, frequent Node additions or updates cause all Pods in unschedulableQ to be moved to activeQ again and again. A high-priority Pod in unschedulableQ will then repeatedly preempt lower-priority Pods' scheduling opportunities, starving them for long stretches.
serviceInformer registers Add/Update/Delete event handlers for Services; again, only the operations on the PriorityQueue are shown.
func (c *configFactory) onServiceAdd(obj interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
func (c *configFactory) onServiceUpdate(oldObj interface{}, newObj interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
func (c *configFactory) onServiceDelete(obj interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
As above, frequent Service Add/Update/Delete activity moves all Pods in unschedulableQ to activeQ over and over, with the same starvation risk for lower-priority Pods whenever a high-priority Pod sits in unschedulableQ.
pvcInformer registers Add/Update/Delete event handlers for PVCs; again, only the operations on the PriorityQueue are shown.
func (c *configFactory) onPvcAdd(obj interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
func (c *configFactory) onPvcUpdate(old, new interface{}) {
...
c.podQueue.MoveAllToActiveQueue()
}
As above, frequent PVC Add/Update activity moves all Pods in unschedulableQ to activeQ over and over, with the same starvation risk.
Based on the Kubernetes 1.10 code, this post analyzed the scheduler's PriorityQueue: its internal structure (the two important sub-queues), how Pods are pushed into and popped out of the queue, and how Add/Update/Delete events on Pod/Service/Node/PVC objects operate on the two sub-queues. Today's scheduler is considerably more complex than the pre-1.8 versions; I will follow up with separate posts on Equivalence Class, Nominated Pods, VolumeScheduling, and other scheduler topics.