Illustrated Kubernetes Scheduler: Core Data Structures of the Framework

2020-02-03   sandag

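Framework is the second extension mechanism of the Kubernetes scheduler. Whereas SchedulerExtender extends the scheduler through a separate remote Service, the Framework implements an in-process, standardized flow built around extension points.

1. Goals of the extension mechanism

The design of the Framework is already described clearly in the official documentation, and it has not yet reached Stable. This article is based on version 1.18 and covers some implementation details beyond the official description.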

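1.1 Stage extension points

At present the official work mainly extends the earlier predicate (filtering) and priority (scoring) stages with additional extension points. Each extension point corresponds to one class of plugin, so we can write a plugin for whichever stage we need and enhance scheduling there.

In the current version the priority (scoring) plugins have already been moved into the framework, and the predicate plugins will presumably follow. This area will likely take a while longer to stabilize.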

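1.2 context and CycleState

In the Framework implementation, every call into a plugin at an extension point is passed two objects, a context and a CycleState. The context is used much as in most Go programs; here its main job is to provide a unified exit when several stages are processed in parallel. CycleState stores all data for the current scheduling cycle. It carries a read-write lock for concurrent use, although, as section 2.4 shows, Read and Write do not take that lock themselves.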

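1.3 Permit before Bind

Permit runs before the Bind operation. Its design goal is to make one last decision before binding: whether the current pod is allowed to proceed to the final Bind. It has veto power: if any Permit plugin rejects the pod, the pod goes back to be scheduled again.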

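2. Core source code

2.1 Core data structure of the Framework

Put simply, the core data structure of the Framework consists of the plugin collections (one per extension stage), the metadata access interfaces (cluster and snapshot data), and the collection of waiting pods, together with the plugin factory registry that the plugins are built from.

2.1.1 Plugin collections

Plugins are stored by type in separate collections. There is also a map that stores plugin weights; it is currently used only in the scoring stage, and weights for the filtering stage may be added later.

pluginNameToWeightMap map[string]int
queueSortPlugins      []QueueSortPlugin
preFilterPlugins      []PreFilterPlugin
filterPlugins         []FilterPlugin
postFilterPlugins     []PostFilterPlugin
scorePlugins          []ScorePlugin
reservePlugins        []ReservePlugin
preBindPlugins        []PreBindPlugin
bindPlugins           []BindPlugin
postBindPlugins       []PostBindPlugin
unreservePlugins      []UnreservePlugin
permitPlugins         []PermitPlugin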

2.1.2 Cluster data access

These fields implement the data access side of the cluster, mainly in order to satisfy FrameworkHandle, an interface that gives plugins access to data and to cluster operations.

clientSet            clientset.Interface
informerFactory      informers.SharedInformerFactory
volumeBinder         *volumebinder.VolumeBinder
snapshotSharedLister schedulerlisters.SharedLister

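2.1.3 Waiting pod collection

The waiting pod collection stores pods that are waiting in the Permit stage. If a pod is deleted while it is waiting, it is rejected immediately.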

waitingPods           *waitingPodsMap

2.1.4 Plugin factory registry

The registry stores every registered plugin factory, and the concrete plugins are then built through those factories.

registry              Registry

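2.2 Plugin factory registry

2.2.1 Plugin factory function

A factory function takes the given arguments and builds a Plugin. The FrameworkHandle argument is mainly used to obtain the snapshot and other cluster data.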

type PluginFactory = func(configuration *runtime.Unknown, f FrameworkHandle) (Plugin, error)

2.2.2 Registry implementation

Most plugin factories in Go are implemented with a map, and this one is no exception. It exposes the Register and Unregister methods, plus Merge for combining registries.

type Registry map[string]PluginFactory

// Register adds a new plugin to the registry. If a plugin with the same name
// exists, it returns an error.
func (r Registry) Register(name string, factory PluginFactory) error {
    if _, ok := r[name]; ok {
        return fmt.Errorf("a plugin named %v already exists", name)
    }
    r[name] = factory
    return nil
}

// Unregister removes an existing plugin from the registry. If no plugin with
// the provided name exists, it returns an error.
func (r Registry) Unregister(name string) error {
    if _, ok := r[name]; !ok {
        return fmt.Errorf("no plugin named %v exists", name)
    }
    delete(r, name)
    return nil
}

// Merge merges the provided registry to the current one.
func (r Registry) Merge(in Registry) error {
    for name, factory := range in {
        if err := r.Register(name, factory); err != nil {
            return err
        }
    }
    return nil
}
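To make the registry's behaviour concrete, here is a minimal, self-contained sketch. Plugin, FrameworkHandle, namedPlugin and newFactory are simplified stand-ins invented for this example, and the raw configuration argument is reduced to a plain string; only the three Registry methods follow the source above.

```go
package main

import "fmt"

// Plugin and FrameworkHandle are simplified stand-ins (assumptions for this
// sketch), not the real scheduler interfaces.
type Plugin interface{ Name() string }
type FrameworkHandle interface{}

// PluginFactory keeps the shape of the real factory, with the raw
// configuration reduced to a plain string.
type PluginFactory = func(config string, h FrameworkHandle) (Plugin, error)

// Registry maps a plugin name to the factory that builds it.
type Registry map[string]PluginFactory

func (r Registry) Register(name string, factory PluginFactory) error {
	if _, ok := r[name]; ok {
		return fmt.Errorf("a plugin named %v already exists", name)
	}
	r[name] = factory
	return nil
}

func (r Registry) Unregister(name string) error {
	if _, ok := r[name]; !ok {
		return fmt.Errorf("no plugin named %v exists", name)
	}
	delete(r, name)
	return nil
}

func (r Registry) Merge(in Registry) error {
	for name, factory := range in {
		if err := r.Register(name, factory); err != nil {
			return err
		}
	}
	return nil
}

// namedPlugin is a hypothetical plugin that only knows its own name.
type namedPlugin struct{ name string }

func (p namedPlugin) Name() string { return p.name }

// newFactory returns a factory that builds a namedPlugin.
func newFactory(name string) PluginFactory {
	return func(_ string, _ FrameworkHandle) (Plugin, error) {
		return namedPlugin{name: name}, nil
	}
}

func main() {
	r := Registry{}
	fmt.Println(r.Register("demo", newFactory("demo"))) // first registration succeeds
	fmt.Println(r.Register("demo", newFactory("demo"))) // duplicate name is an error

	// Merge goes through Register, so it also fails on an existing name.
	fmt.Println(r.Merge(Registry{"demo": newFactory("demo")}))

	// Look the factory up by name and build the plugin.
	p, _ := r["demo"]("", nil)
	fmt.Println(p.Name())
}
```

Because Merge reuses Register, merging two registries that share a plugin name stops at the first clash, possibly after some of the other entries have already been added.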

2.3 Plugin registration

preFilterPlugins is used here as the example to walk through the whole registration flow.

2.3.1 Plugins

Plugins is constructed during configuration and holds all plugins configured for the current framework. For each extension point it uses a PluginSet to record the enabled and disabled plugins.

type Plugins struct {
    // QueueSort is a list of plugins that should be invoked when sorting pods in the scheduling queue.
    QueueSort *PluginSet

    // PreFilter is a list of plugins that should be invoked at "PreFilter" extension point of the scheduling framework.
    PreFilter *PluginSet

    // Filter is a list of plugins that should be invoked when filtering out nodes that cannot run the Pod.
    Filter *PluginSet

    // PostFilter is a list of plugins that are invoked after filtering out infeasible nodes.
    PostFilter *PluginSet

    // Score is a list of plugins that should be invoked when ranking nodes that have passed the filtering phase.
    Score *PluginSet

    // Reserve is a list of plugins invoked when reserving a node to run the pod.
    Reserve *PluginSet

    // Permit is a list of plugins that control binding of a Pod. These plugins can prevent or delay binding of a Pod.
    Permit *PluginSet

    // PreBind is a list of plugins that should be invoked before a pod is bound.
    PreBind *PluginSet

    // Bind is a list of plugins that should be invoked at "Bind" extension point of the scheduling framework.
    // The scheduler call these plugins in order. Scheduler skips the rest of these plugins as soon as one returns success.
    Bind *PluginSet

    // PostBind is a list of plugins that should be invoked after a pod is successfully bound.
    PostBind *PluginSet

    // Unreserve is a list of plugins invoked when a pod that was previously reserved is rejected in a later phase.
    Unreserve *PluginSet
}

2.3.2 Mapping plugin sets to slices

This method maps each plugin type to the slice in the framework that holds plugins of that type, for example PreFilter to its preFilterPlugins slice. Each entry pairs the configured PluginSet with a pointer to the corresponding slice (such as *[]PreFilterPlugin), which is later manipulated through reflection.

func (f *framework) getExtensionPoints(plugins *config.Plugins) []extensionPoint {
    return []extensionPoint{
        {plugins.PreFilter, &f.preFilterPlugins},
        {plugins.Filter, &f.filterPlugins},
        {plugins.Reserve, &f.reservePlugins},
        {plugins.PostFilter, &f.postFilterPlugins},
        {plugins.Score, &f.scorePlugins},
        {plugins.PreBind, &f.preBindPlugins},
        {plugins.Bind, &f.bindPlugins},
        {plugins.PostBind, &f.postBindPlugins},
        {plugins.Unreserve, &f.unreservePlugins},
        {plugins.Permit, &f.permitPlugins},
        {plugins.QueueSort, &f.queueSortPlugins},
    }
}

2.3.3 Collecting all enabled plugins

It iterates over all the mappings above. At this point plugins are not yet registered into the slice for their type; they are all collected into pgMap.

func (f *framework) pluginsNeeded(plugins *config.Plugins) map[string]config.Plugin {
    pgMap := make(map[string]config.Plugin)

    if plugins == nil {
        return pgMap
    }

    // Anonymous function; as a closure it fills pgMap with every enabled plugin.
    find := func(pgs *config.PluginSet) {
        if pgs == nil {
            return
        }
        for _, pg := range pgs.Enabled { // iterate over the enabled plugins
            pgMap[pg.Name] = pg // save into the map
        }
    }
    // Walk every entry of the mapping table above.
    for _, e := range f.getExtensionPoints(plugins) {
        find(e.plugins)
    }
    return pgMap
}
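The effect of keying by name is that a plugin enabled at several extension points is collected only once. A minimal sketch of that idea follows; PluginCfg, PluginSet and the variadic pluginsNeeded here are simplified stand-ins for the config types, and the plugin names are made up.

```go
package main

import "fmt"

// PluginCfg and PluginSet are reduced stand-ins for the scheduler's config
// types (assumptions for this sketch).
type PluginCfg struct {
	Name   string
	Weight int32
}

type PluginSet struct {
	Enabled []PluginCfg
}

// pluginsNeeded collects every enabled plugin across the given extension
// points, keyed by name. A nil set is skipped; a later entry with the same
// name overwrites an earlier one.
func pluginsNeeded(sets ...*PluginSet) map[string]PluginCfg {
	needed := make(map[string]PluginCfg)
	for _, set := range sets {
		if set == nil {
			continue
		}
		for _, pg := range set.Enabled {
			needed[pg.Name] = pg
		}
	}
	return needed
}

func main() {
	preFilter := &PluginSet{Enabled: []PluginCfg{{Name: "plugin-a"}}}
	filter := &PluginSet{Enabled: []PluginCfg{{Name: "plugin-a"}, {Name: "plugin-b"}}}
	var score *PluginSet // extension point left unconfigured

	needed := pluginsNeeded(preFilter, filter, score)
	fmt.Println(len(needed)) // 2: plugin-a is counted once
}
```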

2.3.4 Building plugin instances from the factories

The generated plugin factory registry is then used: each needed plugin's factory builds a Plugin instance, which is saved in pluginsMap.

pluginsMap := make(map[string]Plugin)
for name, factory := range r {
    // pg is the pgMap built above; only plugins that are actually needed are created.
    if _, ok := pg[name]; !ok {
        continue
    }

    p, err := factory(pluginConfig[name], f)
    if err != nil {
        return nil, fmt.Errorf("error initializing plugin %q: %v", name, err)
    }
    pluginsMap[name] = p

    // Save the weight; a weight of 0 defaults to 1.
    f.pluginNameToWeightMap[name] = int(pg[name].Weight)
    if f.pluginNameToWeightMap[name] == 0 {
        f.pluginNameToWeightMap[name] = 1
    }
    // Checks totalPriority against MaxTotalScore to avoid overflow
    if int64(f.pluginNameToWeightMap[name])*MaxNodeScore > MaxTotalScore-totalPriority {
        return nil, fmt.Errorf("total score of Score plugins could overflow")
    }
    totalPriority += int64(f.pluginNameToWeightMap[name]) * MaxNodeScore
}
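The weight handling can be isolated into a small function to see how the default and the overflow guard interact. This is only a sketch: resolveWeights is a hypothetical helper, the value 100 for MaxNodeScore is an assumption, and the ceiling is passed in as a parameter so that the failure case is easy to trigger.

```go
package main

import (
	"fmt"
	"math"
)

// MaxNodeScore is the assumed per-plugin score ceiling for this sketch.
const MaxNodeScore int64 = 100

// resolveWeights applies the default weight of 1 and rejects a configuration
// whose worst-case total score (weight * MaxNodeScore, summed over plugins)
// would exceed maxTotal. The comparison is written as a subtraction so the
// check itself cannot overflow.
func resolveWeights(configured map[string]int32, maxTotal int64) (map[string]int, error) {
	weights := make(map[string]int, len(configured))
	var total int64
	for name, w := range configured {
		weight := int(w)
		if weight == 0 {
			weight = 1
		}
		if int64(weight)*MaxNodeScore > maxTotal-total {
			return nil, fmt.Errorf("total score of Score plugins could overflow")
		}
		total += int64(weight) * MaxNodeScore
		weights[name] = weight
	}
	return weights, nil
}

func main() {
	// With a huge ceiling the weights are accepted and "a" defaults to 1.
	w, err := resolveWeights(map[string]int32{"a": 0, "b": 3}, math.MaxInt64)
	fmt.Println(w["a"], w["b"], err)

	// Worst case is (1+3)*100 = 400, so a ceiling of 399 is rejected.
	_, err = resolveWeights(map[string]int32{"a": 0, "b": 3}, 399)
	fmt.Println(err)
}
```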

2.3.5 Registering plugins by type

Here e.slicePtr is combined with the pluginsMap built earlier, and reflection is used to register each plugin under its concrete type.

for _, e := range f.getExtensionPoints(plugins) {
    if err := updatePluginList(e.slicePtr, e.plugins, pluginsMap); err != nil {
        return nil, err
    }
}

updatePluginList works mainly through reflection. It takes the address of the corresponding framework slice obtained from getExtensionPoints above and uses reflection to register the plugins and validate them.

func updatePluginList(pluginList interface{}, pluginSet *config.PluginSet, pluginsMap map[string]Plugin) error {
    if pluginSet == nil {
        return nil
    }

    // Elem dereferences the pointer and yields the slice value itself.
    plugins := reflect.ValueOf(pluginList).Elem()
    // The slice type's Elem is the element type, i.e. the plugin interface.
    pluginType := plugins.Type().Elem()
    set := sets.NewString()
    for _, ep := range pluginSet.Enabled {
        pg, ok := pluginsMap[ep.Name]
        if !ok {
            return fmt.Errorf("%s %q does not exist", pluginType.Name(), ep.Name)
        }

        // Validity check: report an error if the plugin does not implement this interface.
        if !reflect.TypeOf(pg).Implements(pluginType) {
            return fmt.Errorf("plugin %q does not extend %s plugin", ep.Name, pluginType.Name())
        }

        if set.Has(ep.Name) {
            return fmt.Errorf("plugin %q already registered as %q", ep.Name, pluginType.Name())
        }

        set.Insert(ep.Name)

        // Append the plugin and store the grown slice back through the pointer.
        newPlugins := reflect.Append(plugins, reflect.ValueOf(pg))
        plugins.Set(newPlugins)
    }
    return nil
}
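The reflection trick can be tried outside the scheduler. The sketch below uses made-up interfaces (Plugin, FilterPlugin, ScorePlugin) and a hypothetical appendPlugin helper; it shows only the core of the technique: take a pointer to a typed slice as interface{}, check that the plugin implements the slice's element type, and append through reflection.

```go
package main

import (
	"fmt"
	"reflect"
)

// These interfaces are illustrative stand-ins, not the scheduler's own.
type Plugin interface{ Name() string }

type FilterPlugin interface {
	Plugin
	Filter() bool
}

type ScorePlugin interface {
	Plugin
	Score() int
}

// filterOnly implements FilterPlugin but not ScorePlugin.
type filterOnly struct{}

func (filterOnly) Name() string { return "filter-only" }
func (filterOnly) Filter() bool { return true }

// appendPlugin appends p to the slice that slicePtr points to, provided p
// implements the slice's element interface.
func appendPlugin(slicePtr interface{}, p Plugin) error {
	slice := reflect.ValueOf(slicePtr).Elem() // the slice value behind the pointer
	elemType := slice.Type().Elem()           // the plugin interface it holds
	if !reflect.TypeOf(p).Implements(elemType) {
		return fmt.Errorf("plugin %q does not extend %s plugin", p.Name(), elemType.Name())
	}
	// Append returns a new slice value; Set writes it back through the pointer.
	slice.Set(reflect.Append(slice, reflect.ValueOf(p)))
	return nil
}

func main() {
	var filters []FilterPlugin
	var scores []ScorePlugin

	fmt.Println(appendPlugin(&filters, filterOnly{}), len(filters)) // accepted
	fmt.Println(appendPlugin(&scores, filterOnly{}), len(scores))   // rejected
}
```

The benefit of this design is that one function serves every extension point. Without reflection, each plugin slice would need its own type-specific registration loop.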

2.4 CycleState

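CycleState is responsible for storing and cloning data during a scheduling cycle. It exposes its read-write lock through methods, and the plugins at each extension point can choose to lock independently as needed.

2.4.1 Data structure

CycleState is not complicated. It mainly stores StateData values, which only need to implement a Clone method. All plugins of the current framework can add to and modify the data in CycleState. A read-write lock is available to keep access thread-safe, but no restriction is imposed per plugin: all plugins are trusted and may add or delete entries freely.

type CycleState struct {
    mx      sync.RWMutex
    storage map[StateKey]StateData
    // if recordPluginMetrics is true, PluginExecutionDuration will be recorded for this cycle.
    recordPluginMetrics bool
}

// StateData is a generic type for arbitrary data stored in CycleState.
type StateData interface {
    // Clone is an interface to make a copy of StateData. For performance reasons,
    // clone should make shallow copies for members (e.g., slices or maps) that are not
    // impacted by PreFilter's optional AddPod/RemovePod methods.
    Clone() StateData
}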
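As an illustration of the Clone contract, here is a sketch of a StateData implementation. nodePodCounts and its fields are hypothetical; the point is that the member that may be mutated later gets its own copy, while an immutable member is simply carried over.

```go
package main

import "fmt"

// StateData mirrors the interface shown above.
type StateData interface {
	Clone() StateData
}

// nodePodCounts is a hypothetical piece of per-cycle state.
type nodePodCounts struct {
	label  string         // never changes after creation
	counts map[string]int // adjusted as pods are added or removed
}

// Clone copies the mutable map so that changes to the clone do not leak
// back into the original.
func (s *nodePodCounts) Clone() StateData {
	c := &nodePodCounts{
		label:  s.label,
		counts: make(map[string]int, len(s.counts)),
	}
	for node, n := range s.counts {
		c.counts[node] = n
	}
	return c
}

func main() {
	orig := &nodePodCounts{label: "zone-a", counts: map[string]int{"node-1": 2}}
	cp := orig.Clone().(*nodePodCounts)
	cp.counts["node-1"]++
	fmt.Println(orig.counts["node-1"], cp.counts["node-1"]) // 2 3
}
```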

2.4.2 External interface

With this interface, the plugin itself has to choose to take the read lock or the write lock, and then read or modify the data.

func (c *CycleState) Read(key StateKey) (StateData, error) {
    if v, ok := c.storage[key]; ok {
        return v, nil
    }
    return nil, errors.New(NotFound)
}

// Write stores the given "val" in CycleState with the given "key".
// This function is not thread safe. In multi-threaded code, lock should be
// acquired first.
func (c *CycleState) Write(key StateKey, val StateData) {
    c.storage[key] = val
}

// Delete deletes data with the given key from CycleState.
// This function is not thread safe. In multi-threaded code, lock should be
// acquired first.
func (c *CycleState) Delete(key StateKey) {
    delete(c.storage, key)
}

// Lock acquires CycleState lock.
func (c *CycleState) Lock() {
    c.mx.Lock()
}

// Unlock releases CycleState lock.
func (c *CycleState) Unlock() {
    c.mx.Unlock()
}

// RLock acquires CycleState read lock.
func (c *CycleState) RLock() {
    c.mx.RLock()
}

// RUnlock releases CycleState read lock.
func (c *CycleState) RUnlock() {
    c.mx.RUnlock()
}
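Since Read and Write do not lock by themselves, a plugin that touches CycleState from several goroutines has to wrap the whole read-modify-write in Lock and Unlock. The sketch below rebuilds a cut-down CycleState to show the pattern; counter, increment and countConcurrently are hypothetical names, and the metrics field is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type StateKey string

type StateData interface{ Clone() StateData }

// CycleState is a cut-down copy of the structure above.
type CycleState struct {
	mx      sync.RWMutex
	storage map[StateKey]StateData
}

func NewCycleState() *CycleState {
	return &CycleState{storage: make(map[StateKey]StateData)}
}

func (c *CycleState) Read(key StateKey) (StateData, error) {
	if v, ok := c.storage[key]; ok {
		return v, nil
	}
	return nil, errors.New("not found")
}

func (c *CycleState) Write(key StateKey, val StateData) { c.storage[key] = val }
func (c *CycleState) Lock()                             { c.mx.Lock() }
func (c *CycleState) Unlock()                           { c.mx.Unlock() }
func (c *CycleState) RLock()                            { c.mx.RLock() }
func (c *CycleState) RUnlock()                          { c.mx.RUnlock() }

// counter is a hypothetical StateData value.
type counter int

func (c counter) Clone() StateData { return c }

// increment holds the write lock across the read and the write, so
// concurrent callers cannot lose updates.
func increment(cs *CycleState, key StateKey) {
	cs.Lock()
	defer cs.Unlock()
	cur := counter(0)
	if v, err := cs.Read(key); err == nil {
		cur = v.(counter)
	}
	cs.Write(key, cur+1)
}

// countConcurrently runs n increments in parallel and returns the final value.
func countConcurrently(n int) int {
	cs := NewCycleState()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			increment(cs, "hits")
		}()
	}
	wg.Wait()

	cs.RLock()
	defer cs.RUnlock()
	v, err := cs.Read("hits")
	if err != nil {
		return 0
	}
	return int(v.(counter))
}

func main() {
	fmt.Println(countConcurrently(50)) // 50
}
```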

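2.5 waitingPodsMap and waitingPod

waitingPodsMap stores the pods that Permit plugins have asked to Wait. Even after passing the earlier scoring stage, a pod in this map may still be rejected by some plugin.

2.5.1 Data structure

Internally, waitingPodsMap keeps a map keyed by pod UID and protects it with a read-write lock.

type waitingPodsMap struct {
    pods map[types.UID]WaitingPod
    mu   sync.RWMutex
}

waitingPod is the waiting instance for one specific pod. Its pendingPlugins field holds the wait timer defined for each plugin, it receives the pod's status through a chan *Status, and a read-write lock serializes access.

type waitingPod struct {
    pod            *v1.Pod
    pendingPlugins map[string]*time.Timer
    s              chan *Status
    mu             sync.RWMutex
}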

2.5.2 Building a waitingPod and its timers

One timer is built per plugin from that plugin's wait time, N in total. If any timer expires, the pod is rejected.

func newWaitingPod(pod *v1.Pod, pluginsMaxWaitTime map[string]time.Duration) *waitingPod {
    wp := &waitingPod{
        pod: pod,
        s:   make(chan *Status),
    }

    wp.pendingPlugins = make(map[string]*time.Timer, len(pluginsMaxWaitTime))
    // The time.AfterFunc calls wp.Reject which iterates through pendingPlugins map. Acquire the
    // lock here so that time.AfterFunc can only execute after newWaitingPod finishes.
    wp.mu.Lock()
    defer wp.mu.Unlock()
    // Build one timer per plugin from its wait time. If any timer expires before
    // the pod has been allowed, Reject is called.
    for k, v := range pluginsMaxWaitTime {
        plugin, waitTime := k, v
        wp.pendingPlugins[plugin] = time.AfterFunc(waitTime, func() {
            msg := fmt.Sprintf("rejected due to timeout after waiting %v at plugin %v",
                waitTime, plugin)
            wp.Reject(msg)
        })
    }

    return wp
}
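The "first timer to expire wins" behaviour can be reproduced with time.AfterFunc alone. In this sketch firstTimeout and demo are hypothetical helpers, and a buffered channel stands in for the call to Reject.

```go
package main

import (
	"fmt"
	"time"
)

// firstTimeout arms one timer per plugin and returns the name of the plugin
// whose timer fires first. It blocks forever if waits is empty.
func firstTimeout(waits map[string]time.Duration) string {
	expired := make(chan string, len(waits)) // buffered so late timers never block
	timers := make([]*time.Timer, 0, len(waits))
	for k, v := range waits {
		plugin, waitTime := k, v // per-iteration copies for the closure
		timers = append(timers, time.AfterFunc(waitTime, func() {
			expired <- plugin
		}))
	}

	first := <-expired
	// Once one timer has fired, the remaining ones are no longer needed.
	for _, t := range timers {
		t.Stop()
	}
	return first
}

// demo uses two made-up plugins with very different wait times.
func demo() string {
	return firstTimeout(map[string]time.Duration{
		"fast": 10 * time.Millisecond,
		"slow": 2 * time.Second,
	})
}

func main() {
	fmt.Println("rejected due to timeout at plugin", demo())
}
```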

2.5.3 Stopping the timers and sending the reject event

When any plugin's timer expires, or a plugin actively rejects, all timers are stopped and the rejection is sent on the status channel.

func (w *waitingPod) Reject(msg string) bool {
    w.mu.RLock()
    defer w.mu.RUnlock()
    // Stop all timers.
    for _, timer := range w.pendingPlugins {
        timer.Stop()
    }

    // Send the reject event through the channel.
    select {
    case w.s <- NewStatus(Unschedulable, msg):
        return true
    default:
        return false
    }
}

2.5.4 Sending the allow event

The allow event is sent only after every plugin has called Allow.

func (w *waitingPod) Allow(pluginName string) bool {
    w.mu.Lock()
    defer w.mu.Unlock()
    if timer, exist := w.pendingPlugins[pluginName]; exist {
        // Stop the timer of the current plugin.
        timer.Stop()
        delete(w.pendingPlugins, pluginName)
    }

    // Only signal success status after all plugins have allowed
    if len(w.pendingPlugins) != 0 {
        return true
    }

    // The success event is sent only once every plugin has allowed.
    select {
    case w.s <- NewStatus(Success, ""): // send the event
        return true
    default:
        return false
    }
}
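The interplay of Allow and Reject is easier to follow in a runnable miniature. The sketch below drops the pod itself, uses a simplified Status with a string code, and, unlike the source, gives the channel a buffer of one so that the example does not depend on a receiver already waiting. allowDemo and rejectDemo are hypothetical drivers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Status is a simplified stand-in for the framework's Status type.
type Status struct {
	Code    string
	Message string
}

type waitingPod struct {
	mu             sync.RWMutex
	pendingPlugins map[string]*time.Timer
	s              chan *Status
}

func newWaitingPod(waits map[string]time.Duration) *waitingPod {
	wp := &waitingPod{
		pendingPlugins: make(map[string]*time.Timer, len(waits)),
		s:              make(chan *Status, 1), // buffered here; an assumption of this sketch
	}
	wp.mu.Lock()
	defer wp.mu.Unlock()
	for k, v := range waits {
		plugin, waitTime := k, v
		wp.pendingPlugins[plugin] = time.AfterFunc(waitTime, func() {
			wp.Reject(fmt.Sprintf("timeout after %v at plugin %v", waitTime, plugin))
		})
	}
	return wp
}

// Allow removes the plugin from the pending set and signals success only
// when no plugin is left pending.
func (w *waitingPod) Allow(pluginName string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if timer, ok := w.pendingPlugins[pluginName]; ok {
		timer.Stop()
		delete(w.pendingPlugins, pluginName)
	}
	if len(w.pendingPlugins) != 0 {
		return true
	}
	select {
	case w.s <- &Status{Code: "Success"}:
		return true
	default:
		return false
	}
}

// Reject stops every timer and signals the rejection.
func (w *waitingPod) Reject(msg string) bool {
	w.mu.RLock()
	defer w.mu.RUnlock()
	for _, timer := range w.pendingPlugins {
		timer.Stop()
	}
	select {
	case w.s <- &Status{Code: "Unschedulable", Message: msg}:
		return true
	default:
		return false
	}
}

// allowDemo reports whether a status arrived after only one of two plugins
// allowed, and the status code after both did.
func allowDemo() (early bool, final string) {
	wp := newWaitingPod(map[string]time.Duration{"a": time.Hour, "b": time.Hour})
	wp.Allow("a")
	select {
	case <-wp.s:
		early = true
	default:
	}
	wp.Allow("b")
	return early, (<-wp.s).Code
}

// rejectDemo shows that a single Reject is enough to end the wait.
func rejectDemo() string {
	wp := newWaitingPod(map[string]time.Duration{"a": time.Hour, "b": time.Hour})
	wp.Reject("denied by a")
	st := <-wp.s
	return st.Code + ": " + st.Message
}

func main() {
	fmt.Println(allowDemo())
	fmt.Println(rejectDemo())
}
```

The asymmetry is deliberate: one rejection is final, while approval needs every pending plugin to agree.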

2.5.5 Wait in the Permit stage

All Permit plugins are run first. If any of them returns the Wait status, the framework then waits according to the plugins' wait times.

func (f *framework) RunPermitPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, nodeName string) (status *Status) {
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(permit, status.Code().String()).Observe(metrics.SinceInSeconds(startTime))
    }()
    pluginsWaitTime := make(map[string]time.Duration)
    statusCode := Success
    for _, pl := range f.permitPlugins {
        status, timeout := f.runPermitPlugin(ctx, pl, state, pod, nodeName)
        if !status.IsSuccess() {
            if status.IsUnschedulable() {
                msg := fmt.Sprintf("rejected by %q at permit: %v", pl.Name(), status.Message())
                klog.V(4).Infof(msg)
                return NewStatus(status.Code(), msg)
            }
            if status.Code() == Wait {
                // Not allowed to be greater than maxTimeout.
                if timeout > maxTimeout {
                    timeout = maxTimeout
                }
                // Record the wait time of the current plugin.
                pluginsWaitTime[pl.Name()] = timeout
                statusCode = Wait
            } else {
                msg := fmt.Sprintf("error while running %q permit plugin for pod %q: %v", pl.Name(), pod.Name, status.Message())
                klog.Error(msg)
                return NewStatus(Error, msg)
            }
        }
    }

    // We now wait for the minimum duration if at least one plugin asked to
    // wait (and no plugin rejected the pod)
    if statusCode == Wait {
        startTime := time.Now()
        // Build a waitingPod from the plugins' wait times.
        w := newWaitingPod(pod, pluginsWaitTime)
        // Add it to waitingPods.
        f.waitingPods.add(w)
        // Remove it again when this function returns.
        defer f.waitingPods.remove(pod.UID)
        klog.V(4).Infof("waiting for pod %q at permit", pod.Name)
        // Wait for the status message.
        s := <-w.s
        metrics.PermitWaitDuration.WithLabelValues(s.Code().String()).Observe(metrics.SinceInSeconds(startTime))
        if !s.IsSuccess() {
            if s.IsUnschedulable() {
                msg := fmt.Sprintf("pod %q rejected while waiting at permit: %v", pod.Name, s.Message())
                klog.V(4).Infof(msg)
                return NewStatus(s.Code(), msg)
            }
            msg := fmt.Sprintf("error received while waiting at permit for pod %q: %v", pod.Name, s.Message())
            klog.Error(msg)
            return NewStatus(Error, msg)
        }
    }

    return nil
}
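The first half of RunPermitPlugins is essentially a fold over the plugin results. A minimal sketch of that aggregation, under simplifying assumptions: Code has only three values here, permitResult, aggregate and demo are hypothetical names, and maxTimeout is a parameter rather than a package constant.

```go
package main

import (
	"fmt"
	"time"
)

// Code is a reduced status code set for this sketch.
type Code int

const (
	Success Code = iota
	Unschedulable
	Wait
)

// permitResult is what one hypothetical Permit plugin returned.
type permitResult struct {
	plugin  string
	code    Code
	timeout time.Duration
}

// aggregate applies the veto rule and collects clamped wait times:
// the first Unschedulable ends the run, any Wait makes the overall result
// Wait, and no timeout may exceed maxTimeout.
func aggregate(results []permitResult, maxTimeout time.Duration) (Code, map[string]time.Duration) {
	waits := make(map[string]time.Duration)
	overall := Success
	for _, r := range results {
		switch r.code {
		case Unschedulable:
			return Unschedulable, nil
		case Wait:
			t := r.timeout
			if t > maxTimeout {
				t = maxTimeout
			}
			waits[r.plugin] = t
			overall = Wait
		}
	}
	return overall, waits
}

// demo returns the overall code plus the clamped waits of plugins b (in
// minutes) and c (in seconds).
func demo() (Code, int, int) {
	code, waits := aggregate([]permitResult{
		{plugin: "a", code: Success},
		{plugin: "b", code: Wait, timeout: time.Hour},
		{plugin: "c", code: Wait, timeout: 5 * time.Second},
	}, 15*time.Minute)
	return code, int(waits["b"].Minutes()), int(waits["c"].Seconds())
}

func main() {
	code, b, c := demo()
	fmt.Println(code == Wait, b, c) // true 15 5
}
```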

2.6 Overview of the plugin invocation methods

With plugin registration, per-cycle data storage, and the waiting mechanism covered, what remains is the invocation logic for each class of plugin. Apart from the scoring stage, the remaining stages involve almost no logic, and the scoring stage resembles the design described in earlier posts in this series, so it is not repeated here.

2.6.1 RunPreFilterPlugins

The flow looks quite simple. Note that if any one plugin rejects here, scheduling fails immediately.

func (f *framework) RunPreFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod) (status *Status) {
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(preFilter, status.Code().String()).Observe(metrics.SinceInSeconds(startTime))
    }()
    for _, pl := range f.preFilterPlugins {
        status = f.runPreFilterPlugin(ctx, pl, state, pod)
        if !status.IsSuccess() {
            if status.IsUnschedulable() {
                msg := fmt.Sprintf("rejected by %q at prefilter: %v", pl.Name(), status.Message())
                klog.V(4).Infof(msg)
                return NewStatus(status.Code(), msg)
            }
            msg := fmt.Sprintf("error while running %q prefilter plugin for pod %q: %v", pl.Name(), pod.Name, status.Message())
            klog.Error(msg)
            return NewStatus(Error, msg)
        }
    }

    return nil
}

2.6.2 RunFilterPlugins

This is similar to the previous one, except that the runAllFilters parameter decides whether every plugin is run. By default it is not, since the node has already failed once one filter rejects it.

func (f *framework) RunFilterPlugins(
    ctx context.Context,
    state *CycleState,
    pod *v1.Pod,
    nodeInfo *schedulernodeinfo.NodeInfo,
) PluginToStatus {
    var firstFailedStatus *Status
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(filter, firstFailedStatus.Code().String()).Observe(metrics.SinceInSeconds(startTime))
    }()
    statuses := make(PluginToStatus)
    for _, pl := range f.filterPlugins {
        pluginStatus := f.runFilterPlugin(ctx, pl, state, pod, nodeInfo)
        if len(statuses) == 0 {
            firstFailedStatus = pluginStatus
        }
        if !pluginStatus.IsSuccess() {
            if !pluginStatus.IsUnschedulable() {
                // Filter plugins are not supposed to return any status other than
                // Success or Unschedulable.
                firstFailedStatus = NewStatus(Error, fmt.Sprintf("running %q filter plugin for pod %q: %v", pl.Name(), pod.Name, pluginStatus.Message()))
                return map[string]*Status{pl.Name(): firstFailedStatus}
            }
            statuses[pl.Name()] = pluginStatus
            if !f.runAllFilters {
                // No need to run the remaining plugins; exit early.
                return statuses
            }
        }
    }

    return statuses
}
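The effect of runAllFilters is just the difference between stopping at the first failure and collecting every failure. A minimal sketch with made-up types (filter, runFilters, reject) that keeps only that control flow:

```go
package main

import "fmt"

// filter is a hypothetical named check against a node.
type filter struct {
	name  string
	check func(node string) (ok bool, reason string)
}

// runFilters returns the failure reason per filter. With runAll false it
// stops at the first failure; with runAll true it records all of them.
func runFilters(filters []filter, node string, runAll bool) map[string]string {
	failures := make(map[string]string)
	for _, f := range filters {
		ok, reason := f.check(node)
		if ok {
			continue
		}
		failures[f.name] = reason
		if !runAll {
			return failures
		}
	}
	return failures
}

// reject builds a check that always fails with the given reason.
func reject(reason string) func(string) (bool, string) {
	return func(string) (bool, string) { return false, reason }
}

func main() {
	filters := []filter{
		{name: "a", check: reject("too small")},
		{name: "b", check: reject("wrong zone")},
	}
	fmt.Println(len(runFilters(filters, "node-1", false))) // 1
	fmt.Println(len(runFilters(filters, "node-1", true)))  // 2
}
```

Stopping early is the cheaper default. Running everything costs more but reports every reason a node was unsuitable.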

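That is all for today. The scheduler has changed quite a lot, and it is foreseeable that more scheduling plugins will be consolidated into the framework. This also wraps up my study of the kubernetes scheduler series. As a Kubernetes newcomer I found it fairly hard going; fortunately the scheduler is less tightly coupled to other modules than most.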