How the Kubernetes Eviction Manager Starts

This article explains how the Kubernetes Eviction Manager is started, following the kubelet source code step by step.

Kubernetes Eviction Manager Source Code Analysis

Where the Kubernetes Eviction Manager Is Started

When the kubelet instantiates a Kubelet object, it calls eviction.NewManager to create an evictionManager object.

pkg/kubelet/kubelet.go:273
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, standaloneMode bool) (*Kubelet, error) {
	...
	thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
	if err != nil {
		return nil, err
	}
	evictionConfig := eviction.Config{
		PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
		MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
		Thresholds:               thresholds,
		KernelMemcgNotification:  kubeCfg.ExperimentalKernelMemcgNotification,
	}
	...
	// setup eviction manager
	evictionManager, evictionAdmitHandler, err := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, kubeDeps.Recorder, nodeRef, klet.clock)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize eviction manager: %v", err)
	}
	klet.evictionManager = evictionManager
	klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
	...
}
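To give a feel for what ParseThresholdConfig receives, here is a minimal sketch that splits an --eviction-hard style expression such as memory.available<100Mi,nodefs.available<10% into individual thresholds. The threshold type and the parseThresholds function are hypothetical names invented for this illustration. The real parser also validates signals, parses quantities and percentages, and handles soft thresholds, grace periods and minimum reclaim, none of which is reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// threshold is a simplified, hypothetical stand-in for eviction.Threshold.
type threshold struct {
	Signal   string // e.g. "memory.available"
	Operator string // only "<" is handled in this sketch
	Value    string // raw quantity or percentage, e.g. "100Mi" or "10%"
}

// parseThresholds splits a comma-separated list of "signal<value" statements.
func parseThresholds(expr string) ([]threshold, error) {
	var out []threshold
	if strings.TrimSpace(expr) == "" {
		return out, nil
	}
	for _, stmt := range strings.Split(expr, ",") {
		parts := strings.Split(stmt, "<")
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid eviction threshold %q", stmt)
		}
		signal := strings.TrimSpace(parts[0])
		value := strings.TrimSpace(parts[1])
		if signal == "" || value == "" {
			return nil, fmt.Errorf("invalid eviction threshold %q", stmt)
		}
		out = append(out, threshold{Signal: signal, Operator: "<", Value: value})
	}
	return out, nil
}

func main() {
	ths, err := parseThresholds("memory.available<100Mi,nodefs.available<10%")
	if err != nil {
		panic(err)
	}
	for _, t := range ths {
		fmt.Printf("%s %s %s\n", t.Signal, t.Operator, t.Value)
	}
}
```

Each parsed threshold would then end up in the Thresholds field of eviction.Config, as in the excerpt above.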

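When the kubelet starts working in its Run method, it launches a goroutine that runs updateRuntimeUp every 5s. Once updateRuntimeUp has confirmed that the container runtime is up, it calls initializeRuntimeDependentModules to initialize the modules that depend on the runtime.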

pkg/kubelet/kubelet.go:1219
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	...
	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
	...
}

Following the code into initializeRuntimeDependentModules, we can see that the runtime-dependent modules are cadvisor and evictionManager, and that initializing them simply means calling their Start methods in turn.

pkg/kubelet/kubelet.go:1206
func (kl *Kubelet) initializeRuntimeDependentModules() {
	if err := kl.cadvisor.Start(); err != nil {
		// Fail kubelet and rely on the babysitter to retry starting kubelet.
		// TODO(random-liu): Add backoff logic in the babysitter
		glog.Fatalf("Failed to start cAdvisor %v", err)
	}
	// eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
	if err := kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod); err != nil {
		kl.runtimeState.setInternalError(fmt.Errorf("failed to start eviction manager %v", err))
	}
}
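The start-up sequence described so far (a periodic check, then a one-time initialization of modules in a fixed order) can be sketched as follows. This is an illustration under stated assumptions, not the kubelet's code: kubeletSketch and module are hypothetical types, and using sync.Once to guarantee that initialization runs only once is an assumption of this sketch rather than something the excerpt above shows.

```go
package main

import (
	"fmt"
	"sync"
)

// module is a hypothetical stand-in for cadvisor or the eviction manager.
type module struct {
	name  string
	start func() error
}

// kubeletSketch holds only what this illustration needs.
type kubeletSketch struct {
	once      sync.Once   // assumption: guards the one-time initialization
	runtimeUp func() bool // reports whether the container runtime is ready
	modules   []module    // started in slice order
	started   []string    // names of modules that started successfully
}

// updateRuntimeUp is meant to be called periodically (every 5s in the kubelet).
func (k *kubeletSketch) updateRuntimeUp() {
	if !k.runtimeUp() {
		return // runtime not ready yet, try again on the next tick
	}
	k.once.Do(k.initializeRuntimeDependentModules)
}

// initializeRuntimeDependentModules starts the modules in order and stops at the first failure.
func (k *kubeletSketch) initializeRuntimeDependentModules() {
	for _, m := range k.modules {
		if err := m.start(); err != nil {
			fmt.Printf("failed to start %s: %v\n", m.name, err)
			return
		}
		k.started = append(k.started, m.name)
	}
}

func main() {
	up := false
	k := &kubeletSketch{runtimeUp: func() bool { return up }}
	k.modules = []module{
		{name: "cadvisor", start: func() error { return nil }},
		{name: "evictionManager", start: func() error { return nil }},
	}
	k.updateRuntimeUp() // runtime not ready: nothing starts
	up = true
	k.updateRuntimeUp() // starts both modules, cadvisor first
	k.updateRuntimeUp() // no effect: initialization already ran
	fmt.Println(k.started)
}
```

Keeping cadvisor ahead of the eviction manager in the slice mirrors the ordering constraint stated in the source comment: the eviction manager needs to know whether the runtime has a dedicated imagefs.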

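So this is where the analysis of evictionManager begins.

Definition of the Kubernetes Eviction Manager

As the analysis above shows, the kubelet starts evictionManager while it initializes the runtime-dependent modules during startup. Before going further, we first need to look at how the Eviction Manager is defined.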

pkg/kubelet/eviction/eviction_manager.go:40
// managerImpl implements Manager
type managerImpl struct {
 // used to track time
 clock clock.Clock
 // config is how the manager is configured
 config Config
 // the function to invoke to kill a pod
 killPodFunc KillPodFunc
 // the interface that knows how to do image gc
 imageGC ImageGC
 // protects access to internal state
 sync.RWMutex
 // node conditions are the set of conditions present
 nodeConditions []v1.NodeConditionType
 // captures when a node condition was last observed based on a threshold being met
 nodeConditionsLastObservedAt nodeConditionsObservedAt
 // nodeRef is a reference to the node
 nodeRef *v1.ObjectReference
 // used to record events about the node
 recorder record.EventRecorder
 // used to measure usage stats on system
 summaryProvider stats.SummaryProvider
 // records when a threshold was first observed
 thresholdsFirstObservedAt thresholdsObservedAt
 // records the set of thresholds that have been met (including graceperiod) but not yet resolved
 thresholdsMet []Threshold
 // resourceToRankFunc maps a resource to ranking function for that resource.
 resourceToRankFunc map[v1.ResourceName]rankFunc
 // resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
 resourceToNodeReclaimFuncs map[v1.ResourceName]nodeReclaimFuncs
 // last observations from synchronize
 lastObservations signalObservations
 // notifiersInitialized indicates if the threshold notifiers have been initialized (i.e. synchronize() has been called once)
 notifiersInitialized bool
}

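managerImpl is the concrete definition of evictionManager. The fields to focus on are:

config: the configuration of evictionManager, which includes:

PressureTransitionPeriod (--eviction-pressure-transition-period)

MaxPodGracePeriodSeconds (--eviction-max-pod-grace-period)

Thresholds (--eviction-hard, --eviction-soft)

KernelMemcgNotification (--experimental-kernel-memcg-notification)

killPodFunc: the function used to kill a pod when evicting it. When the kubelet calls NewManager, it is set to the function returned by killPodNow (pkg/kubelet/pod_workers.go:285).

imageGC: when the node has a DiskPressure condition, imageGC deletes unused images to reclaim disk space.

summaryProvider: provides the latest stats summary for the node and all pods on it, i.e. NodeStats and []PodStats.

thresholdsFirstObservedAt: records when each threshold was first observed.

thresholdsMet: holds the Thresholds that have been triggered but not yet resolved, including those still waiting out their grace period.

resourceToRankFunc: defines, per Resource, the ranking function used to choose pods for eviction.

resourceToNodeReclaimFuncs: defines, per Resource, the functions called to reclaim that resource at node level.

lastObservations: the eviction signal observations from the previous pass, used to make sure thresholds are only updated in the correct time order.

notifiersInitialized: a bool indicating whether the threshold notifiers have been initialized, which decides whether the kernel memcg notification feature still needs to be set up to speed up eviction response. The value is false when the manager is created; whether kernel memcg notification is used at all depends entirely on the kubelet's --experimental-kernel-memcg-notification flag.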

The kubelet calls eviction.NewManager in NewMainKubelet to create the evictionManager. The code of eviction.NewManager is very simple: it just assigns fields.

pkg/kubelet/eviction/eviction_manager.go:79
// NewManager returns a configured Manager and an associated admission handler to enforce eviction configuration.
func NewManager(
	summaryProvider stats.SummaryProvider,
	config Config,
	killPodFunc KillPodFunc,
	imageGC ImageGC,
	recorder record.EventRecorder,
	nodeRef *v1.ObjectReference,
	clock clock.Clock) (Manager, lifecycle.PodAdmitHandler, error) {
	manager := &managerImpl{
		clock:                        clock,
		killPodFunc:                  killPodFunc,
		imageGC:                      imageGC,
		config:                       config,
		recorder:                     recorder,
		summaryProvider:              summaryProvider,
		nodeRef:                      nodeRef,
		nodeConditionsLastObservedAt: nodeConditionsObservedAt{},
		thresholdsFirstObservedAt:    thresholdsObservedAt{},
	}
	return manager, manager, nil
}

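One point is important, though: besides the evictionManager object, NewManager also returns a lifecycle.PodAdmitHandler, which the kubelet stores as evictionAdmitHandler. As the statement return manager, manager, nil shows, both return values are the same managerImpl pointer exposed through two different interfaces, not two separate instances. The kubelet uses evictionAdmitHandler for an admission check before it creates a Pod, and only goes on to create the Pod if the check passes. The check is done by the Admit(attrs *lifecycle.PodAdmitAttributes) method, shown below: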

pkg/kubelet/eviction/eviction_manager.go:102
// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}
	// the node has memory pressure, admit if not best-effort
	if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
		notBestEffort := qos.BestEffort != qos.GetPodQOS(attrs.Pod)
		if notBestEffort || kubepod.IsCriticalPod(attrs.Pod) {
			return lifecycle.PodAdmitResult{Admit: true}
		}
	}
	// reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
	glog.Warningf("Failed to admit pod %v - %s", format.Pod(attrs.Pod), "node has conditions: %v", m.nodeConditions)
	return lifecycle.PodAdmitResult{
		Admit:   false,
		Reason:  reason,
		Message: fmt.Sprintf(message, m.nodeConditions),
	}
}
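The decision order in Admit is easy to misread, so here is a minimal, self-contained sketch that mirrors it with plain types. The pod, qosClass and nodeCondition types and the admit function are hypothetical simplifications made for this illustration. They stand in for the v1, qos and lifecycle packages used by the real code.

```go
package main

import "fmt"

type qosClass string

const (
	guaranteed qosClass = "Guaranteed"
	burstable  qosClass = "Burstable"
	bestEffort qosClass = "BestEffort"
)

type nodeCondition string

const (
	memoryPressure nodeCondition = "MemoryPressure"
	diskPressure   nodeCondition = "DiskPressure"
)

// pod carries only the attributes the admission decision looks at.
type pod struct {
	name     string
	qos      qosClass
	critical bool
}

func hasCondition(conds []nodeCondition, want nodeCondition) bool {
	for _, c := range conds {
		if c == want {
			return true
		}
	}
	return false
}

// admit follows the same order of checks as the Admit excerpt above.
func admit(conds []nodeCondition, p pod) bool {
	if len(conds) == 0 {
		return true // no pressure at all: admit everything
	}
	if hasCondition(conds, memoryPressure) {
		if p.qos != bestEffort || p.critical {
			return true // only non-critical BestEffort pods are held back
		}
	}
	return false // BestEffort under memory pressure, or any other pressure
}

func main() {
	mem := []nodeCondition{memoryPressure}
	disk := []nodeCondition{diskPressure}
	fmt.Println(admit(nil, pod{name: "a", qos: bestEffort}))  // true
	fmt.Println(admit(mem, pod{name: "b", qos: bestEffort}))  // false
	fmt.Println(admit(mem, pod{name: "c", qos: burstable}))   // true
	fmt.Println(admit(disk, pod{name: "d", qos: guaranteed})) // false
}
```

One consequence of this ordering, which follows directly from the quoted code, is that a pod that is not BestEffort is admitted whenever MemoryPressure is present, even if the node reports DiskPressure at the same time.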

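This Pod Admit logic is exactly the effect of the EvictionManager on Pod scheduling that was described in the Scheduler section of the companion article on how the Kubernetes Eviction Manager works:

The kubelet periodically reports the Node Conditions to kube-apiserver, where they are stored in etcd. Once kube-scheduler observes a pressure Node Condition through its watch, it applies the following policy to stop more Pods from being bound to that Node.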

Node Condition    Scheduler Behavior
MemoryPressure    No new BestEffort pods are scheduled to the node.
DiskPressure      No new pods are scheduled to the node.

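The code of killPodNow is analyzed later.

That largely settles what evictionManager is and where it comes from. Next, let's look at how evictionManager is started.

Starting the Kubernetes Eviction Manager

As analyzed above, the kubelet starts evictionManager while it initializes the runtime-dependent modules during startup (kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod)). So let's look at the Start method first: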

pkg/kubelet/eviction/eviction_manager.go:126
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, monitoringInterval time.Duration) error {
 // start the eviction manager monitoring
 go wait.Until(func() { m.synchronize(diskInfoProvider, podFunc) }, monitoringInterval, wait.NeverStop)
 return nil
}

This is simple: it starts a goroutine that runs m.synchronize, waits monitoringInterval (10s) after each run finishes, and then runs m.synchronize again, over and over.
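The loop behaviour just described (run, then wait a fixed period after the run completes, then run again until told to stop) can be sketched with the standard library alone. The until function below is a hypothetical, simplified stand-in written for this illustration. It is not the implementation of wait.Until.

```go
package main

import (
	"fmt"
	"time"
)

// until runs f, waits period after f returns, and repeats until stop is closed.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Do not start another run once stop has been closed.
		select {
		case <-stop:
			return
		default:
		}
		f()
		// The wait starts only after f has finished.
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

func main() {
	stop := make(chan struct{})
	done := make(chan struct{})
	runs := 0
	go func() {
		defer close(done)
		until(func() {
			runs++ // stands in for m.synchronize
			if runs == 3 {
				close(stop)
			}
		}, 10*time.Millisecond, stop)
	}()
	<-done
	fmt.Println("synchronize ran", runs, "times")
}
```

In the kubelet the stop channel is wait.NeverStop, so the loop runs for the lifetime of the process. The sketch closes the channel only so that the example terminates.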

Next comes the core workflow of evictionManager:

pkg/kubelet/eviction/eviction_manager.go:181
// synchronize is the main control loop that enforces eviction thresholds.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) {
	// if we have nothing to do, just return
	thresholds := m.config.Thresholds
	if len(thresholds) == 0 {
		return
	}
	// build the ranking functions (if not yet known)
	if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
		// this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
		hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
		if err != nil {
			return
		}
		m.resourceToRankFunc = buildResourceToRankFunc(hasDedicatedImageFs)
		m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, hasDedicatedImageFs)
	}
	// make observations and get a function to derive pod usage stats relative to those observations.
	observations, statsFunc, err := makeSignalObservations(m.summaryProvider)
	if err != nil {
		glog.Errorf("eviction manager: unexpected err: %v", err)
		return
	}
	// attempt to create a threshold notifier to improve eviction response time
	if m.config.KernelMemcgNotification && !m.notifiersInitialized {
		glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
		m.notifiersInitialized = true
		// start soft memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
			glog.Infof("soft memory eviction threshold crossed at %s", desc)
			// TODO wait grace period for soft memory limit
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
		}
		// start hard memory notification
		err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
			glog.Infof("hard memory eviction threshold crossed at %s", desc)
			m.synchronize(diskInfoProvider, podFunc)
		})
		if err != nil {
			glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
		}
	}
	// determine the set of thresholds met independent of grace period
	thresholds = thresholdsMet(thresholds, observations, false)
	// determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
	if len(m.thresholdsMet) > 0 {
		thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
		thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
	}
	// determine the set of thresholds whose stats have been updated since the last sync
	thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
	// track when a threshold was first observed
	now := m.clock.Now()
	thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
	// the set of node conditions that are triggered by currently observed thresholds
	nodeConditions := nodeConditions(thresholds)
	// track when a node condition was last observed
	nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
	// node conditions report true if it has been observed within the transition period window
	nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
	// determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
	thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
	// update internal state
	m.Lock()
	m.nodeConditions = nodeConditions
	m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
	m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
	m.thresholdsMet = thresholds
	m.lastObservations = observations
	m.Unlock()
	// determine the set of resources under starvation
	starvedResources := getStarvedResources(thresholds)
	if len(starvedResources) == 0 {
		glog.V(3).Infof("eviction manager: no resources are starved")
		return
	}
	// rank the resources to reclaim by eviction priority
	sort.Sort(byEvictionPriority(starvedResources))
	resourceToReclaim := starvedResources[0]
	glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)
	// determine if this is a soft or hard eviction associated with the resource
	softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)
	// record an event about the resources we are now attempting to reclaim via eviction
	m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)
	// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
	if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
		glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
		return
	}
	glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)
	// rank the pods for eviction
	rank, ok := m.resourceToRankFunc[resourceToReclaim]
	if !ok {
		glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
		return
	}
	// the only candidates viable for eviction are those pods that had anything running.
	activePods := podFunc()
	if len(activePods) == 0 {
		glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
		return
	}
	// rank the running pods for eviction for the specified resource
	rank(activePods, statsFunc)
	glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))
	// we kill at most a single pod during each eviction interval
	for i := range activePods {
		pod := activePods[i]
		status := v1.PodStatus{
			Phase:   v1.PodFailed,
			Message: fmt.Sprintf(message, resourceToReclaim),
			Reason:  reason,
		}
		// record that we are evicting the pod
		m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
		gracePeriodOverride := int64(0)
		if softEviction {
			gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
		}
		// this is a blocking call and should only return when the pod and its containers are killed.
		err := m.killPodFunc(pod, status, &gracePeriodOverride)
		if err != nil {
			glog.Infof("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
			continue
		}
		// success, so we return until the next housekeeping interval
		glog.Infof("eviction manager: pod %s evicted successfully", format.Pod(pod))
		return
	}
	glog.Infof("eviction manager: unable to evict any pods from the node")
}
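Two of the checks in synchronize are worth seeing in isolation: deciding whether a threshold is met (optionally requiring the minimum reclaim on top) and deciding whether it has been met for long enough to pass its grace period. The following is a sketch under simplifying assumptions. The threshold type, the byte counts as plain int64 values, and the metGracePeriod and example functions are all hypothetical. The real helpers work on resource quantities and signal observations.

```go
package main

import (
	"fmt"
	"time"
)

// threshold is a simplified stand-in: it is met when the available
// amount of its signal drops below value.
type threshold struct {
	signal      string
	value       int64         // bytes
	minReclaim  int64         // extra bytes required before the threshold counts as resolved
	gracePeriod time.Duration // zero for a hard threshold
}

// thresholdsMet returns the thresholds crossed by the observations. With
// enforceMinReclaim set, a threshold stays met until available reaches
// value+minReclaim.
func thresholdsMet(ths []threshold, available map[string]int64, enforceMinReclaim bool) []threshold {
	var met []threshold
	for _, t := range ths {
		avail, ok := available[t.signal]
		if !ok {
			continue // no observation for this signal
		}
		limit := t.value
		if enforceMinReclaim {
			limit += t.minReclaim
		}
		if avail < limit {
			met = append(met, t)
		}
	}
	return met
}

// metGracePeriod keeps the thresholds that were first observed at least
// gracePeriod ago.
func metGracePeriod(firstObserved map[threshold]time.Time, now time.Time) []threshold {
	var out []threshold
	for t, at := range firstObserved {
		if now.Sub(at) < t.gracePeriod {
			continue // still inside the grace period
		}
		out = append(out, t)
	}
	return out
}

// example evaluates one soft threshold and returns how many thresholds are
// newly met, still unresolved, and past their grace period.
func example() (int, int, int) {
	const mi = int64(1) << 20
	soft := threshold{signal: "memory.available", value: 300 * mi, minReclaim: 50 * mi, gracePeriod: 30 * time.Second}
	available := map[string]int64{"memory.available": 320 * mi}
	now := time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC)

	met := thresholdsMet([]threshold{soft}, available, false)       // 320Mi is not below 300Mi
	unresolved := thresholdsMet([]threshold{soft}, available, true) // 320Mi is below 300Mi+50Mi
	firstObserved := map[threshold]time.Time{soft: now.Add(-40 * time.Second)}
	ready := metGracePeriod(firstObserved, now) // observed for 40s, grace period is 30s
	return len(met), len(unresolved), len(ready)
}

func main() {
	fmt.Println(example()) // prints: 0 1 1
}
```

The middle result shows why synchronize evaluates m.thresholdsMet a second time with min-reclaim enforced: a threshold that is no longer strictly crossed can still be unresolved until the extra headroom has been reclaimed.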

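The code is neatly written and well commented. The key flow is as follows:

buildResourceToRankFunc and buildResourceToNodeReclaimFuncs register, respectively, the per-Resource ranking functions used when evicting Pods and the Reclaim functions used to recover Node resources.

makeSignalObservations obtains the Eviction Signal Observations from cAdvisor data, together with the Pod StatsFunc (needed later when ranking Pods).

If the kubelet is configured with --experimental-kernel-memcg-notification set to true, startMemoryThresholdNotifier starts the soft and hard memory notifications. As soon as system usage reaches a soft or hard memory threshold, the kubelet is notified immediately and evictionManager.synchronize is triggered to reclaim resources. This makes eviction react faster.

From the Observations computed from cAdvisor data (observations) and the configured thresholds, thresholdsMet computes the thresholds that are met in this pass.

Then, from the same observations and the stored m.thresholdsMet, thresholdsMet (this time enforcing min-reclaim) computes the thresholds that were recorded earlier but are not yet resolved, and these are merged with the thresholds from the previous step.

The thresholds are filtered by comparing the timestamps of the Signals in lastObservations with those in observations, keeping only thresholds whose stats have been updated since the last sync.

thresholdsFirstObservedAt and nodeConditions are updated.

The thresholds are filtered down to those for which the grace period has already elapsed between the first observed time and now.

The internal state of the evictionManager object is updated: nodeConditions, thresholdsFirstObservedAt, nodeConditionsLastObservedAt, thresholdsMet (set to thresholds) and lastObservations (set to observations).

starvedResources is derived from thresholds and sorted. If memory is among the starvedResources, memory is ranked first.

The first Resource in starvedResources is taken, and reclaimNodeLevelResources is called to reclaim that Resource at Node level. If, after reclaiming, available satisfies thresholdValue+evictionMinimumReclaim, the flow ends and no user pods are evicted.

If reclaimNodeLevelResources is not enough to meet the requirement, the flow goes on to evict user pods. It first ranks all active Pods with the function registered earlier by buildResourceToRankFunc.

Following that ranking, killPodNow is called on the Pods in order. If killing a Pod fails, that Pod is skipped and the next one in order is tried. As soon as one Pod is killed successfully, the function returns, which means that at most one Pod is killed in each pass.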
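The last few steps (pick the resource to reclaim with memory first, then walk the ranked Pods and stop after the first successful kill) can be sketched as follows. This is only an illustration: pod, pickResource, evictOne and errKill are hypothetical names, the resource names are plain strings, and ranking by highest usage is an assumption made for the sketch. The real ranking functions are the ones registered by buildResourceToRankFunc and are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errKill is returned by the fake kill function in this sketch.
var errKill = errors.New("kill failed")

// pod carries only a name and its usage of the starved resource.
type pod struct {
	name  string
	usage int64
}

// pickResource returns the resource to reclaim first: memory wins if it is
// starved, otherwise the original order is kept.
func pickResource(starved []string) string {
	if len(starved) == 0 {
		return ""
	}
	sorted := append([]string(nil), starved...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i] == "memory" && sorted[j] != "memory"
	})
	return sorted[0]
}

// evictOne tries the pods in ranked order and stops after the first
// successful kill, so at most one pod is evicted per call.
func evictOne(pods []pod, kill func(pod) error) (string, bool) {
	ranked := append([]pod(nil), pods...)
	// Assumed ranking for this sketch: highest usage first.
	sort.SliceStable(ranked, func(i, j int) bool { return ranked[i].usage > ranked[j].usage })
	for _, p := range ranked {
		if err := kill(p); err != nil {
			continue // skip this pod and try the next one
		}
		return p.name, true
	}
	return "", false
}

func main() {
	fmt.Println(pickResource([]string{"nodefs", "memory"})) // memory

	pods := []pod{{name: "small", usage: 10}, {name: "big", usage: 300}, {name: "mid", usage: 120}}
	kill := func(p pod) error {
		if p.name == "big" {
			return errKill // simulate a pod that cannot be killed
		}
		return nil
	}
	name, ok := evictOne(pods, kill)
	fmt.Println(name, ok) // mid true
}
```

Evicting a single Pod per pass keeps the loop conservative: the next synchronize run re-measures the node before deciding whether another eviction is still needed.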

Two steps in this flow matter most: reclaiming node-level resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).

pkg/kubelet/eviction/eviction_manager.go:340
// reclaimNodeLevelResources attempts to reclaim node level resources. returns true if thresholds were satisfied and no pod eviction is required.
func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceName, observations signalObservations) bool {
	nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
	for _, nodeReclaimFunc := range nodeReclaimFuncs {
		// attempt to reclaim the pressured resource.
		reclaimed, err := nodeReclaimFunc()
		if err == nil {
			// update our local observations based on the amount reported to have been reclaimed.
			// note: this is optimistic, other things could have been still consuming the pressured resource in the interim.
			signal := resourceToSignal[resourceToReclaim]
			value, ok := observations[signal]
			if !ok {
				glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
				continue
			}
			value.available.Add(*reclaimed)
			// evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
			if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
				return true
			}
		} else {
			glog.Errorf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
		}
	}
	return false