How to Use Kubernetes PDB

PDB Use Cases

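PodDisruptionBudget (PDB for short) was added around Kubernetes 1.4 and promoted to Beta in 1.5, yet it was still Beta when 1.9 was released. Leaving that aside, let's first think about what problem PDB is meant to solve. The feature has been around for more than a year, but I had never looked into it, mainly because I had no use case. Recently I have been working on a design for ElasticSearch as a Service (ESaaS) on Kubernetes, which must ensure as far as possible that every ElasticSearch cluster always has at least one healthy, available ES client pod, ES master pod, and ES data pod. Many readers will think: a Deployment already lets you set maxUnavailable, so isn't that enough? And besides, isn't the RS controller already doing replica control?

Hold on! When does a Deployment's maxUnavailable apply? It guarantees a minimum number of serving replicas only while an application deployed through a Deployment is going through a rolling update. And the RS controller? It is just one of the replica controllers, and it cannot guarantee that a given number of replicas exists at every moment. Its job is to bring the actual replica count back to the desired count as quickly as possible; it does not care what the actual count is at intermediate moments. This is where Kubernetes PDB comes in: it protects application high availability by setting budgets for voluntary disruptions.

Since voluntary disruption has come up, let's sort out what a voluntary disruption is and what an involuntary disruption is.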

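Involuntary Disruption and How to Mitigate It

Involuntary disruptions are those caused by external factors that cannot be controlled (or are currently hard to control), for example:

A hardware failure or kernel panic on a server takes the node down.

The containers run in a VM, and the VM is deleted by mistake or the hypervisor fails.

The cluster suffers a network partition (split-brain). Kubernetes handles network partitions through the NodeController, but when it evicts pods it still does not take application high availability into account. For an in-depth look at the NodeController, see my posts below: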

Kubernetes Node Controller Source Code Analysis: Execution

Kubernetes Node Controller Source Code Analysis: Creation

Kubernetes Node Controller Source Code Analysis: Configuration

Kubernetes Node Controller Source Code Analysis: Taint Controller

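A node runs short of compute resources because of unreasonable overcommitment, which triggers kubelet eviction; this too does not take application high availability into account. For an in-depth look at kubelet eviction, see my posts below: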

Kubernetes Eviction Manager Source Code Analysis

Kubernetes Eviction Manager: How It Works

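PDB does not address involuntary disruptions. So how can we reduce or mitigate their impact on application availability when running on Kubernetes?

Deploy each application through a replica controller such as a Deployment, RS, or StatefulSet whenever possible, with replicas greater than 1.

Set resource requests on the application's containers, so that they still have enough resources even when the node is under heavy resource pressure.

Also consider HA at the physical level as far as possible: spread an application's replicas across servers, across racks and cabinets, across switches, and so on.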

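PDB Protects Application Availability During Voluntary Disruptions

The opposite of involuntary disruption is, naturally, voluntary disruption: disruptions triggered by users or cluster administrators that Kubernetes can control, for example:

Deleting the controllers that manage the pods, such as a Deployment, RS, RC, or StatefulSet.

Triggering a rolling update of an application.

Deleting pods directly in bulk.

Running kubectl drain on a node (node decommissioning, cluster scale-down).

PDB is designed for voluntary disruptions, which fall within what Kubernetes can control; it is not designed for involuntary disruptions.

Once the Kube-Node project goes live, it will integrate with cloud providers such as OpenStack, AWS, and GCE to manage nodes automatically, so HNA (Horizontal Node Autoscaler) events may occur frequently. Their workflow includes logic similar to draining a node, so PDB is needed to protect application HA.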

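How to Use PDB, and What to Watch For

Usage Notes and Caveats

Every app deployed on Kubernetes can have a corresponding PDB object that limits, during voluntary disruptions, the maximum number of replicas that may be down or the minimum number that must remain available, thereby keeping the application highly available.

A PDB can protect applications managed by the Kubernetes built-in controllers listed below; in that case the PDB selector should be identical to the selector of the controller object: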

Deployment

ReplicationController

ReplicaSet

StatefulSet

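A PDB can also protect a set of pods chosen only by the PDB's own selector, but with two restrictions:

Only .spec.minAvailable can be configured; maxUnavailable cannot be used.

.spec.minAvailable must be an integer, not a percentage.

Either way, the set of pods a PDB affects is always chosen by its own selector, so take care that different PDB objects in the same namespace do not use overlapping selectors.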

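When using PDB, you need to be clear about the type of your application and the protection you want:

Stateless application: for example, you want at least 60% of the replicas to be available.

Solution: create a PDB object with minAvailable set to 60%, or maxUnavailable set to 40%.

Single-instance stateful application: the instance must not be terminated without notifying the customer and getting consent first.

Solution: create a PDB object with maxUnavailable set to 0, so that Kubernetes blocks voluntary eviction of this instance. After notifying the user and getting consent, delete the PDB to lift the block, then recreate the instance. A rolling update of a single-instance StatefulSet always involves downtime, so single-instance StatefulSets are not recommended in production.

Multi-instance stateful application: the number of available instances must not fall below some number N (for example, because of the election mechanism in Raft-style applications).

Solution: set maxUnavailable=1 or minAvailable=N. The former allows only one instance to be removed at a time; the latter allows up to expected_replicas - minAvailable instances to be removed at a time.

Batch Job: the Job needs one pod to eventually complete the task successfully.

The Job controller has its own mechanism to guarantee this, so no PDB is needed.

For an in-depth look at the Job controller, see my post: Kubernetes Job Controller Source Code Analysis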
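
The arithmetic behind these scenarios can be sketched in a few lines of Go. This is a minimal sketch, not the controller's code: resolve, desiredHealthy, and quorum are hypothetical helpers, and it assumes percentages are rounded up to a whole pod, which is my reading of the Kubernetes documentation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolve turns "2" or "60%" into a pod count. Percentages are taken of
// total and rounded up (assumption: this mirrors Kubernetes' rounding).
func resolve(value string, total int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return (pct*total + 99) / 100, nil // integer ceiling of pct% of total
	}
	return strconv.Atoi(value)
}

// desiredHealthy returns how many pods must stay healthy. Pass exactly one
// of minAvailable or maxUnavailable and leave the other empty.
func desiredHealthy(minAvailable, maxUnavailable string, expected int) (int, error) {
	if minAvailable != "" {
		return resolve(minAvailable, expected)
	}
	unavailable, err := resolve(maxUnavailable, expected)
	if err != nil {
		return 0, err
	}
	if unavailable > expected {
		unavailable = expected
	}
	return expected - unavailable, nil
}

// quorum is the majority a Raft-style cluster of the given size needs.
func quorum(members int) int {
	return members/2 + 1
}

func main() {
	for _, replicas := range []int{5, 7} {
		byMin, _ := desiredHealthy("60%", "", replicas)
		byMax, _ := desiredHealthy("", "40%", replicas)
		fmt.Printf("replicas=%d minAvailable=60%% keeps %d, maxUnavailable=40%% keeps %d\n",
			replicas, byMin, byMax)
	}
	fmt.Println("quorum for 3 members:", quorum(3))
}
```

Under the rounding assumption, the two settings agree for 5 replicas but diverge for 7, where minAvailable=60% keeps 5 pods and maxUnavailable=40% keeps only 4, so it is worth checking the numbers for your actual replica count. For the Raft case, setting minAvailable to the quorum is one natural choice of N.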

Defining a PDB Object

Having thought this through and decided to create a PDB, let's look at how a PodDisruptionBudget is defined. Here is a sample:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

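A PDB definition really comes down to three key fields:

.spec.selector selects the set of backend pods. The best practice is to keep it identical to the selector of the application's Deployment or StatefulSet.

.spec.minAvailable is the minimum number or percentage of pods that must remain available during voluntary disruptions.

.spec.maxUnavailable is the maximum number or percentage of pods that may be unavailable during voluntary disruptions. It requires Kubernetes >= 1.7 and can only be used for pods managed by a Deployment, RS, RC, or StatefulSet. Prefer .spec.maxUnavailable where possible.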

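Notes:

A single PDB object cannot define both .spec.minAvailable and .spec.maxUnavailable.

As mentioned earlier, the pod deletions and unavailability during an application's rolling update are also voluntary disruptions, but a rolling update is governed by its own strategy settings (maxSurge and maxUnavailable), so PDB does not interfere with that process.

PDB can only protect the replica count against voluntary disruptions. Suppose that in the middle of evicting pods, .spec.minAvailable or .spec.maxUnavailable is only just satisfied, and a previously healthy pod then goes down because its node fails (an involuntary disruption). The actual number of pods is now lower than the PDB requires. PDB is not a cure-all!

In practice, setting .spec.minAvailable to 100% or .spec.maxUnavailable to 0% completely blocks pod eviction (rolling updates of Deployments and StatefulSets excepted).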
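
Two of these notes are easy to check before applying a manifest. The sketch below is only an illustration: budgetSpec is a hypothetical type with integer-only fields (percentages are left out), and validate and blocksAllEvictions are made-up helper names, not part of any Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// budgetSpec is a hypothetical, integer-only view of a PDB spec.
// A nil pointer means the field is not set.
type budgetSpec struct {
	MinAvailable   *int
	MaxUnavailable *int
}

var errBothSet = errors.New("minAvailable and maxUnavailable cannot both be set")

// validate rejects a spec that sets both fields at once.
func validate(s budgetSpec) error {
	if s.MinAvailable != nil && s.MaxUnavailable != nil {
		return errBothSet
	}
	return nil
}

// blocksAllEvictions reports whether the budget leaves no room for any
// eviction even when all replicas are healthy.
func blocksAllEvictions(s budgetSpec, replicas int) bool {
	if s.MinAvailable != nil {
		return *s.MinAvailable >= replicas
	}
	if s.MaxUnavailable != nil {
		return *s.MaxUnavailable <= 0
	}
	return false
}

func intPtr(v int) *int { return &v }

func main() {
	both := budgetSpec{MinAvailable: intPtr(2), MaxUnavailable: intPtr(1)}
	fmt.Println("both fields set:", validate(both))

	single := budgetSpec{MaxUnavailable: intPtr(0)}
	fmt.Println("maxUnavailable=0 blocks every eviction:", blocksAllEvictions(single, 1))
}
```

A budget that blocks every eviction also blocks kubectl drain for the node hosting those pods, which is exactly the behavior the single-instance scenario relies on.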

Creating the PDB Object

Create the PDB object with kubectl apply -f zk-pdb.yaml, then list it:

$ kubectl get poddisruptionbudgets
NAME     MIN-AVAILABLE   ALLOWED-DISRUPTIONS   AGE
zk-pdb   2               1                     7s

Inspect it with kubectl get pdb zk-pdb -o yaml:

$ kubectl get poddisruptionbudgets zk-pdb -o yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: 2017-08-28T02:38:26Z
  generation: 1
  name: zk-pdb
status:
  currentHealthy: 3
  desiredHealthy: 3
  disruptedPods: null
  disruptionsAllowed: 1
  expectedPods: 3
  observedGeneration: 1

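How PDB Works: A Source Code Walkthrough

A PDB object describes the state the user expects during voluntary disruptions. What actually maintains that state is a controller managed by kube-controller-manager: the Disruption Controller.

The Disruption Controller mainly watches Pods and PDBs. When it sees an Add/Delete/Update event for a pod or a PDB, it puts the corresponding PDB object on a rate-limited queue for a worker to process. The worker's main job is to compute the currentHealthy, desiredHealthy, expectedCount, and disruptedPods values of PodDisruptionBudgetStatus, and then call the API to update the PDB status.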

pkg/controller/disruption/disruption.go:498
func (dc *DisruptionController) trySync(pdb *policy.PodDisruptionBudget) error {
	pods, err := dc.getPodsForPdb(pdb)
	if err != nil {
		dc.recorder.Eventf(pdb, v1.EventTypeWarning, "NoPods", "Failed to get pods: %v", err)
		return err
	}
	if len(pods) == 0 {
		dc.recorder.Eventf(pdb, v1.EventTypeNormal, "NoPods", "No matching pods found")
	}

	expectedCount, desiredHealthy, err := dc.getExpectedPodCount(pdb, pods)
	if err != nil {
		dc.recorder.Eventf(pdb, v1.EventTypeWarning, "CalculateExpectedPodCountFailed", "Failed to calculate the number of expected pods: %v", err)
		return err
	}

	currentTime := time.Now()
	disruptedPods, recheckTime := dc.buildDisruptedPodMap(pods, pdb, currentTime)
	currentHealthy := countHealthyPods(pods, disruptedPods, currentTime)
	err = dc.updatePdbStatus(pdb, currentHealthy, desiredHealthy, expectedCount, disruptedPods)

	if err == nil && recheckTime != nil {
		// There is always at most one PDB waiting with a particular name in the queue,
		// and each PDB in the queue is associated with the lowest timestamp
		// that was supplied when a PDB with that name was added.
		dc.enqueuePdbForRecheck(pdb, recheckTime.Sub(currentTime))
	}
	return err
}

Below is the definition of PodDisruptionBudgetStatus:

pkg/apis/policy/types.go:48
type PodDisruptionBudgetStatus struct {
	// Most recent generation observed when updating this PDB status. PodDisruptionsAllowed and other
	// status information is valid only if observedGeneration equals to PDB's object generation.
	// +optional
	ObservedGeneration int64 `json:"observedGeneration,omitempty" protobuf:"varint,1,opt,name=observedGeneration"`

	// DisruptedPods contains information about pods whose eviction was
	// processed by the API server eviction subresource handler but has not
	// yet been observed by the PodDisruptionBudget controller.
	// A pod will be in this map from the time when the API server processed the
	// eviction request to the time when the pod is seen by PDB controller
	// as having been marked for deletion (or after a timeout). The key in the map is the name of the pod
	// and the value is the time when the API server processed the eviction request. If
	// the deletion didn't occur and a pod is still there it will be removed from
	// the list automatically by PodDisruptionBudget controller after some time.
	// If everything goes smooth this map should be empty for the most of the time.
	// Large number of entries in the map may indicate problems with pod deletions.
	DisruptedPods map[string]metav1.Time `json:"disruptedPods" protobuf:"bytes,2,rep,name=disruptedPods"`

	// Number of pod disruptions that are currently allowed.
	PodDisruptionsAllowed int32 `json:"disruptionsAllowed" protobuf:"varint,3,opt,name=disruptionsAllowed"`

	// current number of healthy pods
	CurrentHealthy int32 `json:"currentHealthy" protobuf:"varint,4,opt,name=currentHealthy"`

	// minimum desired number of healthy pods
	DesiredHealthy int32 `json:"desiredHealthy" protobuf:"varint,5,opt,name=desiredHealthy"`

	// total number of pods counted by this disruption budget
	ExpectedPods int32 `json:"expectedPods" protobuf:"varint,6,opt,name=expectedPods"`
}

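The most important fields of PodDisruptionBudgetStatus are DisruptedPods and PodDisruptionsAllowed:

DisruptedPods: holds the pods whose eviction has already been processed by the apiserver's pod eviction subresource but has not yet been observed and handled by the PDB controller. It is a map whose key is the pod name and whose value is the time at which the apiserver accepted the eviction request. Each entry has a 2-minute timeout: if the pod has still not been deleted after 2 minutes, it is removed from the map.

PodDisruptionsAllowed: the number of pods that may currently be disrupted.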

The Disruption Controller's main job is to update PDB.Status. That raises a question: who actually enforces maxUnavailable or minAvailable when pods are evicted during a voluntary disruption?

As a reminder, PDB only comes into play for pods that are removed through requests to the pod eviction subresource, so the answer lies in the pod's EvictionREST.

pkg/registry/core/pod/storage/eviction.go:81
// Create attempts to create a new eviction. That is, it tries to evict a pod.
func (r *EvictionREST) Create(ctx genericapirequest.Context, obj runtime.Object, createValidation rest.ValidateObjectFunc, includeUninitialized bool) (runtime.Object, error) {
	eviction := obj.(*policy.Eviction)

	obj, err := r.store.Get(ctx, eviction.Name, &metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	pod := obj.(*api.Pod)
	var rtStatus *metav1.Status
	var pdbName string
	err = retry.RetryOnConflict(EvictionsRetry, func() error {
		pdbs, err := r.getPodDisruptionBudgets(ctx, pod)
		if err != nil {
			return err
		}

		if len(pdbs) > 1 {
			rtStatus = &metav1.Status{
				Status:  metav1.StatusFailure,
				Message: "This pod has more than one PodDisruptionBudget, which the eviction subresource does not support.",
				Code:    500,
			}
			return nil
		} else if len(pdbs) == 1 {
			pdb := pdbs[0]
			pdbName = pdb.Name
			// Try to verify-and-decrement

			// If it was false already, or if it becomes false during the course of our retries,
			// raise an error marked as a 429.
			if err := r.checkAndDecrement(pod.Namespace, pod.Name, pdb); err != nil {
				return err
			}
		}
		return nil
	})
	if err == wait.ErrWaitTimeout {
		err = errors.NewTimeoutError(fmt.Sprintf("couldn't update PodDisruptionBudget %q due to conflicts", pdbName), 10)
	}
	if err != nil {
		return nil, err
	}

	if rtStatus != nil {
		return rtStatus, nil
	}

	// At this point there was either no PDB or we succeeded in decrementing

	// Try the delete
	_, _, err = r.store.Delete(ctx, eviction.Name, eviction.DeleteOptions)
	if err != nil {
		return nil, err
	}

	// Success!
	return &metav1.Status{Status: metav1.StatusSuccess}, nil
}

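When an eviction is requested through EvictionREST, it checks that the pod is matched by at most one PDB and returns an error otherwise. For how to use the Eviction API, see The Eviction API; only a simple sample is given here: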

{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
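
The same request can be built from Go with only the standard library. This is a rough sketch rather than a client: the eviction and objectMeta structs are hand-written stand-ins for the real API types, evictionRequest and shouldRetry are hypothetical helpers, and the request is constructed but not sent. The retry rule assumes that a 429 means the budget currently forbids the eviction, while a 500 signals a misconfiguration such as several PDBs selecting the pod.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// objectMeta and eviction are hand-written stand-ins for the API types,
// carrying only the fields used in the JSON sample above.
type objectMeta struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type eviction struct {
	APIVersion string     `json:"apiVersion"`
	Kind       string     `json:"kind"`
	Metadata   objectMeta `json:"metadata"`
}

// evictionRequest builds (but does not send) the POST to the pod's
// eviction subresource.
func evictionRequest(apiServer, namespace, pod string) (*http.Request, error) {
	body, err := json.Marshal(eviction{
		APIVersion: "policy/v1beta1",
		Kind:       "Eviction",
		Metadata:   objectMeta{Name: pod, Namespace: namespace},
	})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/v1/namespaces/%s/pods/%s/eviction", apiServer, namespace, pod)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// shouldRetry reports whether a response code means "not now, try again later".
func shouldRetry(statusCode int) bool {
	return statusCode == http.StatusTooManyRequests // 429: budget currently exhausted
}

func main() {
	req, err := evictionRequest("http://127.0.0.1:8080", "default", "quux")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```

A drain-style loop would send this request for each pod on the node and back off whenever shouldRetry returns true, instead of falling back to a plain delete, which bypasses the budget.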

Next, checkAndDecrement checks whether the PDB's maxUnavailable or minAvailable would still be satisfied; if so, it decrements pdb.Status.PodDisruptionsAllowed by 1.

If checkAndDecrement succeeds, the pod is actually deleted.

// checkAndDecrement checks if the provided PodDisruptionBudget allows any disruption.
func (r *EvictionREST) checkAndDecrement(namespace string, podName string, pdb policy.PodDisruptionBudget) error {
	if pdb.Status.ObservedGeneration < pdb.Generation {
		// TODO(mml): Add a Retry-After header. Once there are time-based
		// budgets, we can sometimes compute a sensible suggested value. But
		// even without that, we can give a suggestion (10 minutes?) that
		// prevents well-behaved clients from hammering us.
		err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
		err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s is still being processed by the server.", pdb.Name)})
		return err
	}
	if pdb.Status.PodDisruptionsAllowed < 0 {
		return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("pdb disruptions allowed is negative"))
	}
	if len(pdb.Status.DisruptedPods) > MaxDisruptedPodSize {
		return errors.NewForbidden(policy.Resource("poddisruptionbudget"), pdb.Name, fmt.Errorf("DisruptedPods map too big - too many evictions not confirmed by PDB controller"))
	}
	if pdb.Status.PodDisruptionsAllowed == 0 {
		err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
		err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{Type: "DisruptionBudget", Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy)})
		return err
	}

	pdb.Status.PodDisruptionsAllowed--
	if pdb.Status.DisruptedPods == nil {
		pdb.Status.DisruptedPods = make(map[string]metav1.Time)
	}
	// Eviction handler needs to inform the PDB controller that it is about to delete a pod
	// so it should not consider it as available in calculations when updating PodDisruptions allowed.
	// If the pod is not deleted within a reasonable time limit PDB controller will assume that it won't
	// be deleted at all and remove it from DisruptedPod map.
	pdb.Status.DisruptedPods[podName] = metav1.Time{Time: time.Now()}
	if _, err := r.podDisruptionBudgetClient.PodDisruptionBudgets(namespace).UpdateStatus(&pdb); err != nil {
		return err
	}

	return nil
}

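checkAndDecrement mainly checks that pdb.Status.PodDisruptionsAllowed is greater than 0 and that DisruptedPods holds no more than 2000 pods (the Disruption Controller's performance may not be able to keep up with more).

If the checks pass, it decrements pdb.Status.PodDisruptionsAllowed by 1 and adds the pod to the DisruptedPods map, with the current time (the time the apiserver accepted the eviction request) as the map value.

It then updates the PDB. Because the PDB controller watches PDB Update events, this triggers the PDB controller's logic, which recomputes the PDB status once more.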

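Note: PDB is also used in the scheduler. During preemptive scheduling based on pod priority, generic_scheduler checks the pods on each candidate node against their PDBs when preempting, counts the pods whose preemption would violate a PDB, and when selecting a node prefers the one with fewer PDB-violating pods.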
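
The node preference described in this note amounts to picking a minimum. The sketch below shows only that one criterion: candidate and pickNode are hypothetical names, and the tie-breaking rules the real scheduler applies afterwards are deliberately left out.

```go
package main

import "fmt"

// candidate is a hypothetical summary of one node considered for preemption.
type candidate struct {
	Node          string
	PDBViolations int // victims on this node whose preemption would violate a PDB
}

// pickNode returns the candidate with the fewest PDB violations. On a tie the
// earlier candidate wins; the real scheduler applies further criteria instead.
func pickNode(candidates []candidate) (string, bool) {
	if len(candidates) == 0 {
		return "", false
	}
	best := candidates[0]
	for _, c := range candidates[1:] {
		if c.PDBViolations < best.PDBViolations {
			best = c
		}
	}
	return best.Node, true
}

func main() {
	node, ok := pickNode([]candidate{
		{Node: "node-a", PDBViolations: 2},
		{Node: "node-b", PDBViolations: 0},
		{Node: "node-c", PDBViolations: 1},
	})
	fmt.Println(node, ok)
}
```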
