What Is the Kubernetes Resource QoS Mechanism?


This article explains the Kubernetes Resource QoS mechanism: how a Pod's QoS class is derived from its containers' resource requests and limits, how those guarantees are enforced for compressible and incompressible resources, and how the kubelet configures OOM scores for each class.

Introduction to Kubernetes Resource QoS Classes

Kubernetes derives a Pod's QoS class from the request and limit values of the Resources of the Pod's Containers.

For each Resource, containers fall into one of three QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of QoS priority.

Guaranteed: if, for every Container in the Pod, the limit and request of every Resource are equal and non-zero, the Pod's QoS class is Guaranteed.

Note that if a container specifies only a limit and no request, the request is taken to be equal to the limit.

Examples:

containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi

containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi

Best-Effort: if no Container in the Pod has set a request or limit for any Resource, the Pod's QoS class is Best-Effort.

Examples:

containers:
- name: foo
  resources:
- name: bar
  resources:

Burstable: any Pod that matches neither the Guaranteed nor the Best-Effort criteria has the Burstable QoS class.

When a limit is left unspecified, its effective value is the capacity of the corresponding Resource on the Node.

Examples:

Container bar specifies no Resources at all.

containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar

Containers foo and bar set limits for different Resources.

containers:
- name: foo
  resources:
    limits:
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m

Container foo specifies no limit, and container bar specifies neither requests nor limits.

containers:
- name: foo
  resources:
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar

Compressible vs. Incompressible Resources

When scheduling, kube-scheduler selects a Node based on the Pod's request values. A Pod and all of its Containers are never allowed to consume more than the effective value specified by their limits (if set).

How the request and limit are enforced depends on whether the resource is compressible or incompressible.

Compressible Resource Guarantees

For now, we are only supporting CPU.

Pods are guaranteed to get the amount of CPU they request; they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because CPU isolation is at the container level. Pod-level cgroups will be introduced soon to achieve this goal.

Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests for 600 milli CPUs, and container B requests for 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 10 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).

Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
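
The proportional sharing and throttling described above are ultimately enforced through cgroup settings derived from the container's CPU request and limit: requests map to cpu.shares and limits map to a CFS quota. The following Go sketch shows the conventional conversions (1 CPU = 1024 shares, 100 ms CFS period); the function names are illustrative, not the kubelet's actual API.

package main

import "fmt"

const (
	sharesPerCPU = 1024   // cgroup cpu.shares granted per full CPU
	milliPerCPU  = 1000   // milli-CPUs per CPU
	cfsPeriodUs  = 100000 // default CFS period: 100ms, in microseconds
)

// milliCPUToShares converts a CPU request into cgroup cpu.shares,
// which controls how excess CPU is divided among busy containers.
func milliCPUToShares(milliCPU int64) int64 {
	return milliCPU * sharesPerCPU / milliPerCPU
}

// milliCPUToQuota converts a CPU limit into a CFS quota (microseconds of
// CPU time per period); exceeding it causes the container to be throttled.
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * cfsPeriodUs / milliPerCPU
}

func main() {
	// Container A requests 600m and container B requests 300m:
	// their shares are 614 and 307, so spare CPU is split roughly 2:1.
	fmt.Println(milliCPUToShares(600), milliCPUToShares(300))
	// A limit of 500m allows 50ms of CPU time per 100ms period.
	fmt.Println(milliCPUToQuota(500), cfsPeriodUs)
}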

Incompressible Resource Guarantees

For now, we are only supporting memory.

Pods will get the amount of memory they request; if they exceed their memory request, they could be killed (if some other pod needs the memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).

When Pods use more memory than their limit, the process using the most memory inside one of the pod's containers will be killed by the kernel.

Admission/Scheduling Policy

Pods will be admitted by the kubelet and scheduled by the scheduler based on the sum of the requests of their containers. Both the scheduler and the kubelet will ensure that the sum of the requests of all containers is within the node's allocatable capacity (for both memory and CPU).
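
As a rough illustration of that admission check, the sketch below sums per-container requests and compares them against a node's allocatable capacity. The types and numbers are hypothetical and deliberately simplified; they are not the actual scheduler code.

package main

import "fmt"

// containerRequest is a simplified stand-in for a container's resource requests.
type containerRequest struct {
	milliCPU    int64
	memoryBytes int64
}

// fitsOnNode reports whether the summed requests of all containers in a pod
// are within the node's allocatable CPU and memory.
func fitsOnNode(containers []containerRequest, allocMilliCPU, allocMemory int64) bool {
	var cpu, mem int64
	for _, c := range containers {
		cpu += c.milliCPU
		mem += c.memoryBytes
	}
	return cpu <= allocMilliCPU && mem <= allocMemory
}

func main() {
	pod := []containerRequest{
		{milliCPU: 600, memoryBytes: 1 << 30},   // 600m CPU, 1Gi memory
		{milliCPU: 300, memoryBytes: 512 << 20}, // 300m CPU, 512Mi memory
	}
	// A node with 2 CPUs and 4Gi allocatable can admit this pod.
	fmt.Println(fitsOnNode(pod, 2000, 4<<30)) // true
}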

How Resources Are Reclaimed for Different QoS Classes

CPU: Pods will not be killed if CPU guarantees cannot be met (for example, if system tasks or daemons take up lots of CPU); they will be temporarily throttled instead.

Memory: Memory is an incompressible resource, so let's discuss the semantics of memory management a bit.

Best-Effort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.

Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.

Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.

OOM Score configuration at the Nodes

Pod OOM score configuration

Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed.

The base OOM score is between 0 and 1000, so if process A's OOM_SCORE_ADJ minus process B's OOM_SCORE_ADJ is over 1000, then process A will always be OOM killed before B.

The final OOM score of a process is also between 0 and 1000.

Best-effort

Set OOM_SCORE_ADJ: 1000

So processes in best-effort containers will have an OOM_SCORE of 1000

Guaranteed

Set OOM_SCORE_ADJ: -998

So processes in guaranteed containers will have an OOM_SCORE of 0 or 1

Burstable

If the total memory request is > 99.8% of available memory, set OOM_SCORE_ADJ to 2.

Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested).

This ensures that the OOM_SCORE of a burstable pod is > 1 (a worked numeric example follows below).

If memory request is 0, OOM_SCORE_ADJ is set to 999.

So burstable pods will be killed if they conflict with guaranteed pods

If a burstable pod uses less memory than requested, its OOM_SCORE will be < 1000.

So best-effort pods will be killed if they conflict with burstable pods using less than requested memory

If a process in a burstable pod's container uses more memory than what the container requested, its OOM_SCORE will be 1000; if not, its OOM_SCORE will be < 1000.

Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed.

If burstable pods' containers with multiple processes conflict, then the formula for OOM scores is only a heuristic; it will not ensure Request and Limit guarantees.
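
To make the heuristic concrete, here is a small worked example with hypothetical numbers: on a node with 10Gi of memory, a burstable container requesting 1Gi (10% of the node) gets an OOM_SCORE_ADJ of 900, so a process staying under the request keeps a final OOM score below 1000, while one exceeding the request reaches 1000 and becomes the prime OOM-kill target.

package main

import "fmt"

// burstableOOMScoreAdj computes the burstable adjustment described above:
// 1000 - 10 * (% of node memory requested).
func burstableOOMScoreAdj(memoryRequest, memoryCapacity int64) int64 {
	return 1000 - (1000*memoryRequest)/memoryCapacity
}

// finalOOMScore applies the kernel heuristic: roughly 10 points per percent
// of node memory a process uses, plus the adjustment, clamped to [0, 1000].
func finalOOMScore(memoryUsed, memoryCapacity, adj int64) int64 {
	score := (1000*memoryUsed)/memoryCapacity + adj
	if score > 1000 {
		score = 1000
	}
	if score < 0 {
		score = 0
	}
	return score
}

func main() {
	capacity := int64(10 << 30) // hypothetical node with 10Gi of memory
	request := int64(1 << 30)   // container requests 1Gi, i.e. 10% of the node

	adj := burstableOOMScoreAdj(request, capacity)
	fmt.Println(adj) // 900

	// Process using 512Mi (5% of the node, under its request): 50 + 900 = 950 < 1000.
	fmt.Println(finalOOMScore(512<<20, capacity, adj))
	// Process using 2Gi (20% of the node, over its request): 200 + 900, clamped to 1000.
	fmt.Println(finalOOMScore(2<<30, capacity, adj))
}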

Pod infra containers or Special Pod init process

OOM_SCORE_ADJ: -998

Kubelet, Docker

OOM_SCORE_ADJ: -999 (won’t be OOM killed)

Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume.

Source Code Analysis

The QoS source code lives under pkg/kubelet/qos. The code is quite simple, consisting mainly of two files: pkg/kubelet/qos/policy.go and pkg/kubelet/qos/qos.go.

The OOM_SCORE_ADJ values for each QoS class discussed above are defined in:

pkg/kubelet/qos/policy.go:21
const (
 PodInfraOOMAdj int = -998
 KubeletOOMScoreAdj int = -999
 DockerOOMScoreAdj int = -999
 KubeProxyOOMScoreAdj int = -999
 guaranteedOOMScoreAdj int = -998
 besteffortOOMScoreAdj int = 1000
)

The method that computes a container's OOM_SCORE_ADJ is defined in:

pkg/kubelet/qos/policy.go:40
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	switch GetPodQOS(pod) {
	case Guaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case BestEffort:
		return besteffortOOMScoreAdj
	}

	// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
	// we want to protect Burstable containers that consume less memory than requested.
	// The formula below is a heuristic. A container requesting for 10% of a system's
	// memory will have an OOM score adjust of 900. If a process in container Y
	// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
	// which use more than their request will have an OOM score of 1000 and will be prime
	// targets for OOM kills.
	// Note that this is a heuristic, it won't work if a container has many small processes.
	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
	// that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}

The method that returns a Pod's QoS class is:

pkg/kubelet/qos/qos.go:50
// GetPodQOS returns the QoS class of a pod.
// A pod is besteffort if none of its containers have specified any requests or limits.
// A pod is guaranteed only when requests and limits are specified for all the containers and they are equal.
// A pod is burstable if limits and requests do not match across all containers.
func GetPodQOS(pod *v1.Pod) QOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	for _, container := range pod.Spec.Containers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.Copy()
				if _, exists := requests[name]; !exists {
					requests[name] = *delta
				} else {
					delta.Add(requests[name])
					requests[name] = *delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.Copy()
				if _, exists := limits[name]; !exists {
					limits[name] = *delta
				} else {
					delta.Add(limits[name])
					limits[name] = *delta
				}
			}
		}
		if len(qosLimitsFound) != len(supportedQoSComputeResources) {
			isGuaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return BestEffort
	}
	// Check if requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed &&
		len(requests) == len(limits) {
		return Guaranteed
	}
	return Burstable
}

GetPodQOS is called by the eviction_manager and during the scheduler's Predicates phase; in other words, it is used both when Kubernetes handles overcommitment (evictions) and during the scheduling pre-selection stage.
