How to Implement Progressive Canary Releases Based on Mixerless Telemetry

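As a CNCF member, Weave Flagger provides capabilities for continuous integration and continuous delivery. Flagger groups progressive delivery into three types:

Canary release (Canary): progressively shifts traffic to the canary version (progressive traffic shifting)

A/B testing (A/B Testing): routes user requests to version A or version B based on request information (HTTP headers and cookies traffic routing)

Blue/green release (Blue/Green): switches and mirrors traffic (traffic switching and mirroring)

This article walks through progressive canary release with Flagger on ASM.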

Setup Flagger

1 Deploy Flagger

Run the following commands to deploy Flagger (for the full script, see demo_canary.sh).

alias k="kubectl --kubeconfig $USER_CONFIG"
alias h="helm --kubeconfig $USER_CONFIG"
cp $MESH_CONFIG kubeconfig
k -n istio-system create secret generic istio-kubeconfig --from-file kubeconfig
k -n istio-system label secret istio-kubeconfig istio/multiCluster=true
h repo add flagger https://flagger.app
h repo update
k apply -f $FLAAGER_SRC/artifacts/flagger/crd.yaml
h upgrade -i flagger flagger/flagger --namespace=istio-system \
 --set crd.create=false \
 --set meshProvider=istio \
 --set metricsServer=http://prometheus:9090 \
 --set istio.kubeconfig.secretName=istio-kubeconfig \
 --set istio.kubeconfig.key=kubeconfig

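2 Deploy the Gateway

During a canary release, Flagger asks ASM to update the VirtualService that configures canary traffic, and this VirtualService uses a Gateway named public-gateway. We therefore create the Gateway configuration file public-gateway.yaml as follows: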

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: public-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"

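Run the following command to deploy the Gateway:

kubectl --kubeconfig "$MESH_CONFIG" apply -f resources_canary/public-gateway.yaml

3 Deploy flagger-loadtester

flagger-loadtester is an application that probes the canary pod instances during the canary release phase.

Run the following command to deploy flagger-loadtester:

kubectl --kubeconfig "$USER_CONFIG" apply -k "https://github.com/fluxcd/flagger//kustomize/tester?ref=main"

4 Deploy PodInfo and its HPA

We start with the HPA configuration that ships with the Flagger distribution (an operations-level HPA). Once the full workflow is complete, we switch to an application-level HPA.

Run the following command to deploy PodInfo and its HPA:

kubectl --kubeconfig "$USER_CONFIG" apply -k "https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main"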

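Progressive canary release

1 Deploy the Canary

Canary is the core CRD for canary releases with Flagger; see How it works for details. We first deploy the Canary configuration file podinfo-canary.yaml shown below and run through the complete progressive canary workflow. On that basis we then introduce application-level monitoring metrics to make the progressive canary release application-aware.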

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # service port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
    # Istio gateways (optional)
    gateways:
      - public-gateway.istio-system.svc.cluster.local
    # Istio virtual service host names (optional)
    hosts:
      - "*"
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          max: 500
        interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

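Run the following command to deploy the Canary:

kubectl --kubeconfig "$USER_CONFIG" apply -f resources_canary/podinfo-canary.yaml

After the Canary is deployed, Flagger copies the Deployment named podinfo to podinfo-primary and scales podinfo-primary up to the minimum number of pods defined by the HPA. It then gradually scales the podinfo Deployment down to 0 pods. In other words, podinfo serves as the canary Deployment and podinfo-primary serves as the production Deployment.

At the same time, three Services are created: podinfo, podinfo-primary, and podinfo-canary. The first two point to the podinfo-primary Deployment, and the last one points to the podinfo Deployment.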
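The Service-to-Deployment mapping described above can be summarised as a small lookup. The sketch below is illustrative only: backend_for is a hypothetical helper, not part of Flagger or kubectl, and it simply encodes the naming described in the text for a target Deployment such as podinfo.

```shell
# backend_for SERVICE TARGET
# Print the Deployment that SERVICE is expected to route to,
# following the naming described in the text (hypothetical helper).
backend_for() {
  service=$1
  target=$2
  case "$service" in
    "$target"|"$target-primary") echo "$target-primary" ;;  # production traffic
    "$target-canary")            echo "$target" ;;          # canary traffic
    *)                           echo "unknown" ;;
  esac
}

backend_for podinfo podinfo          # prints: podinfo-primary
backend_for podinfo-canary podinfo   # prints: podinfo
```

On a real cluster the same relationship can be checked by comparing the selectors of the three Services with the labels of the two Deployments.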

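2 Upgrade podinfo

Run the following command to upgrade the canary Deployment from version 3.1.0 to 3.1.1:

kubectl --kubeconfig "$USER_CONFIG" -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1

3 Progressive canary release

At this point Flagger starts the progressive canary release process described in the first article of this series. In brief, the main steps are:

Gradually scale up the canary pods and verify

Progressively shift traffic and verify

Perform a rolling update of the production Deployment and verify

Switch 100% of traffic back to production

Scale the canary pods down to 0

We can watch the progressive traffic shifting with the following command:

while true; do kubectl --kubeconfig "$USER_CONFIG" -n test describe canary/podinfo; sleep 10s; done

The output looks like this: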

Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Warning Synced 39m flagger podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
 Normal Synced 38m (x2 over 39m) flagger all the metrics providers are available!
 Normal Synced 38m flagger Initialization done! podinfo.test
 Normal Synced 37m flagger New revision detected! Scaling up podinfo.test
 Normal Synced 36m flagger Starting canary analysis for podinfo.test
 Normal Synced 36m flagger Pre-rollout check acceptance-test passed
 Normal Synced 36m flagger Advance podinfo.test canary weight 10
 Normal Synced 35m flagger Advance podinfo.test canary weight 20
 Normal Synced 34m flagger Advance podinfo.test canary weight 30
 Normal Synced 33m flagger Advance podinfo.test canary weight 40
 Normal Synced 29m (x4 over 32m) flagger (combined from similar events): Promotion completed! Scaling down podinfo.test
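
The weights in the events above follow from stepWeight: 10 and maxWeight: 50 in the Canary. As a rough sketch, assuming Flagger raises the weight by stepWeight once per analysis interval until it reaches maxWeight, the schedule and the minimum time to promotion can be estimated as follows. canary_weights and min_promotion_minutes are hypothetical helper names, not Flagger commands.

```shell
# canary_weights STEP MAX
# Print the traffic weights a canary would pass through (hypothetical helper).
canary_weights() {
  step=$1
  max=$2
  weight=$step
  schedule=""
  while [ "$weight" -le "$max" ]; do
    schedule="${schedule:+$schedule }$weight"
    weight=$((weight + step))
  done
  echo "$schedule"
}

# min_promotion_minutes INTERVAL_MINUTES STEP MAX
# Estimate: interval * (maxWeight / stepWeight).
min_promotion_minutes() {
  echo $(( $1 * ($3 / $2) ))
}

canary_weights 10 50            # prints: 10 20 30 40 50
min_promotion_minutes 1 10 50   # prints: 5
```

With interval: 1m this gives at least five minutes of analysis, which is in line with the one-minute spacing of the Advance events above.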

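The corresponding Kiali view (optional) is shown in the following figure:

This completes a full progressive canary release. The remaining sections are extended reading.

Application-level autoscaling during canary release

Having completed the progressive canary release workflow above, let us look at the HPA settings in that Canary configuration.

autoscalerRef:
  apiVersion: autoscaling/v2beta2
  kind: HorizontalPodAutoscaler
  name: podinfo

This HPA named podinfo is the configuration that ships with Flagger. It scales out when the CPU utilization of the canary Deployment reaches 99%. The full configuration is as follows: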

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # scale up if usage is above
          # 99% of the requested CPU (100m)
          averageUtilization: 99

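The previous article covered application-level autoscaling in practice. Here we apply it to the canary release process.

1 An HPA that is aware of application QPS

Run the following command to deploy an HPA that is aware of the application's request count and scales out when QPS reaches 10 (for the full script, see advanced_canary.sh):

kubectl --kubeconfig "$USER_CONFIG" apply -f resources_hpa/requests_total_hpa.yaml

Accordingly, update the Canary configuration to:

autoscalerRef:
  apiVersion: autoscaling/v2beta2
  kind: HorizontalPodAutoscaler
  name: podinfo-total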
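
To make the QPS-based scaling concrete, here is a minimal sketch of the replica count an HPA would ask for with an average-value target of 10 requests per second per pod. It assumes the standard Kubernetes rule (desired replicas = ceil(total metric / per-pod target), clamped to the minimum and maximum replicas) and ignores the HPA tolerance and stabilization settings. hpa_desired_replicas is a hypothetical helper, the QPS inputs are made-up values, and the bounds 1 and 5 are taken from the MINPODS and MAXPODS columns of the sample output later in this article.

```shell
# hpa_desired_replicas TOTAL_QPS TARGET_PER_POD MIN MAX
# Integer ceiling of TOTAL_QPS / TARGET_PER_POD, clamped to [MIN, MAX].
hpa_desired_replicas() {
  total_qps=$1
  target=$2
  min=$3
  max=$4
  desired=$(( (total_qps + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then
    desired=$min
  fi
  if [ "$desired" -gt "$max" ]; then
    desired=$max
  fi
  echo "$desired"
}

hpa_desired_replicas 35 10 1 5   # prints: 4
hpa_desired_replicas 80 10 1 5   # prints: 5 (capped at the maximum)
```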

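2 Upgrade podinfo

Run the following command to upgrade the canary Deployment from version 3.1.0 to 3.1.1:

kubectl --kubeconfig "$USER_CONFIG" -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1

3 Verify the progressive canary release and the HPA

Run the following command to watch the progressive traffic shifting:

while true; do k -n test describe canary/podinfo; sleep 10s; done

During the progressive canary release (after the message Advance podinfo.test canary weight 10 appears, see the figure below), we use the following commands to send requests through the ingress gateway and raise the QPS:

INGRESS_GATEWAY=$(kubectl --kubeconfig "$USER_CONFIG" -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
hey -z 20m -c 2 -q 10 "http://$INGRESS_GATEWAY"

Use the following command to watch the progress of the canary release:

watch kubectl --kubeconfig "$USER_CONFIG" get canaries --all-namespaces

Use the following command to watch the replica count of the HPA change:

watch kubectl --kubeconfig "$USER_CONFIG" -n test get hpa/podinfo-total

The result is shown in the figure below: at one point during the progressive canary release, when traffic had been shifted to 30%, the canary Deployment had 4 replicas: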

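Application-level monitoring metrics during canary release

Having completed the application-level autoscaling above, we finally look at the metrics settings in that Canary configuration: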

analysis:
  metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
  # testing (optional)
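
How a threshold such as min: 99 interacts with threshold: 5 in the analysis section can be sketched as follows. This assumes a check counts as failed when the measured success rate drops below thresholdRange.min, and that the canary is rolled back once the number of failed checks reaches the threshold. The helper names and the sample values are hypothetical; awk is used only because shell arithmetic is integer-only.

```shell
# count_failed_checks MIN SAMPLE...
# Print how many samples fall below MIN (hypothetical helper).
count_failed_checks() {
  min=$1
  shift
  failed=0
  for sample in "$@"; do
    # awk exits with status 0 when the sample is below the minimum
    if awk -v v="$sample" -v m="$min" 'BEGIN { exit !(v < m) }'; then
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
}

# decide THRESHOLD FAILED
# Print "rollback" once FAILED reaches THRESHOLD, otherwise "continue".
decide() {
  if [ "$2" -ge "$1" ]; then
    echo rollback
  else
    echo continue
  fi
}

failed=$(count_failed_checks 99 99.8 98.5 100 97)
echo "$failed"          # prints: 2
decide 5 "$failed"      # prints: continue
```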

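1 Flagger built-in metrics

So far, the metrics configured in the Canary have been Flagger's two built-in metrics: request success rate (request-success-rate) and request duration (request-duration). The figure below shows how Flagger defines the built-in metrics on different platforms; for istio, it uses the telemetry data from Mixerless Telemetry introduced in the first article of this series.

2 Custom metrics

To show the extra flexibility that telemetry data brings to validating the canary environment during a release, we again take istio_requests_total as an example and create a MetricTemplate named not-found-percentage, which computes the share of requests that return a 404 status code out of all requests.

The configuration file metrics-404.yaml is as follows (for the full script, see advanced_canary.sh):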

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!="404"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"
            }[{{ interval }}]
        )
    ) * 100

Run the following command to create the MetricTemplate above:

k apply -f resources_canary2/metrics-404.yaml

Accordingly, update the metrics configuration in the Canary to:

analysis:
  metrics:
    - name: "404s percentage"
      templateRef:
        name: not-found-percentage
        namespace: istio-system
      thresholdRange:
        max: 5
      interval: 1m
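
As a worked example of what the not-found-percentage query evaluates to: in PromQL, / and * bind tighter than -, so the expression reads as 100 - (non-404 rate / total rate) * 100. The sketch below does the same arithmetic on two plain numbers and compares the result with thresholdRange.max. The function names and the request rates are made up for illustration, and it assumes the check passes while the value is at most max.

```shell
# not_found_percentage NON_404_RATE TOTAL_RATE
# Same arithmetic as the query: 100 - (non-404 / total) * 100.
not_found_percentage() {
  awk -v non404="$1" -v total="$2" \
    'BEGIN { printf "%.2f\n", 100 - non404 / total * 100 }'
}

# within_max VALUE MAX
# Print "pass" if VALUE <= MAX, otherwise "fail" (hypothetical helper).
within_max() {
  if awk -v v="$1" -v m="$2" 'BEGIN { exit !(v <= m) }'; then
    echo pass
  else
    echo fail
  fi
}

# Example: 200 requests/s in total, 190 of them not answered with 404
pct=$(not_found_percentage 190 200)
echo "$pct"            # prints: 5.00
within_max "$pct" 5    # prints: pass
```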

3 Final verification

Finally, we run the complete experiment script in one go. The script advanced_canary.sh is outlined below:

#!/usr/bin/env sh
# Resolve the directory that contains this script
SCRIPT_PATH="$(
    cd "$(dirname "$0")" >/dev/null 2>&1
    pwd -P
)"
cd "$SCRIPT_PATH" || exit
# Load the variables used below; ". ./config" is the POSIX form of "source config"
. ./config
alias k="kubectl --kubeconfig $USER_CONFIG"
alias m="kubectl --kubeconfig $MESH_CONFIG"
alias h="helm --kubeconfig $USER_CONFIG"
echo "#### I Bootstrap ####"
echo "1 Create a test namespace with Istio sidecar injection enabled:"
k delete ns test
m delete ns test
k create ns test
m create ns test
m label namespace test istio-injection=enabled
echo "2 Create a deployment and a horizontal pod autoscaler:"
k apply -f $FLAAGER_SRC/kustomize/podinfo/deployment.yaml -n test
k apply -f resources_hpa/requests_total_hpa.yaml
k get hpa -n test
echo "3 Deploy the load testing service to generate traffic during the canary analysis:"
k apply -k "https://github.com/fluxcd/flagger//kustomize/tester?ref=main"
k get pod,svc -n test
echo "......"
sleep 40s
echo "4 Create a canary custom resource:"
k apply -f resources_canary2/metrics-404.yaml
k apply -f resources_canary2/podinfo-canary.yaml
k get pod,svc -n test
echo "......"
sleep 120s
echo "#### III Automated canary promotion ####"
echo "1 Trigger a canary deployment by updating the container image:"
k -n test set image deployment/podinfo podinfod=stefanprodan/podinfo:3.1.1
echo "2 Flagger detects that the deployment revision changed and starts a new rollout:"
while true; do k -n test describe canary/podinfo; sleep 10s; done

Run the complete experiment script with the following command:

sh progressive_delivery/advanced_canary.sh

The results look like this:

#### I Bootstrap ####
1 Create a test namespace with Istio sidecar injection enabled:
namespace "test" deleted
namespace "test" deleted
namespace/test created
namespace/test created
namespace/test labeled
2 Create a deployment and a horizontal pod autoscaler:
deployment.apps/podinfo created
horizontalpodautoscaler.autoscaling/podinfo-total created
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
podinfo-total Deployment/podinfo <unknown>/10 (avg) 1 5 0 0s
3 Deploy the load testing service to generate traffic during the canary analysis:
service/flagger-loadtester created
deployment.apps/flagger-loadtester created
NAME READY STATUS RESTARTS AGE
pod/flagger-loadtester-76798b5f4c-ftlbn 0/2 Init:0/1 0 1s
pod/podinfo-689f645b78-65n9d 1/1 Running 0 28s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/flagger-loadtester ClusterIP 172.21.15.223 <none> 80/TCP 1s
......
4 Create a canary custom resource:
metrictemplate.flagger.app/not-found-percentage created
canary.flagger.app/podinfo created
NAME READY STATUS RESTARTS AGE
pod/flagger-loadtester-76798b5f4c-ftlbn 2/2 Running 0 41s
pod/podinfo-689f645b78-65n9d 1/1 Running 0 68s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/flagger-loadtester ClusterIP 172.21.15.223 <none> 80/TCP 41s
......
#### III Automated canary promotion ####
1 Trigger a canary deployment by updating the container image:
deployment.apps/podinfo image updated
2 Flagger detects that the deployment revision changed and starts a new rollout:
Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Warning Synced 10m flagger podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
 Normal Synced 9m23s (x2 over 10m) flagger all the metrics providers are available!
 Normal Synced 9m23s flagger Initialization done! podinfo.test
 Normal Synced 8m23s flagger New revision detected! Scaling up podinfo.test
 Normal Synced 7m23s flagger Starting canary analysis for podinfo.test
 Normal Synced 7m23s flagger Pre-rollout check acceptance-test passed
 Normal Synced 7m23s flagger Advance podinfo.test canary weight 10
 Normal Synced 6m23s flagger Advance podinfo.test canary weight 20
 Normal Synced 5m23s flagger Advance podinfo.test canary weight 30
 Normal Synced 4m23s flagger Advance podinfo.test canary weight 40
 Normal Synced 23s (x4 over 3m23s) flagger (combined from similar events): Promo
