本文主要介绍使用load-watcher + scheduler-plugins + descheduler 配合实现k3s根据资源占用平衡调度。

k3s 调度问题

k8s 默认调度器不会更具节点的实际负载进行调度,只会根据request中申请的资源来。所以会导致多数的部署都是往一台比较强劲的节点上,或者新加入的节点不会有任何的pod分配过来。

metrics-server

使用 k8s 默认的metrics-server 来获取节点资源使用情况

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

load-watcher

trimaran 打分时提供当前node的资源使用情况数据。 上游项目地址: https://github.com/paypal/load-watcher 镜像: https://github.com/tsic404/docker-images/pkgs/container/load-watcher

为了 load-watcher 方便读取集群各个节点的资源使用信息,将kubeconfig作为configmap传递到容器中,并指定 KUBE_CONFIG 环境变量。

--- # kube-config configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-config
  labels:
    app: load-watcher
  namespace: loadwatcher
immutable: true
data:
  kube-config: |
    xxxxx    
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-watcher-deployment
  namespace: loadwatcher
  labels:
    app: load-watcher
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-watcher
  template:
    metadata:
      labels:
        app: load-watcher
    spec:
      containers:
      - name: load-watcher
        image: ghcr.io/tsic404/load-watcher:latest
        env:
         - name: KUBE_CONFIG
           value: /kube-config
        ports:
        - containerPort: 2020
        volumeMounts:
          - name: kube-config
            mountPath: /kube-config
            subPath: kube-config
      volumes:
        - name: kube-config
          configMap:
            name: kube-config
            items:
              - key: kube-config
                path: kube-config
---
apiVersion: v1
kind: Service
metadata:
  namespace: loadwatcher
  name: load-watcher
  labels:
    app: load-watcher
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 2020
    targetPort: 2020
    protocol: TCP
  selector:
    app: load-watcher

scheduler-plugins

采用 helm install as a second scheduler 使用 trimaran 插件来根据当前 node 的实际资源占用情况来进行调度。 values.yaml

scheduler:
  name: trimaran
  image: registry.k8s.io/scheduler-plugins/kube-scheduler:v0.26.7
  replicaCount: 1
  leaderElect: false

controller:
  name: scheduler-plugins-controller
  image: registry.k8s.io/scheduler-plugins/controller:v0.26.7
  replicaCount: 1

plugins:
  enabled: ["TargetLoadPacking"]
  disabled: ["NodeResourcesBalancedAllocation", "NodeResourcesLeastAllocated"]

pluginConfig:
- name: TargetLoadPacking
  args:
    watcherAddress: http://load-watcher.loadwatcher.svc.cluster.local:2020
helm install scheduler-plugins as-a-second-scheduler/ --create-namespace --namespace scheduler-plugins -f values.yaml

descheduler

k8s 中pod一旦绑定了节点就不会在进行调度,但是pod运行过程中,pod使用的资源可能会不断变化,这样分配时的平衡就会被打破。所以引入重平衡工具descheduler,让其找到可以可以移除的pod并驱逐他们,重新触发k8s的调度。

helm install descheduler --namespace kube-system descheduler/descheduler --set kind=Deployment

depolyment

由于采用的是将 trimaran 安装成第二调度器,所以需要在 deployment 中显示指明调度器名称

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      schedulerName: trimaran