polaris.checker: instances become unhealthy some time after deploying on k8s #1380

Open
huangpj0210 opened this issue Aug 27, 2024 · 8 comments
Labels
bug Something isn't working

Comments

huangpj0210 commented Aug 27, 2024

Describe the bug
[image]
k8s StatefulSet YAML file

kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: polaris
  namespace: basic
  labels:
    app: polaris
spec:
  replicas: 2
  selector:
    matchLabels:
      app: polaris
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: polaris
    spec:
      volumes:
        - name: polaris-server-config
          configMap:
            name: polaris-server-config
            defaultMode: 420
      containers:
        - name: polaris-server
          image: 'polarismesh/polaris-server:v1.18.1'
          resources:
            limits:
              cpu: '1'
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: polaris-server-config
              mountPath: /root/conf/polaris-server.yaml
              subPath: polaris-server.yaml
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  serviceName: polaris
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain

polaris configuration

# Server bootstrap configuration
bootstrap:
  # Global logging
  logger:
    config:
      rotateOutputPath: log/runtime/polaris-config.log
      errorRotateOutputPath: log/runtime/polaris-config-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
      - stdout
      errorOutputPaths:
      - stderr
    auth:
      rotateOutputPath: log/runtime/polaris-auth.log
      errorRotateOutputPath: log/runtime/polaris-auth-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    store:
      rotateOutputPath: log/runtime/polaris-store.log
      errorRotateOutputPath: log/runtime/polaris-store-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    cache:
      rotateOutputPath: log/runtime/polaris-cache.log
      errorRotateOutputPath: log/runtime/polaris-cache-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    naming:
      rotateOutputPath: log/runtime/polaris-naming.log
      errorRotateOutputPath: log/runtime/polaris-naming-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    healthcheck:
      rotateOutputPath: log/runtime/polaris-healthcheck.log
      errorRotateOutputPath: log/runtime/polaris-healthcheck-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    xdsv3:
      rotateOutputPath: log/runtime/polaris-xdsv3.log
      errorRotateOutputPath: log/runtime/polaris-xdsv3-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    apiserver:
      rotateOutputPath: log/runtime/polaris-apiserver.log
      errorRotateOutputPath: log/runtime/polaris-apiserver-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    discoverEventLocal:
      rotateOutputPath: log/event/polaris-discoverevent.log
      errorRotateOutputPath: log/event/polaris-discoverevent-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    discoverstat:
      rotateOutputPath: log/statis/polaris-discoverstat.log
      errorRotateOutputPath: log/statis/polaris-discoverstat-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    token-bucket:
      rotateOutputPath: log/runtime/polaris-ratelimit.log
      errorRotateOutputPath: log/runtime/polaris-ratelimit-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    local:
      rotateOutputPath: log/statis/polaris-statis.log
      errorRotateOutputPath: log/statis/polaris-statis-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    HistoryLogger:
      rotateOutputPath: log/operation/polaris-history.log
      errorRotateOutputPath: log/operation/polaris-history-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      rotationMaxDurationForHour: 24
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    default:
      rotateOutputPath: log/runtime/polaris-default.log
      errorRotateOutputPath: log/runtime/polaris-default-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    nacos-apiserver:
      rotateOutputPath: log/runtime/nacos-apiserver.log
      errorRotateOutputPath: log/runtime/nacos-apiserver-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 30
      rotationMaxAge: 7
      outputLevel: info
      compress: true
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
  # Start servers in order
  startInOrder:
    open: true # Whether to enable; disabled by default
    key: sz # Global lock
  # Register as a Polaris service
  polaris_service:
    probe_address: basic.mysql:3306
    enable_register: true
    isolated: false
    services:
      - name: polaris.checker
        protocols:
          - service-grpc
# apiserver configuration
apiservers:
  - name: service-eureka
    option:
      listenIP: "0.0.0.0"
      listenPort: 8761
      namespace: default
      owner: polaris
      refreshInterval: 10
      deltaExpireInterval: 60
      unhealthyExpireInterval: 180
      connLimit:
        openConnLimit: false
        maxConnPerHost: 1024
        maxConnLimit: 10240
        whiteList: 127.0.0.1
        purgeCounterInterval: 10s
        purgeCounterExpired: 5s
  - name: api-http # Protocol name, globally unique
    option:
      listenIP: "0.0.0.0"
      listenPort: 8090
      enablePprof: true # debug pprof
      enableSwagger: true # debug pprof
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 5120
        whiteList: 127.0.0.1
        purgeCounterInterval: 10s
        purgeCounterExpired: 5s
    api:
      admin:
        enable: true
      console:
        enable: true
        include: [default]
      client:
        enable: true
        include: [discover, register, healthcheck]
      config:
        enable: true
        include: [default]
  - name: service-grpc
    option:
      listenIP: "0.0.0.0"
      listenPort: 8091
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 5120
    api:
      client:
        enable: true
        include: [discover, register, healthcheck]
  - name: config-grpc
    option:
      listenIP: "0.0.0.0"
      listenPort: 8093
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 5120
    api:
      client:
        enable: true
  - name: xds-v3
    option:
      listenIP: "0.0.0.0"
      listenPort: 15010
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 10240
  - name: service-nacos
    option:
      listenIP: "0.0.0.0"
      listenPort: 8848
      # Map the default nacos namespace to the corresponding Polaris namespace
      defaultNamespace: default
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 10240
#  - name: service-l5
#    option:
#      listenIP: 0.0.0.0
#      listenPort: 7779
#      clusterName: cl5.discover
# Core logic configuration
auth:
  # auth's option has migrated to auth.user and auth.strategy
  # it's still available when filling auth.option, but you will receive warning log that auth.option has deprecated.
  user:
    name: defaultUser
    option:
      # Token encrypted SALT, you need to rely on this SALT to decrypt the information of the Token when analyzing the Token
      # The length of SALT needs to satisfy the following one:len(salt) in [16, 24, 32]
      salt: polarismesh2023
  strategy:
    name: defaultStrategy
    option:
      # Console auth switch, default true
      consoleOpen: true
      # Console Strict Model, default true
      consoleStrict: true
      # Customer auth switch, default false
      clientOpen: false
      # Customer Strict Model, default close
      clientStrict: false
namespace:
  autoCreate: true
naming:
  # Batch controllers
  batch:
    register:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
    deregister:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
    clientRegister:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
    clientDeregister:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
# Config center module startup configuration
config:
  # Whether to enable the config module
  open: true
# Health check configuration
healthcheck:
  open: true
  service: polaris.checker
  slotNum: 30
  minCheckInterval: 1s
  maxCheckInterval: 30s
  batch:
    heartbeat:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
  checkers:
    - name: heartbeatLeader
      option:
        soltNum: 128
# Maintain configuration
maintain:
  jobs:
    # Clean up long term unhealthy instance
    - name: DeleteUnHealthyInstance
      enable: false
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        instanceDeleteTimeout: 60m
    # Delete auto-created service without an instance
    - name: DeleteEmptyAutoCreatedService
      enable: false
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        serviceDeleteTimeout: 30m
    # Clean soft deleted instances
    - name: CleanDeletedInstances
      enable: true
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        # instanceCleanTimeout: 10m
    # Clean soft deleted clients
    - name: CleanDeletedClients
      enable: true
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        # clientCleanTimeout: 10m
# Storage configuration
store:
  # Standalone file storage plugin
  # name: boltdbStore
  # option:
  #   path: ./polaris.bolt
  # Database storage plugin
  name: defaultStore
  option:
    master:
       dbType: mysql
       dbName: polaris_server
       dbUser: polaris
       dbPwd: #password#
       dbAddr: basic.mysql:3306
       maxOpenConns: -1
       maxIdleConns: -1
       connMaxLifetime: 300 # in seconds
       txIsolationLevel: 2 #LevelReadCommitted
# Plugin configuration
plugin:
  history:
    entries:
      - name: HistoryLogger
  discoverEvent:
    entries:
      - name: discoverEventLocal
  statis:
    entries:
      - name: local
        option:
          interval: 60
      - name: prometheus
  ratelimit:
    name: token-bucket
    option:
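      remote-conf: false # Whether to use remote configuration
      ip-limit: # IP-level rate limiting, global
        open: false # Whether IP-level rate limiting is enabled system-wide
        global:
          open: false
          bucket: 300 # Peak burst
          rate: 200 # Average requests per second per IP
        resource-cache-amount: 1024 # Maximum number of cached IPs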
        white-list: [127.0.0.1]
      instance-limit:
        open: false
        global:
          bucket: 200
          rate: 100
        resource-cache-amount: 1024
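      api-limit: # API-level rate limiting
        open: false # Global switch for API rate limiting; limiting is active only when true. Disabled by default
        rules:
          - name: store-read
            limit:
              open: false # Global setting for the API; if an entry under apis does not configure it, the API is limited according to global
              bucket: 2000 # Maximum token bucket size
              rate: 1000 # Tokens generated per second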
          - name: store-write
            limit:
              open: false
              bucket: 1000
              rate: 500
        apis:
          - name: "POST:/v1/naming/services"
            rule: store-write
          - name: "PUT:/v1/naming/services"
            rule: store-write
          - name: "POST:/v1/naming/services/delete"
            rule: store-write
          - name: "GET:/v1/naming/services"
            rule: store-read
          - name: "GET:/v1/naming/services/count"
            rule: store-read
  crypto: # Configuration encryption
    entries:
      - name: AES

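Heartbeat check logs
polaris-healthcheck.log
polaris-healthcheck-error.log
To Reproduce
With the StatefulSet at replicas 2, one instance becomes unhealthy after running for a few days. With replicas 1 this problem has never occurred.

Expected behavior
The polaris instances stay healthy.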

Environment

  • Version: v1.18.1
  • OS: Alibaba Cloud Linux 3.2104 U9.1


huangpj0210 added the bug label Aug 27, 2024
@chuntaojun
Member

Doesn't it recover to a healthy state automatically?

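@huangpj0210
Author

[image]
No. The pod has to be deleted and recreated before it becomes healthy again, and after a while it turns unhealthy again on its own. I checked polaris-server.yaml and it has no health check configuration. However, my pod sometimes restarts automatically and then recovers. I checked the k8s events for the polaris pod and for the StatefulSet, and both event lists are empty.
[image]
[image]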

@chuntaojun
Member

Understood. I'll check this locally on my side.

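@huangpj0210
Author

Hi, I found something: the memory usage of the unhealthy instance is extremely high. My pod's memory limit is 2G, so the unhealthy instance probably hit the limit and the OOM killer triggered the automatic restart.
[image]
[image]
[image]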
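
One way to confirm the OOM-kill theory without relying on events is to read the container's last termination state from the pod object. The following is a minimal sketch, not something posted in this thread: it assumes `kubectl` is on the PATH and pointed at the right cluster, and the helper names `fetch_pod` and `oom_killed_containers` are made up for illustration. It relies only on the standard Kubernetes pod status fields `status.containerStatuses[].lastState.terminated.reason`.

```python
import json
import subprocess
from typing import List


def fetch_pod(name: str, namespace: str) -> dict:
    """Return the pod object as parsed JSON (assumes kubectl is configured)."""
    out = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


def oom_killed_containers(pod: dict) -> List[str]:
    """Names of containers whose previous run ended with reason OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty if the container has never been restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names
```

With the StatefulSet above, the pods would normally be named `polaris-0` and `polaris-1` in namespace `basic`, so `oom_killed_containers(fetch_pod("polaris-0", "basic"))` returning `["polaris-server"]` would be consistent with the restart having been caused by the memory limit.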

@chuntaojun
Member

You can access 8090/debug/pprof to capture a memory profile, then post it here in the issue.
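
For anyone else who needs to capture the same profile, here is a minimal sketch of downloading it with the Python standard library. It assumes the server exposes the standard Go `net/http/pprof` handlers under the `8090/debug/pprof` index mentioned above (the posted config has `enablePprof: true` on the `api-http` port 8090), so that the heap profile is at `/debug/pprof/heap`. The function names and the default file name are illustrative only.

```python
import urllib.request


def heap_profile_url(host: str, port: int = 8090) -> str:
    """Build the heap-profile URL, assuming the standard Go pprof path layout."""
    return f"http://{host}:{port}/debug/pprof/heap"


def save_heap_profile(
    host: str,
    dest: str = "heap.pprof",
    port: int = 8090,
    timeout: float = 30.0,
) -> int:
    """Download the heap profile to dest and return the number of bytes written."""
    with urllib.request.urlopen(heap_profile_url(host, port), timeout=timeout) as resp:
        data = resp.read()
    with open(dest, "wb") as f:
        f.write(data)
    return len(data)
```

Calling `save_heap_profile("<pod-ip>")` from a machine that can reach the pod network writes `heap.pprof` to the current directory. The file can then be zipped and attached to the issue.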

@huangpj0210
Author

Hi, this is the heap.pprof file I just exported from the unhealthy polaris instance 10.233.111.79. Please take a look.
heap.pprof.zip

@flashbai

Hi, has there been any conclusion on this issue? I have run into the same problem and would like to ask about it.

@chuntaojun
Member

There are indeed a few fairly subtle cases that can lead to a memory OOM.
