
NATS Cluster unstable in Redhat OpenShift with Alpine Image but behaving stable in VM [v2.10.19] #5881

Open
mohamedsaleem18 opened this issue Sep 11, 2024 · 8 comments

@mohamedsaleem18

mohamedsaleem18 commented Sep 11, 2024

Observed behavior

Issue Description:

We are experiencing instability with a NATS 2.10.19 (JetStream enabled) cluster deployed in a Red Hat OpenShift environment, even though the same setup (also 2.10.19) works normally in a VM environment. Multitenancy is enabled in both the VM and OpenShift environments.

Issues observed in OpenShift (NATS cluster):

  1. Intermittent Message Loss: Subscribers intermittently fail to receive all messages. This was also observed with Alpine images 2.10.11 and 2.10.18 in the Red Hat OpenShift cluster.
  2. Responder Errors: Frequent "responder not available" errors occur while publishing messages to the NATS cluster. Out of 25 published messages, approximately 5 to 8 fail with this error. Testing was conducted with only one user.
  3. Connection Issues: The NATS cluster stops accepting connections from subscribers if it runs continuously for more than a week; restarting the cluster resolves the issue. This problem was present in versions 2.10.11 and 2.10.18; version 2.10.19 has not yet been tested for a one-week duration. Versions 2.10.11 and 2.10.18 were also tested with dedicated NFS storage in addition to container storage.
  4. Replica Synchronization Issues: NATS replicas occasionally go out of sync with the cluster and do not heal automatically; restarting the affected replica is required to re-sync it with the cluster. As with issue 3, this was observed with versions 2.10.11 and 2.10.18 (also on NFS storage), and 2.10.19 has not yet been tested for a one-week duration. (A diagnostic sketch follows this list.)
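When these symptoms appear, a first diagnostic step is to inspect stream, consumer, and Raft health from the nats CLI. A minimal sketch, assuming credentials like those shown later in this thread (the server report command needs a $SYS account user, and the stream/consumer names are placeholders):

    # Cluster-wide JetStream and Raft health, per server (requires a $SYS user):
    nats --user admin --password xx server report jetstream
    # Replica state for every stream visible to the application account:
    nats --user platformadmin --password xx stream report
    # Pending and redelivery counters for one consumer:
    nats --user platformadmin --password xx consumer info <stream> <consumer>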

OpenShift details:

NATS Image: nats:2.10.19-alpine
NATS Cluster Size: 3 replicas
Storage Type: Container storage (i.e., data inside container)
Deployment Type: StatefulSet
Red Hat OpenShift cluster version: Server Version 4.16.7 (local installation with one VM). Note: NATS versions 2.10.11 and 2.10.18 were tested with Red Hat OpenShift Server Version 4.14.20 with multiple VM nodes.

NATS server configuration:

data:
  nats-server.conf: |
    server_name: $POD_NAME
    listen: 0.0.0.0:4222
    http: 0.0.0.0:8222
    accounts: {
      # account details removed
    }
    # Clustering definition
    cluster {
      name: "nats-21019"
      listen: 0.0.0.0:6222
      pool_size: 16
      ping_max: 4
      ping_interval: 30s
      routes = [
        nats://nats-21019-headless:6222   # Headless Service configured in OpenShift
      ]
    }
    # JetStream configuration
    jetstream: {
      store_dir: "/data"
      max_memory_store: 1Gi
      max_file_store: 2Gi
    }

VM details:

NATS version: 2.10.19
NATS Cluster size: 3
Storage type: Local filesystem storage on VM.
VM Operating System: Oracle Linux Server 8.10
VM CPE OS Name: cpe:/o:oracle:linux:8:10:server
VM Kernel: Linux 5.4.17-2136.334.6.1.el8uek.x86_64
VM Architecture: x86-64

NATS server config:

server_name: "xxxx_cluster_node_1"
cluster {
        host: 0.0.0.0
        port: 7222
        name: "xxxx_cluster"
        pool_size: 16
        ping_max: 4
        ping_interval: 30s
        # cluster_advertise: "127.0.0.1:7222"
        authorization {
                user: "nats"
                password: "xxxxx"
                timeout: 0.5
        }
        routes = [
                "nats://nats:xxxx@nats-2:7222"
                "nats://nats:xxx@nats-3:7222"
        ]
}
jetstream {
    # Jetstream storage location, limits and encryption
    store_dir: "/root/nats/data"
    domain: xxx_product
}
accounts: {
  # account details removed
}

Expected behavior

NATS 2.10.19 should be stable in a Red Hat OpenShift environment.

Server and client version

nats CLI version: 0.1.5
nats-server: v2.10.19

Host environment

Please refer to the Issue Description section above for complete details.

Steps to reproduce

  1. Perform a bench test with multiple publishers and subscribers in pull mode, with JetStream enabled.
  2. Create a stream with a subject and one consumer for that stream.
  3. Publish messages into the subject.
  4. Consume the messages using the consumer created in step 2. (A minimal CLI sketch of steps 2–4 follows this list.)
  5. If steps 3 and 4 are repeated for a day, the issues listed in the description section will be observed.
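A minimal sketch of steps 2–4 with the nats CLI (the stream and consumer names here are illustrative; the originals were not given):

    # Step 2: create a replicated stream and one pull consumer for it.
    nats stream add TEST --subjects "test.*" --storage file --replicas 3 --defaults
    nats consumer add TEST PULLER --pull --defaults

    # Step 3: publish messages into the subject.
    nats pub test.1 "hello"

    # Step 4: pull the messages with the consumer created in step 2.
    nats consumer next TEST PULLER --count 1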
@mohamedsaleem18 mohamedsaleem18 added the defect (Suspected defect such as a bug or regression) label Sep 11, 2024
@wallyqs
Member

wallyqs commented Sep 11, 2024

One of the main problems is going to be that the routes are defined using the Service name instead of the A records available for the StatefulSet:

      routes = [
        nats://nats-21019-headless:6222   # Headless service configured in openshift
      ]

The above should instead be something like:

      routes = [
        nats://nats-0.nats-21019-headless:6222
        nats://nats-1.nats-21019-headless:6222
        nats://nats-2.nats-21019-headless:6222
      ]

@wallyqs
Member

wallyqs commented Sep 11, 2024

Also, deploying JetStream on NFS volumes is not recommended.

@mohamedsaleem18
Author

Thank you for your response. The NFS volumes were replaced with container storage. The routes were defined by the Helm chart (bitnami/nats.io); we can enable the additional routes as mentioned above. The consumer being completely unable to connect to NATS is, I assume, not related to routes, and the frequent "responder not available" errors are also not related to routes. Can you please help us resolve those errors?

@ripienaar
Contributor

Routes are how the nodes communicate with each other. No cluster comms, no working streams and consumers.

Fix those first.

@mohamedsaleem18
Author

I have attached my statefulset.yaml file. This is just for testing purposes in my local OpenShift cluster. We were using the official Helm charts from Bitnami and nats.io for 2.10.11 and 2.10.18.

Can you please review it and let me know if any corrections should be made in addition to the routes? That would be helpful.

nats-statefulset.txt

@wallyqs
Member

wallyqs commented Sep 11, 2024

@mohamedsaleem18 the statefulset has the issue with the cluster routes not being explicit; I think this setting from the bitnami chart is wrong :/ The one we maintain at nats-io/k8s would include the right routes. If you do not set the routes as I mentioned, there will be partitions on restarts and the cluster will not work well.

@wallyqs wallyqs added the k8s label and removed the defect (Suspected defect such as a bug or regression) label Sep 11, 2024
@mohamedsaleem18
Author

I have set up the cluster with the server configuration below (routes as recommended) and started testing. I will keep you updated. Thank you for your support.
server_name: $POD_NAME
listen: 0.0.0.0:4222
http: 0.0.0.0:8222

accounts: {
  $SYS: {
    users: [{"password":"xx","user":"admin"},{"password":"xx","user":"xxadmin"}]
  }
  platform: {
    "jetstream": enabled
    users: [{"password":"xx","permissions":{"publish":">","subscribe":">"},"user":"platformadmin"}]
  }
}

# Clustering definition
cluster {
  name: "nats-21019"
  listen: 0.0.0.0:6222
  pool_size: 3
  ping_max: 4
  ping_interval: 30s

  # Authorization for cluster connections.
  # Routes are actively solicited and connected to from this server.
  # Other servers can connect to us if they supply the correct credentials
  # in their routes definitions from above.
  routes = [
    nats://nats-21019-0.nats-21019-headless.nats-test.svc.cluster.local:6222
    nats://nats-21019-1.nats-21019-headless.nats-test.svc.cluster.local:6222
    nats://nats-21019-2.nats-21019-headless.nats-test.svc.cluster.local:6222
  ]
}

# JetStream configuration
jetstream: {
  store_dir: "/data"
  max_memory_store: 1Gi
  max_file_store: 2Gi
}
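Once the pods restart with this configuration, whether the full route mesh actually formed can be verified against the monitoring port configured above; each server should list the other two servers under the /routez endpoint (a sketch using busybox wget from inside a pod):

    oc exec nats-21019-0 -- wget -qO- http://127.0.0.1:8222/routez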

@mohamedsaleem18
Author

I ran a bench test deploying a 3-node cluster (container storage) in the Red Hat OpenShift cluster. I encountered the same issue after configuring the routes as per the recommendation: subscribers got stuck without pulling messages. I ran the test on my laptop (so no network issue). Please refer to the screenshot below and the NATS server logs attached.

nats-21019-2-nats-21019.log
nats-21019-1-nats-21019.log
nats-21019-0-nats-21019.log

nats -s nats://127.0.0.1:31422,nats://127.0.0.1:31421,nats://127.0.0.1:31420 --user platformadmin --password xxxx bench bar --js --pub 10 --sub 20 --size 16 --replicas 3 --msgs 100000 --pull --purge --pubsleep 2s
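(Per the nats bench flags, this runs 10 publishers and 20 pull subscribers against a 3-replica JetStream stream named bar, purging the stream first and publishing 100,000 16-byte messages with a 2s sleep between publishes.)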

[screenshot: nats bench output]

@wallyqs wallyqs changed the title NATS Cluster unstable in Redhat OpenShift with Alpine Image 2.10.19 but behaving stable in VM. NATS Cluster unstable in Redhat OpenShift with Alpine Image but behaving stable in VM [v2.10.19] Sep 13, 2024
@github-actions github-actions bot added the stale (This issue has had no activity in a while) label Nov 9, 2024