
Splunk pod crashloops at the "Set general pass4SymmKey" task in Ansible v1.0.1 #401


Closed
cicsyd opened this issue Jul 6, 2021 · 2 comments

Comments

@cicsyd

cicsyd commented Jul 6, 2021

Hi,

We're having trouble deploying Splunk in our EKS environment.
Each time we create a Splunk cluster master, or even a standalone, the startup of the pod fails at this stage:

TASK [splunk_common : Set general pass4SymmKey] ********************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: PermissionError: [Errno 13] Permission denied: b'/opt/splunk/etc/system/local/.ansible_tmpyks780qsserver.conf'
fatal: [localhost]: FAILED! => {
"changed": false
}

MSG:

Failed to replace file: /home/splunk/.ansible/tmp/ansible-moduletmp-1625547034.5380409-rfekiuy4/tmpf16i6leu to /opt/splunk/etc/system/local/server.conf: [Errno 13] Permission denied: b'/opt/splunk/etc/system/local/.ansible_tmpyks780qsserver.conf'

Currently using EKS 1.20 and have also tried EKS 1.18.
Persistent storage is on gp3, but we also tested gp2 with the same results.

Worker nodes are AWS Bottlerocket, but we also tried Amazon Linux 2.

What could we possibly be missing?
We've confirmed that the PVC and secrets mount correctly, and we could not find any other error logs in either CloudWatch or the operator container.
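
In case it's useful for reproducing, a rough way to check ownership and writability of the target directory from inside the pod (the pod name below is a placeholder for the actual Splunk pod):

kubectl exec -it splunk-standalone-0 -- id
kubectl exec -it splunk-standalone-0 -- ls -ld /opt/splunk/etc/system/local
kubectl exec -it splunk-standalone-0 -- touch /opt/splunk/etc/system/local/.write_test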

@akondur
Collaborator

akondur commented Jul 9, 2021

Hi @cicsyd, thanks for reporting the bug. To investigate the issue further, could you please provide the following information:

  1. The container logs for the operator pod, via the command "kubectl logs <name_of_operator_pod>"
  2. The complete container logs for the Splunk pod whose container is not starting, via the command "kubectl logs <name_of_splunk_pod>"
  3. Details of the worker node's operating system, i.e. the Amazon OS version used (AMI details), via the command "kubectl cluster-info dump"
  4. A copy of /etc/docker/daemon.json or /var/snap/docker/current/config/daemon.json, depending on how Docker was installed
  5. Docker-related information via the command "docker info" (a consolidated sketch of items 4 and 5 follows this list)
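
For items 4 and 5, something along these lines should capture the relevant details; names in angle brackets are placeholders, the daemon.json path depends on how Docker was installed, and the docker commands need to run on the worker node itself:

docker info --format '{{.Driver}}'    # prints the storage driver in use, e.g. overlay2 or aufs
cat /etc/docker/daemon.json           # or /var/snap/docker/current/config/daemon.json
kubectl logs <name_of_operator_pod>
kubectl logs <name_of_splunk_pod>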
     
On a local EKS worker node with the following details:
"kernelVersion": "4.14.214-160.339.amzn2.x86_64",
"osImage": "Amazon Linux 2",
"containerRuntimeVersion": "docker://19.3.6",
"kubeletVersion": "v1.17.12-eks-7684af",
"kubeProxyVersion": "v1.17.12-eks-7684af",
"operatingSystem": "linux",
"architecture": "amd64"

 
I was able to deploy both Splunk Standalone and Clustered deployments successfully.
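 
For comparison, the same nodeInfo fields can be pulled from your own cluster with something like:

kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo}'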
 
Alternatively, it could be an issue with the storage driver Docker is using underneath. If Docker's default storage driver is aufs, the introduction of ACLs in recent images causes permission issues; see the related GitHub issues on docker-splunk (splunk/docker-splunk#105, splunk/docker-splunk#96).
 
If the underlying storage driver is aufs, could you try switching it to overlay as per this comment (splunk/docker-splunk#105 (comment))?
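 
For illustration only, and assuming Docker reads its config from the default /etc/docker/daemon.json, the change would look roughly like this (overlay2 is the usual choice on current kernels; the linked comment uses overlay):

{
  "storage-driver": "overlay2"
}

followed by a restart of the Docker daemon (e.g. sudo systemctl restart docker) and a re-check with "docker info" that the driver has changed. Note that switching storage drivers means previously pulled images will need to be re-pulled.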
 
Upgrading to the latest Docker image has also fixed some permission issues in the past.
 
Looking forward to hearing from you.

@cicsyd
Author

cicsyd commented Aug 30, 2021

Heya,

It turns out the cause was AWS Bottlerocket, which mounts a read-only filesystem. I suspect that when the Splunk container gets created it runs an install step and tries to write to that filesystem, which fails because the mount is read-only.

We ended up getting it working while doing isolation testing with Amazon Linux 2 (AL2).
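
For anyone hitting the same symptom, one quick way to confirm which OS image the worker nodes are running:

kubectl get nodes -o wide

The OS-IMAGE column distinguishes Bottlerocket from Amazon Linux 2.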

@cicsyd cicsyd closed this as completed Aug 30, 2021