cortex up times out when using region us-west-2 #2430

Open
bddap opened this issue Feb 23, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@bddap

bddap commented Feb 23, 2022

Version

cortex version
cli version: 0.42.0

Description

cortex up fails with "timeout has occurred when validating your cortex cluster". This happens consistently.

The failure only occurs with region: us-west-2. When region is set to us-east-2, cortex up succeeds.

Configuration

cortex up fails when using this cluster.yml:

cluster_name: this-config-fails
region: us-west-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false

while this cluster.yml succeeds:

cluster_name: this-config-works
region: us-east-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false
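
For reference, the two configs above differ only in cluster_name (which is just a label) and region. Diffing them side by side (the filenames below are hypothetical):

$ diff this-config-works.yml this-config-fails.yml
1,2c1,2
< cluster_name: this-config-works
< region: us-east-2
---
> cluster_name: this-config-fails
> region: us-west-2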

Steps to reproduce

  1. Run cortex cluster up on the cluster.yml specified above, using us-west-2 as the region (see the exact invocation below).
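
For concreteness, a minimal reproduction looks like this (the config file path is a placeholder; the contents are the failing config above):

# save the failing config above as cluster.yml, then:
cortex cluster up cluster.yml

# provisioning proceeds for ~30 minutes, then the CLI hangs at
# "waiting for load balancers" and eventually prints:
#   timeout has occurred when validating your cortex cluster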

Expected behavior

cortex up completes successfully, just as it does when region is us-east-2.

Actual behavior

cortex up exits with a nonzero status and reports the timeout failure shown below.

Stack traces

failure trace
cortex cluster up ./<MASKED>/cluster.yaml
using aws credentials with access key <MASKED>

verifying your configuration ...

aws resource                            cost per hour
1 eks cluster                           $0.10
nodegroup tmp: 1-5 m5.large instances   $0.102 each
2 t3.medium instances (cortex system)   $0.088 total
1 t3.medium instance (prometheus)       $0.05
2 network load balancers                $0.045 total

your cluster will cost $0.38 - $0.79 per hour based on cluster size

cortex will also create an s3 bucket (this-config-fails-36f0f6ff) and a cloudwatch log group (this-config-fails)

would you like to continue? (y/n): y

○ creating a new s3 bucket: this-config-fails-36f0f6ff ✓
○ creating a new cloudwatch log group: this-config-fails ✓
○ spinning up the cluster (this will take about 30 minutes) ...

2022-02-23 19:01:51 [ℹ]  eksctl version 0.67.0
2022-02-23 19:01:51 [ℹ]  using region us-west-2
2022-02-23 19:01:51 [ℹ]  subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
2022-02-23 19:01:51 [ℹ]  subnets for us-west-2b - public:192.168.32.0/19 private:192.168.128.0/19
2022-02-23 19:01:51 [ℹ]  subnets for us-west-2c - public:192.168.64.0/19 private:192.168.160.0/19
2022-02-23 19:01:51 [!]  Custom AMI detected for nodegroup cx-operator. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ]  nodegroup "cx-operator" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!]  Custom AMI detected for nodegroup cx-prometheus. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ]  nodegroup "cx-prometheus" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!]  Custom AMI detected for nodegroup cx-wd-tmp. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ]  nodegroup "cx-wd-tmp" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [ℹ]  using Kubernetes version 1.21
2022-02-23 19:01:51 [ℹ]  creating EKS cluster "this-config-fails" in "us-west-2" region with un-managed nodes
2022-02-23 19:01:51 [ℹ]  3 nodegroups (cx-operator, cx-prometheus, cx-wd-tmp) were included (based on the include/exclude rules)
2022-02-23 19:01:51 [ℹ]  will create a CloudFormation stack for cluster itself and 3 nodegroup stack(s)
2022-02-23 19:01:51 [ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
2022-02-23 19:01:51 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ]  CloudWatch logging will not be enabled for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ]  2 sequential tasks: { create cluster control plane "this-config-fails", 3 sequential sub-tasks: { 2 sequential sub-tasks: { wait for control plane to become ready, tag cluster }, 1 task: { create addons }, 3 parallel sub-tasks: { create nodegroup "cx-operator", create nodegroup "cx-prometheus", create nodegroup "cx-wd-tmp" } } }
2022-02-23 19:01:51 [ℹ]  building cluster stack "eksctl-this-config-fails-cluster"
2022-02-23 19:01:52 [ℹ]  deploying stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:22 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:03:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:04:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:05:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:06:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:07:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:08:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:09:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:10:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:11:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:12:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:13:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:14:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:16:54 [✔]  tagged EKS cluster (cortex.dev/cluster-name=this-config-fails)
2022-02-23 19:18:55 [!]  OIDC is disabled but policies are required/specified for this addon. Users are responsible for attaching the policies to all nodegroup roles
2022-02-23 19:18:55 [ℹ]  creating addon
2022-02-23 19:23:25 [ℹ]  addon "vpc-cni" active
2022-02-23 19:23:25 [ℹ]  building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:25 [!]  Custom AMI detected for nodegroup cx-wd-tmp, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ]  building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:25 [!]  Custom AMI detected for nodegroup cx-operator, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ]  building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:25 [!]  Custom AMI detected for nodegroup cx-prometheus, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:26 [ℹ]  deploying stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ]  deploying stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ]  deploying stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:26 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:42 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:57 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:59 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:02 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:16 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:16 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:20 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:33 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:36 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:40 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:51 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:55 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:10 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:11 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:11 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:25 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:28 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:30 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:50 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:00 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:00 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:10 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:17 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:20 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:25 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:33 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:40 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:41 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:51 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:56 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:57 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:57 [ℹ]  waiting for the control plane availability...
2022-02-23 19:26:57 [✔]  saved kubeconfig as "/root/.kube/config"
2022-02-23 19:26:57 [ℹ]  1 task: { suspend ASG processes for nodegroup cx-wd-tmp }
2022-02-23 19:26:58 [ℹ]  suspended ASG processes [AZRebalance] for cx-wd-tmp
2022-02-23 19:26:58 [✔]  all EKS cluster resources for "this-config-fails" have been created
2022-02-23 19:26:58 [ℹ]  adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-OG2YBA75HYPE" to auth ConfigMap
2022-02-23 19:26:58 [ℹ]  nodegroup "cx-operator" has 0 node(s)
2022-02-23 19:26:58 [ℹ]  waiting for at least 2 node(s) to become ready in "cx-operator"
2022-02-23 19:27:30 [ℹ]  nodegroup "cx-operator" has 2 node(s)
2022-02-23 19:27:30 [ℹ]  node "ip-192-168-20-129.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ]  node "ip-192-168-88-85.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ]  adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-75KAP1SXG8SQ" to auth ConfigMap
2022-02-23 19:27:30 [ℹ]  nodegroup "cx-prometheus" has 0 node(s)
2022-02-23 19:27:30 [ℹ]  waiting for at least 1 node(s) to become ready in "cx-prometheus"
2022-02-23 19:28:32 [ℹ]  nodegroup "cx-prometheus" has 1 node(s)
2022-02-23 19:28:32 [ℹ]  node "ip-192-168-54-32.us-west-2.compute.internal" is ready
2022-02-23 19:28:32 [ℹ]  adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-JLK3EF72JQAV" to auth ConfigMap
2022-02-23 19:28:32 [ℹ]  nodegroup "cx-wd-tmp" has 0 node(s)
2022-02-23 19:28:32 [ℹ]  waiting for at least 1 node(s) to become ready in "cx-wd-tmp"
2022-02-23 19:31:00 [ℹ]  nodegroup "cx-wd-tmp" has 1 node(s)
2022-02-23 19:31:00 [ℹ]  node "ip-192-168-72-237.us-west-2.compute.internal" is ready
2022-02-23 19:33:01 [ℹ]  kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2022-02-23 19:33:01 [✔]  EKS cluster "this-config-fails" in "us-west-2" region is ready

○ updating cluster configuration ✓
○ configuring networking (this will take a few minutes) ✓
○ configuring autoscaling ✓
○ configuring async gateway ✓
○ configuring logging ✓
○ configuring metrics ✓
○ configuring gpu support (for nodegroups that may require it) ✓
○ configuring inf support (for nodegroups that may require it) ✓
○ starting operator ✓
○ starting controller manager ✓
○ waiting for load balancers .............................................................................................................................................................................................................................................................................................................................................

timeout has occurred when validating your cortex cluster

debugging info:
operator pod name: pod/operator-controller-manager-6f8bb85b96-clqxf
operator pod is ready: true
operator endpoint: <MASKED>.elb.us-west-2.amazonaws.com
operator curl response:
{}

additional networking events:
LAST SEEN   TYPE     REASON                 OBJECT                            MESSAGE
30m         Normal   EnsuringLoadBalancer   service/ingressgateway-apis       Ensuring load balancer
30m         Normal   EnsuredLoadBalancer    service/ingressgateway-apis       Ensured load balancer
30m         Normal   EnsuringLoadBalancer   service/ingressgateway-operator   Ensuring load balancer
30m         Normal   EnsuredLoadBalancer    service/ingressgateway-operator   Ensured load balancer
30m         Normal    Scheduled   pod/ingressgateway-apis-69465f9956-gzxtf       Successfully assigned istio-system/ingressgateway-apis-69465f9956-gzxtf to ip-192-168-20-129.us-west-2.compute.internal
30m         Normal    Pulling     pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m         Normal    Pulling     pod/ingressgateway-apis-69465f9956-gzxtf       Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m         Normal    Pulled      pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 4.987000991s
30m         Normal    Created     pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Created container istio-proxy
30m         Normal    Started     pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Started container istio-proxy
30m         Normal    Pulled      pod/ingressgateway-apis-69465f9956-gzxtf       Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 6.764940388s
30m         Normal    Started     pod/ingressgateway-apis-69465f9956-gzxtf       Started container istio-proxy
30m         Normal    Created     pod/ingressgateway-apis-69465f9956-gzxtf       Created container istio-proxy
30m         Warning   Unhealthy   pod/ingressgateway-apis-69465f9956-gzxtf       Readiness probe failed: Get "http://192.168.3.27:15021/healthz/ready": dial tcp 192.168.3.27:15021: connect: connection refused


please run `cortex cluster down` to delete the cluster before trying to create this cluster again
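
Since the timeout happens while waiting on the two network load balancers, one way to narrow this down independently of cortex (standard AWS CLI commands, not anything from the cortex docs) is to inspect the NLBs and their target health directly:

# list load balancers and their state in the failing region
aws elbv2 describe-load-balancers --region us-west-2 \
  --query 'LoadBalancers[].{name:LoadBalancerName,state:State.Code,dns:DNSName}'

# then, using a load balancer ARN from the full output above,
# check whether the registered targets are passing health checks
aws elbv2 describe-target-groups --region us-west-2 \
  --load-balancer-arn <load-balancer-arn>
aws elbv2 describe-target-health --region us-west-2 \
  --target-group-arn <target-group-arn>

If the targets show as unhealthy there, that would point at the istio-proxy readiness failure visible in the events above (connection refused on port 15021) rather than at the load balancers themselves.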

Additional context

I've only tested us-west-2 and us-east-2 so far. I've repeated the experiment a number of times. I see consistent failure when region is us-west-2 and consistent success when region is us-east-2.

A search in the Slack channel for "timeout has occurred when validating your cortex cluster" shows that this issue is fairly common; I see four or five reports of it in the last year.

us-west-2 is my default region.

@bddap bddap added the bug Something isn't working label Feb 23, 2022
@bddap
Author

bddap commented Feb 23, 2022

function validate_cortex() {
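
(The line above quotes the opening of the validate_cortex shell function, presumably the routine that prints the trailing dots and the timeout message during cortex cluster up. As a rough illustration only — the endpoint path, poll interval, and timeout below are assumptions, not the actual cortex source:)

# illustrative sketch, not the real validate_cortex
validate_cortex_sketch() {
  local endpoint="$1"                  # operator load balancer hostname
  local deadline=$((SECONDS + 1800))   # assumed 30-minute budget
  while ((SECONDS < deadline)); do
    # succeed once the operator responds through the load balancer
    if curl --silent --fail --max-time 5 "http://${endpoint}/" >/dev/null; then
      return 0
    fi
    printf '.'
    sleep 5
  done
  echo "timeout has occurred when validating your cortex cluster"
  return 1
}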

@deliahu
Member

deliahu commented Feb 26, 2022

I just tried creating a new cluster with the cluster configuration you provided, and it worked for me in us-west-2. I ran this from the master branch, but there have not been any changes that should affect the cluster creation process since the v0.42.0 release. Do you mind trying again?

@bddap
Author

bddap commented Jul 20, 2022

Tried again. Same results.
