cortex up times out when using region us-west-2 #2430

Open
bddap opened this issue Feb 23, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@bddap

bddap commented Feb 23, 2022

Version

cortex version
cli version: 0.42.0

Description

cortex up fails with "timeout has occurred when validating your cortex cluster". This happens consistently.

The failure only occurs with region: us-west-2. When region is set to us-east-2, cortex up succeeds.

Configuration

cortex up fails when using this cluster.yml:

cluster_name: this-config-fails
region: us-west-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false

while this cluster.yml succeeds:

cluster_name: this-config-works
region: us-east-2
node_groups:
  - name: tmp
    instance_type: m5.large
    min_instances: 1
    max_instances: 5
    spot: false
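
For reference, the two configs above differ only in cluster_name (which is just a label) and region. Diffing them side by side (the filenames below are hypothetical):

$ diff this-config-works.yml this-config-fails.yml
1,2c1,2
< cluster_name: this-config-works
< region: us-east-2
---
> cluster_name: this-config-fails
> region: us-west-2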

Steps to reproduce

  1. Run cortex cluster up on the cluster.yml specified above, using us-west-2 as the region (see the exact invocation below).
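
For concreteness, a minimal reproduction looks like this (the config file path is a placeholder; the contents are the failing config above):

# save the failing config above as cluster.yml, then:
cortex cluster up cluster.yml

# provisioning proceeds for ~30 minutes, then the CLI hangs at
# "waiting for load balancers" and eventually prints:
#   timeout has occurred when validating your cortex cluster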

Expected behavior

cortex up completes successfully, just as it does when region is us-east-2.

Actual behavior

cortex up exits with a nonzero status and reports the timeout failure shown below.

Stack traces

failure trace
cortex cluster up ./<MASKED>/cluster.yaml
using aws credentials with access key <MASKED>

verifying your configuration ...

aws resource                            cost per hour
1 eks cluster                           $0.10
nodegroup tmp: 1-5 m5.large instances   $0.102 each
2 t3.medium instances (cortex system)   $0.088 total
1 t3.medium instance (prometheus)       $0.05
2 network load balancers                $0.045 total

your cluster will cost $0.38 - $0.79 per hour based on cluster size

cortex will also create an s3 bucket (this-config-fails-36f0f6ff) and a cloudwatch log group (this-config-fails)

would you like to continue? (y/n): y

○ creating a new s3 bucket: this-config-fails-36f0f6ff ✓
○ creating a new cloudwatch log group: this-config-fails ✓
○ spinning up the cluster (this will take about 30 minutes) ...

2022-02-23 19:01:51 [ℹ]  eksctl version 0.67.0
2022-02-23 19:01:51 [ℹ]  using region us-west-2
2022-02-23 19:01:51 [ℹ]  subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
2022-02-23 19:01:51 [ℹ]  subnets for us-west-2b - public:192.168.32.0/19 private:192.168.128.0/19
2022-02-23 19:01:51 [ℹ]  subnets for us-west-2c - public:192.168.64.0/19 private:192.168.160.0/19
2022-02-23 19:01:51 [!]  Custom AMI detected for nodegroup cx-operator. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ]  nodegroup "cx-operator" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!]  Custom AMI detected for nodegroup cx-prometheus. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ]  nodegroup "cx-prometheus" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [!]  Custom AMI detected for nodegroup cx-wd-tmp. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:01:51 [ℹ]  nodegroup "cx-wd-tmp" will use "ami-002539dd2c532d0a5" [AmazonLinux2/1.21]
2022-02-23 19:01:51 [ℹ]  using Kubernetes version 1.21
2022-02-23 19:01:51 [ℹ]  creating EKS cluster "this-config-fails" in "us-west-2" region with un-managed nodes
2022-02-23 19:01:51 [ℹ]  3 nodegroups (cx-operator, cx-prometheus, cx-wd-tmp) were included (based on the include/exclude rules)
2022-02-23 19:01:51 [ℹ]  will create a CloudFormation stack for cluster itself and 3 nodegroup stack(s)
2022-02-23 19:01:51 [ℹ]  will create a CloudFormation stack for cluster itself and 0 managed nodegroup stack(s)
2022-02-23 19:01:51 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ]  CloudWatch logging will not be enabled for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=this-config-fails'
2022-02-23 19:01:51 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "this-config-fails" in "us-west-2"
2022-02-23 19:01:51 [ℹ]  2 sequential tasks: { create cluster control plane "this-config-fails", 3 sequential sub-tasks: { 2 sequential sub-tasks: { wait for control plane to become ready, tag cluster }, 1 task: { create addons }, 3 parallel sub-tasks: { create nodegroup "cx-operator", create nodegroup "cx-prometheus", create nodegroup "cx-wd-tmp" } } }
2022-02-23 19:01:51 [ℹ]  building cluster stack "eksctl-this-config-fails-cluster"
2022-02-23 19:01:52 [ℹ]  deploying stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:22 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:02:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:03:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:04:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:05:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:06:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:07:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:08:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:09:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:10:52 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:11:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:12:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:13:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:14:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-cluster"
2022-02-23 19:16:54 [✔]  tagged EKS cluster (cortex.dev/cluster-name=this-config-fails)
2022-02-23 19:18:55 [!]  OIDC is disabled but policies are required/specified for this addon. Users are responsible for attaching the policies to all nodegroup roles
2022-02-23 19:18:55 [ℹ]  creating addon
2022-02-23 19:23:25 [ℹ]  addon "vpc-cni" active
2022-02-23 19:23:25 [ℹ]  building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:25 [!]  Custom AMI detected for nodegroup cx-wd-tmp, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ]  building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:25 [!]  Custom AMI detected for nodegroup cx-operator, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:25 [ℹ]  building nodegroup stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:25 [!]  Custom AMI detected for nodegroup cx-prometheus, using legacy nodebootstrap mechanism. Please refer to https://github.com/weaveworks/eksctl/issues/3563 for upcoming breaking changes
2022-02-23 19:23:26 [ℹ]  deploying stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:26 [ℹ]  deploying stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:26 [ℹ]  deploying stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:26 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:42 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:23:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:23:57 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:23:59 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:02 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:16 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:16 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:20 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:33 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:36 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:40 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:24:51 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:24:53 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:24:55 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:10 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:11 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:11 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:25 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:28 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:30 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:25:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:25:44 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:25:50 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:00 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:00 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:10 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:17 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:20 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:25 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:33 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:40 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:41 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:51 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-wd-tmp"
2022-02-23 19:26:56 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-operator"
2022-02-23 19:26:57 [ℹ]  waiting for CloudFormation stack "eksctl-this-config-fails-nodegroup-cx-prometheus"
2022-02-23 19:26:57 [ℹ]  waiting for the control plane availability...
2022-02-23 19:26:57 [✔]  saved kubeconfig as "/root/.kube/config"
2022-02-23 19:26:57 [ℹ]  1 task: { suspend ASG processes for nodegroup cx-wd-tmp }
2022-02-23 19:26:58 [ℹ]  suspended ASG processes [AZRebalance] for cx-wd-tmp
2022-02-23 19:26:58 [✔]  all EKS cluster resources for "this-config-fails" have been created
2022-02-23 19:26:58 [ℹ]  adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-OG2YBA75HYPE" to auth ConfigMap
2022-02-23 19:26:58 [ℹ]  nodegroup "cx-operator" has 0 node(s)
2022-02-23 19:26:58 [ℹ]  waiting for at least 2 node(s) to become ready in "cx-operator"
2022-02-23 19:27:30 [ℹ]  nodegroup "cx-operator" has 2 node(s)
2022-02-23 19:27:30 [ℹ]  node "ip-192-168-20-129.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ]  node "ip-192-168-88-85.us-west-2.compute.internal" is ready
2022-02-23 19:27:30 [ℹ]  adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-75KAP1SXG8SQ" to auth ConfigMap
2022-02-23 19:27:30 [ℹ]  nodegroup "cx-prometheus" has 0 node(s)
2022-02-23 19:27:30 [ℹ]  waiting for at least 1 node(s) to become ready in "cx-prometheus"
2022-02-23 19:28:32 [ℹ]  nodegroup "cx-prometheus" has 1 node(s)
2022-02-23 19:28:32 [ℹ]  node "ip-192-168-54-32.us-west-2.compute.internal" is ready
2022-02-23 19:28:32 [ℹ]  adding identity "arn:aws:iam::<MASKED>:role/eksctl-this-config-fails-nodegrou-NodeInstanceRole-JLK3EF72JQAV" to auth ConfigMap
2022-02-23 19:28:32 [ℹ]  nodegroup "cx-wd-tmp" has 0 node(s)
2022-02-23 19:28:32 [ℹ]  waiting for at least 1 node(s) to become ready in "cx-wd-tmp"
2022-02-23 19:31:00 [ℹ]  nodegroup "cx-wd-tmp" has 1 node(s)
2022-02-23 19:31:00 [ℹ]  node "ip-192-168-72-237.us-west-2.compute.internal" is ready
2022-02-23 19:33:01 [ℹ]  kubectl command should work with "/root/.kube/config", try 'kubectl get nodes'
2022-02-23 19:33:01 [✔]  EKS cluster "this-config-fails" in "us-west-2" region is ready

○ updating cluster configuration ✓
○ configuring networking (this will take a few minutes) ✓
○ configuring autoscaling ✓
○ configuring async gateway ✓
○ configuring logging ✓
○ configuring metrics ✓
○ configuring gpu support (for nodegroups that may require it) ✓
○ configuring inf support (for nodegroups that may require it) ✓
○ starting operator ✓
○ starting controller manager ✓
○ waiting for load balancers .............................................................................................................................................................................................................................................................................................................................................

timeout has occurred when validating your cortex cluster

debugging info:
operator pod name: pod/operator-controller-manager-6f8bb85b96-clqxf
operator pod is ready: true
operator endpoint: <MASKED>.elb.us-west-2.amazonaws.com
operator curl response:
{}

additional networking events:
LAST SEEN   TYPE     REASON                 OBJECT                            MESSAGE
30m         Normal   EnsuringLoadBalancer   service/ingressgateway-apis       Ensuring load balancer
30m         Normal   EnsuredLoadBalancer    service/ingressgateway-apis       Ensured load balancer
30m         Normal   EnsuringLoadBalancer   service/ingressgateway-operator   Ensuring load balancer
30m         Normal   EnsuredLoadBalancer    service/ingressgateway-operator   Ensured load balancer
30m         Normal    Scheduled   pod/ingressgateway-apis-69465f9956-gzxtf       Successfully assigned istio-system/ingressgateway-apis-69465f9956-gzxtf to ip-192-168-20-129.us-west-2.compute.internal
30m         Normal    Pulling     pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m         Normal    Pulling     pod/ingressgateway-apis-69465f9956-gzxtf       Pulling image "quay.io/cortexlabs/istio-proxy:0.42.0"
30m         Normal    Pulled      pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 4.987000991s
30m         Normal    Created     pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Created container istio-proxy
30m         Normal    Started     pod/ingressgateway-operator-7b54fcf5cd-gsvcb   Started container istio-proxy
30m         Normal    Pulled      pod/ingressgateway-apis-69465f9956-gzxtf       Successfully pulled image "quay.io/cortexlabs/istio-proxy:0.42.0" in 6.764940388s
30m         Normal    Started     pod/ingressgateway-apis-69465f9956-gzxtf       Started container istio-proxy
30m         Normal    Created     pod/ingressgateway-apis-69465f9956-gzxtf       Created container istio-proxy
30m         Warning   Unhealthy   pod/ingressgateway-apis-69465f9956-gzxtf       Readiness probe failed: Get "http://192.168.3.27:15021/healthz/ready": dial tcp 192.168.3.27:15021: connect: connection refused


please run `cortex cluster down` to delete the cluster before trying to create this cluster again
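
Since the timeout happens while waiting on the two network load balancers, one way to narrow this down independently of cortex (standard AWS CLI commands, not anything from the cortex docs) is to inspect the NLBs and their target health directly:

# list load balancers and their state in the failing region
aws elbv2 describe-load-balancers --region us-west-2 \
  --query 'LoadBalancers[].{name:LoadBalancerName,state:State.Code,dns:DNSName}'

# then, using a load balancer ARN from the full output above,
# check whether the registered targets are passing health checks
aws elbv2 describe-target-groups --region us-west-2 \
  --load-balancer-arn <load-balancer-arn>
aws elbv2 describe-target-health --region us-west-2 \
  --target-group-arn <target-group-arn>

If the targets show as unhealthy there, that would point at the istio-proxy readiness failure visible in the events above (connection refused on port 15021) rather than at the load balancers themselves.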

Additional context

I've only tested us-west-2 and us-east-2 so far. I've repeated the experiment a number of times. I see consistent failure when region is us-west-2 and consistent success when region is us-east-2.

A search in the Slack channel for "timeout has occurred when validating your cortex cluster" shows that this issue is fairly common; I see four or five reports of it in the last year.

us-west-2 is my default region.

@bddap bddap added the bug Something isn't working label Feb 23, 2022
@bddap
Author

bddap commented Feb 23, 2022

function validate_cortex() {
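
(The line above quotes the opening of the validate_cortex shell function, presumably the routine that prints the trailing dots and the timeout message during cortex cluster up. As a rough illustration only — the endpoint path, poll interval, and timeout below are assumptions, not the actual cortex source:)

# illustrative sketch, not the real validate_cortex
validate_cortex_sketch() {
  local endpoint="$1"                  # operator load balancer hostname
  local deadline=$((SECONDS + 1800))   # assumed 30-minute budget
  while ((SECONDS < deadline)); do
    # succeed once the operator responds through the load balancer
    if curl --silent --fail --max-time 5 "http://${endpoint}/" >/dev/null; then
      return 0
    fi
    printf '.'
    sleep 5
  done
  echo "timeout has occurred when validating your cortex cluster"
  return 1
}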

@deliahu
Member

deliahu commented Feb 26, 2022

I just tried creating a new cluster with the cluster configuration you provided, and it worked for me in us-west-2. I ran this from the master branch, but there have not been any changes that should affect the cluster creation process since the v0.42.0 release. Do you mind trying again?

@bddap
Author

bddap commented Jul 20, 2022

Tried again. Same results.
