Expected Behavior
A TaskRun never leaving Pending state, with its underlying pod never started, should have this fact made clear in log storage.
Actual Behavior
No such information is stored.
Steps to Reproduce the Problem
1. Define ResourceQuotas such that Pods cannot be started in a namespace (a hypothetical reproducer sketch follows these steps)
2. Start a TaskRun with a timeout
3. Analyze Results after that TaskRun times out
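For illustration only, a minimal sketch of the two objects involved; the names, quota value, timeout, and step image below are my assumptions, not taken from this issue:

```go
package repro

import (
	"time"

	pipelinev1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reproObjects (hypothetical) builds a ResourceQuota that forbids Pods in the
// namespace and a TaskRun with a timeout, so the TaskRun sits in Pending until
// the timeout fires and its Pod never gets started.
func reproObjects(ns string) (*corev1.ResourceQuota, *pipelinev1.TaskRun) {
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "no-pods", Namespace: ns},
		Spec: corev1.ResourceQuotaSpec{
			// "pods: 0" means no Pod can ever be created in this namespace.
			Hard: corev1.ResourceList{corev1.ResourcePods: resource.MustParse("0")},
		},
	}

	tr := &pipelinev1.TaskRun{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "stuck-pending-", Namespace: ns},
		Spec: pipelinev1.TaskRunSpec{
			// Short timeout so the run is cancelled while still Pending.
			Timeout: &metav1.Duration{Duration: 2 * time.Minute},
			TaskSpec: &pipelinev1.TaskSpec{
				Steps: []pipelinev1.Step{{Name: "noop", Image: "busybox", Script: "echo hello"}},
			},
		},
	}
	return quota, tr
}
```

Creating both in the same namespace (via kubectl or a typed client) leaves the TaskRun Pending until its timeout fires, which is the state Results then has to describe.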
Additional Info
With #699 we fixed the situation in general where if a timeout/cancel occurred, we would still go on to fetch/store the underlying pod logs.
However, in systems with quotas or severe node pressure at the k8s level, TaskRuns can stay stuck in Pending and any created Pods will never get started.
If you see the comments at results/pkg/watcher/reconciler/dynamic/dynamic.go, line 512 in c34e40d:
// KLUGE: tkn reader.Read() will raise an error if a step in the TaskRun failed and there is no
you'll see the prior observation that tkn makes distinguishing the kinds of errors difficult, and thus errors from tkn while getting logs are ignored.
That is proving unusable for users who may not have access to view events, pods, or etcd entities in general before the attempt to store logs occurs and the pipelinerun/taskrun are potentially pruned from etcd.
Before exiting, the streamLogs code needs to confirm whether any underlying Pods for the TaskRuns exist, and if not, store any helpful debug info in what is sent to the GRPC UpdateLog call and/or direct S3 storage. In particular (see the sketch after this list):
- the TaskRun yaml
- a listing of Pods in the namespace in question
- a list of events for the TaskRun ... i.e. the eventList retrieved at results/pkg/watcher/reconciler/dynamic/dynamic.go, line 673 in c34e40d
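To make that concrete, here is a rough sketch of the kind of helper streamLogs could call before bailing out when no Pod exists; the function name writePendingDebugInfo, the io.Writer destination, and the plain kubernetes.Interface client are assumptions on my part, not the existing dynamic.go plumbing:

```go
package dynamic

import (
	"context"
	"fmt"
	"io"

	pipelinev1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/yaml"
)

// writePendingDebugInfo (hypothetical) records why there are no step logs for a
// TaskRun whose Pod never started: the TaskRun yaml, the Pods in the namespace,
// and the events for the TaskRun. The writer would be whatever backs the GRPC
// UpdateLog call or the direct S3 upload.
func writePendingDebugInfo(ctx context.Context, kc kubernetes.Interface, tr *pipelinev1.TaskRun, w io.Writer) error {
	// 1. The TaskRun yaml.
	trYAML, err := yaml.Marshal(tr)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "--- TaskRun %s/%s ---\n%s\n", tr.Namespace, tr.Name, trYAML)

	// 2. A listing of Pods in the namespace, showing that none belongs to this TaskRun.
	pods, err := kc.CoreV1().Pods(tr.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "--- Pods in %s ---\n", tr.Namespace)
	for _, p := range pods.Items {
		fmt.Fprintf(w, "%s\t%s\n", p.Name, p.Status.Phase)
	}

	// 3. Events for the TaskRun, which typically carry the quota/scheduling failure reason.
	events, err := kc.CoreV1().Events(tr.Namespace).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s", tr.Name),
	})
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "--- Events for %s ---\n", tr.Name)
	for _, e := range events.Items {
		fmt.Fprintf(w, "%s\t%s\t%s\n", e.Type, e.Reason, e.Message)
	}
	return nil
}
```

In practice this would only run when the Pod referenced by the TaskRun status is empty or cannot be found, and the output would be framed however the UpdateLog stream or S3 object expects.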
I'll also attach a PR/TR which was timed out/cancelled where the TaskRun never left Pending state.
You'll see from the annotations that they go from pending straight to a terminal state, meaning a pod never got associated.
pr-tr.zip
@khrm @sayan-biswas @avinal @enarha FYI / PTAL / WDYT