topology-aware: internal error from changing containers' NUMA nodes by adjusting AvailableResources #92

askervin · 2023-07-10T08:43:06Z

Assume that a container runs on CPUs of NUMA node 0.

An admin wants to reorganize server resources so that containers no longer use CPUs on NUMA node/die/socket 0, and does this by removing those CPUs from AvailableResources.

When this is done, restarting the topology-aware NRI plugin with the new configuration fails with an internal error:

E0710 07:30:57.289447       1 nri.go:784] <= Synchronize FAILED: failed to start policy topology-aware: topology-aware: failed to start:
topology-aware: failed to restore allocations from cache:
topology-aware: failed to allocate <CPU request pod0/pod0c0: exclusive: 3><Memory request: limit:95.37M, req:95.37M> from <NUMA node #1 allocatable: MemLimit: DRAM 1.85G>:
topology-aware: internal error: NUMA node #1: can't slice 3 exclusive CPUs from , 0m available

Let's discuss whether this is a bug or expected behavior, or whether we should provide a configuration option for forcing new CPU/memory pinning, even if it would lead to costly memory accesses/moves.

The current workaround for this error is to delete the cache, which forces reassignment of resources from scratch. Both this workaround and draining the node before changing AvailableResources are heavier operations than forcing new pinning would be.


Test that a running container gets reassigned to new CPUs when the CPUs it used to run on are no longer included in AvailableResources. Tests issue containers#92.

askervin mentioned this issue Jul 10, 2023

e2e: test topology-aware NUMA node change by AvailableResources #93

Draft
