[env] Do not share config between tests #396

Merged

Conversation

Fricounet
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Following discussions in #367

This PR fixes a bug where the config object was shared between tests. When running tests in parallel, it was impossible to rely on fields like the namespace because they could be overwritten by another test. It also caused tests run with -race to fail, because the shared config.klient object was updated concurrently by each test while initializing the client.

This PR creates a new deepCopyConfig method that makes a deep copy of the config object, so each test can have its own copy without impacting the others. The config now has the following lifecycle:

  • It is created when a testEnv is created
  • Each test uses a child testEnv that inherits the main testEnv's config
  • Each feature uses a child testEnv that inherits the test's testEnv's config

This way, a feature inherits all the changes made in BeforeEach functions without impacting the other features. (A generic sketch of the deep-copy idea is shown below.)
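For readers wondering why a plain struct assignment is not enough, here is a minimal, generic sketch of the deep-copy idea. It uses a hypothetical config type, not the framework's envconf.Config, and only illustrates the pattern: value fields can be copied by assignment, but reference fields (maps, slices, pointers) need freshly allocated copies or they remain shared.

package main

import "fmt"

// hypotheticalConfig stands in for a config object; its fields are
// illustrative and do not match envconf.Config.
type hypotheticalConfig struct {
	namespace string
	labels    map[string]string
}

// deepCopy returns a copy that shares no mutable state with the original.
func (c *hypotheticalConfig) deepCopy() *hypotheticalConfig {
	cp := *c // copies the namespace string and the map header, not the map contents
	cp.labels = make(map[string]string, len(c.labels))
	for k, v := range c.labels {
		cp.labels[k] = v
	}
	return &cp
}

func main() {
	parent := &hypotheticalConfig{namespace: "ns-a", labels: map[string]string{"team": "x"}}
	child := parent.deepCopy()
	child.namespace = "ns-b"
	child.labels["team"] = "y"
	fmt.Println(parent.namespace, parent.labels) // ns-a map[team:x], unchanged by the child
}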

Which issue(s) this PR fixes:

Fixes #352
Fixes #384

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Improve how Config is shared across tests so that it can be used without any issues in parallel tests

Additional documentation e.g., Usage docs, etc.:


@k8s-ci-robot added the kind/feature, cncf-cla: yes, and needs-ok-to-test labels Apr 8, 2024
@k8s-ci-robot
Contributor

Hi @Fricounet. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot requested a review from cpanato April 8, 2024 16:01
@k8s-ci-robot added the sig/testing label Apr 8, 2024
@k8s-ci-robot added the size/L label Apr 8, 2024
@Fricounet
Contributor Author

/assign @vladimirvivien

Contributor

@phisco phisco left a comment


Thanks @Fricounet! @maruina pinged me to have a look at this. I agree that a shared-nothing approach makes more sense here than putting a lock on the whole config and having to remember to handle it properly. Left a few minor comments 🙏

pkg/env/env.go Outdated
	return &testEnv{
		ctx:     e.ctx,
		cfg:     e.deepCopyConfig(),
		actions: e.actions,
Contributor

Should we create a new slice here, as we do in WithContext, to avoid possible race conditions on that too?

Contributor Author

Sure. It's not a perfect copy though: since the underlying funcs, featureFuncs, and testFuncs fields of each action are also slices, they are still shared. But I don't think there is any way in the current code to alter those slices after creation, so it should not cause any issue even if we only copy the []action slice.

Would you prefer a deep copy of the actions as well, to be safe?

Contributor Author

I've copied the actions like it is done in WithContext for now, and I've also created a child context so that none of the testEnv's fields are shared.
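As a side note for readers, here is a standalone illustration (plain Go, not the framework's code) of why copying the slice matters: copying only the slice header keeps the same backing array, so parent and child would otherwise see each other's writes.

package main

import "fmt"

func main() {
	parent := []string{"setup", "assess"}

	// Allocate a new backing array and copy the elements into it.
	child := make([]string, len(parent))
	copy(child, parent)

	child[0] = "changed"
	fmt.Println(parent) // [setup assess]: the parent slice is unaffected
	fmt.Println(child)  // [changed assess]
}

Note that this only copies one level deep: if the elements themselves contained slices (like the funcs, featureFuncs, and testFuncs fields mentioned above), those would still be shared.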

-// deepCopyFeature just copies the values from the Feature but creates a deep
+// deepCopyConfig just copies the values from the Config to create a deep
 // copy to avoid mutation when we just want an informational copy.
 func (e *testEnv) deepCopyConfig() *envconf.Config {
Contributor

I know we do something similar for deepCopyFeature, but that case is simpler, so the risk of forgetting to update it for new fields is lower. Here I feel that risk would be higher, so would it make sense to either autogenerate it or rely on reflect.DeepCopy directly? The performance hit would probably not be that relevant here.

Contributor Author

I tried using xreflect.DeepCopy; however, it does not quite work:

  • it panics when one of the struct fields of Config is nil, with panic: reflect: call of reflect.Value.Set on zero Value
  • it panics when trying to copy the regex values, with panic: reflect.Set: value of type *syntax.Prog is not assignable to type *regexp.onePassProg

I didn't dig deeper, but I think a bunch of fixes would be needed in xreflect before it could be used 😞

As for autogenerating it, I hadn't considered that, mainly because I have no idea how to do it 😅 and I don't know whether it could handle the deep copy of complex struct fields like the regex ones. Still, if you think it would work, I can try. In that case, could you link some documentation I can use to get started?

What do you think?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Fricounet
Once this PR has been reviewed and has the lgtm label, please ask for approval from vladimirvivien. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pmalek
Contributor

pmalek commented Apr 16, 2024

I've tested this with a simple test (from #352) by checking out the branch locally and adding a replace directive:

replace sigs.k8s.io/e2e-framework => ../e2e-framework
package main

import (
	"context"
	"os"
	"testing"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/e2e-framework/klient/conf"
	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var tenv env.Environment

func TestMain(m *testing.M) {
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		panic(err)
	}
	cfg.WithKubeconfigFile(conf.ResolveKubeConfigFile())
	tenv = env.NewWithConfig(cfg)

	os.Exit(tenv.Run(m))
}

func Setup(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
	res := c.Client().Resources()
	podList := corev1.PodList{}
	err := res.List(ctx, &podList)
	if err != nil {
		t.Fatalf("failed listing pods: %v", err)
	}
	return ctx
}

func Assess(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
	return ctx
}

func TestX1(t *testing.T) {
	t.Parallel()
	tenv.Test(t, generateFeat())
}

func TestX2(t *testing.T) {
	t.Parallel()
	tenv.Test(t, generateFeat())
}

func generateFeat() features.Feature {
	return features.New("feature").
		WithSetup("setup", Setup).
		Assess("A", Assess).
		Feature()
}

(I needed to add additional replace directives to work around #399)

and I still get a data race:

go test -v -race -count 1 -trimpath .
=== RUN   TestX1
=== PAUSE TestX1
=== RUN   TestX2
=== PAUSE TestX2
=== CONT  TestX2
=== CONT  TestX1
==================
WARNING: DATA RACE
Write at 0x00c00030ea80 by goroutine 26:
  sigs.k8s.io/e2e-framework/pkg/envconf.(*Config).Client()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/envconf/config.go:141 +0x9c
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).deepCopyConfig()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:601 +0x100
  sigs.k8s.io/e2e-framework/pkg/env.newChildTestEnv()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:130 +0xe4
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).processTests()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:260 +0x48
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Test()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:346 +0x60
  e2e-framework-bug-352.TestX1()
      e2e-framework-bug-352/main_test.go:44 +0xb8
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.(*T).Run.gowrap1()
      testing/testing.go:1742 +0x40

Previous write at 0x00c00030ea80 by goroutine 27:
  sigs.k8s.io/e2e-framework/pkg/envconf.(*Config).Client()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/envconf/config.go:141 +0x9c
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).deepCopyConfig()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:601 +0x100
  sigs.k8s.io/e2e-framework/pkg/env.newChildTestEnv()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:130 +0xe4
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).processTests()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:260 +0x48
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Test()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:346 +0x60
  e2e-framework-bug-352.TestX2()
      e2e-framework-bug-352/main_test.go:49 +0xb8
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.(*T).Run.gowrap1()
      testing/testing.go:1742 +0x40

Goroutine 26 (running) created at:
  testing.(*T).Run()
      testing/testing.go:1742 +0x5e4
  testing.runTests.func1()
      testing/testing.go:2161 +0x80
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.runTests()
      testing/testing.go:2159 +0x6e0
  testing.(*M).Run()
      testing/testing.go:2027 +0xb74
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Run()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:408 +0x4c0
  e2e-framework-bug-352.TestMain()
      e2e-framework-bug-352/main_test.go:25 +0x250
  main.main()
      _testmain.go:51 +0x294

Goroutine 27 (running) created at:
  testing.(*T).Run()
      testing/testing.go:1742 +0x5e4
  testing.runTests.func1()
      testing/testing.go:2161 +0x80
  testing.tRunner()
      testing/testing.go:1689 +0x180
  testing.runTests()
      testing/testing.go:2159 +0x6e0
  testing.(*M).Run()
      testing/testing.go:2027 +0xb74
  sigs.k8s.io/e2e-framework/pkg/env.(*testEnv).Run()
      sigs.k8s.io/e2e-framework@v0.3.0/pkg/env/env.go:408 +0x4c0
  e2e-framework-bug-352.TestMain()
      e2e-framework-bug-352/main_test.go:25 +0x250
  main.main()
      _testmain.go:51 +0x294
==================
=== RUN   TestX2/feature
=== RUN   TestX1/feature
=== NAME  TestX2
    testing.go:1398: race detected during execution of test
=== NAME  TestX1
    testing.go:1398: race detected during execution of test
=== RUN   TestX2/feature/A
=== RUN   TestX1/feature/A
--- FAIL: TestX2 (0.04s)
    --- PASS: TestX2/feature (0.03s)
        --- PASS: TestX2/feature/A (0.00s)
--- FAIL: TestX1 (0.04s)
    --- PASS: TestX1/feature (0.03s)
        --- PASS: TestX1/feature/A (0.00s)
FAIL
FAIL    e2e-framework-bug-352   0.532s
FAIL

@Fricounet
Contributor Author

@pmalek indeed 🤔
So it's the client again. This stems from the fact that the Client() function is not a proper getter like the other Config accessors but a mix of a setter and a getter. As a result, when it is accessed for the first time by both tests, it is nil and they both try to create it at the same time. I can think of a few ways of solving this:

  1. Create a new client getter function like GetClient() which simply acts as a getter for the client, and use it in deepCopyConfig instead of Client()
  2. Change the Client() function to be a proper getter and find a new name for the current Client(), maybe MustNewClient() since it panics in case of error?
  3. Use a lock on the client, which is what you wanted to do in feat: make Config's Client() goroutine safe #367. This would need some additional changes to make the lock available in the env package, but I think that could work

Each has its own quirks:

  1. is the easiest to set up and doesn't break the API. The drawback is that it is not in line with the current setter/getter convention on the Config object
  2. changes the API to have saner and clearer method names, but it breaks the API. However, since we are still on v0.X.Y versions, I think breaking the API can be considered fine, and arguably breaking it now to get something more future-proof is OK in my opinion
  3. I don't like using locks where we can avoid them, especially since this lock would have to be exported outside of the config pkg to be accessible in env. Anybody would then be able to play around with a lock that should stay an implementation detail

For now, I went with 1) because it is the easiest to do, but I'd rather do 2) if we agree it is fine to break the API. (A rough sketch of option 1 follows.)
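To make option 1 concrete, here is a rough sketch of what such a getter could look like inside pkg/envconf. This is illustrative only, not necessarily the merged code; it assumes the unexported client field that appears in the Client() snippet further down this page.

// GetClient is a plain getter: it returns whatever client is currently
// stored on the Config (possibly nil) and never builds one, so calling it
// concurrently, e.g. from deepCopyConfig, cannot race on initialization.
func (c *Config) GetClient() klient.Client {
	return c.client
}

deepCopyConfig would then read the client through GetClient(), while lazy creation stays in Client() (or whatever it ends up being renamed to).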

@phisco
Contributor

phisco commented Apr 16, 2024

I think client caching also has another issue, so while we're at it, it might be worth addressing that too: if one gets the client, modifies the kubeconfig or any other client-related parameter in the config, and then gets the client again, they get the original client rather than a new one built from the new kubeconfig. We could probably remove caching altogether and solve both issues.
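To make that concrete, a small illustration using the public API (the kubeconfig paths are placeholders, envconf.New() is assumed as the basic constructor, and Client() panics if the kubeconfig is unusable):

cfg := envconf.New()
cfg.WithKubeconfigFile("/path/to/kubeconfig-a")
first := cfg.Client() // builds a client from kubeconfig-a and, with caching, stores it

cfg.WithKubeconfigFile("/path/to/kubeconfig-b")
second := cfg.Client() // with caching: still the client for cluster A, not cluster B

_, _ = first, second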

@pmalek
Contributor

pmalek commented Apr 16, 2024

I think client caching also has another issue, so while at it, might be worth addressing that too: if one gets the client, modifies the kubeconfig or any other client related parameter in the config, and then gets the client again, they would get the original client and not a new one with the new kubeconfig. We could probably remove caching altogether and solve both issues.

If it's possible to go down this path, I'd vote for that option. 👍

@Fricounet
Contributor Author

@phisco @pmalek I got rid of the client caching altogether.
The drawback is that the client will be recreated every time it is accessed with config.Client(). I tried renaming my GetClient function to Client in order to have a simple getter that doesn't recreate the client each time. However, that creates issues where the config.client object is never initialized, because callers of Client do not expect to have to initialize it themselves, and we cannot initialize it ourselves within the New... constructor functions because that happens too early, when clusters might not be available yet (see #57). My attempt is available here if you want to take a look.

In any case, removing the caching altogether seems like an OK solution if we are fine with the performance cost of recreating the client every time.

@pmalek
Contributor

pmalek commented Apr 20, 2024

Drawback being that the client will be recreated every time it is accessed with config.Client()

I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

@phisco
Contributor

phisco commented Apr 20, 2024

Drawback being that the client will be recreated every time it is accessed with config.Client()

I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

+1, since this is a testing framework I feel that shouldn't be the focus, but the call is for the maintainers to make, for sure.

@harshanarayana
Contributor

I am personally not afraid of the performance costs of creating the client each time (it should be negligible, right?). Haven't measured this though.

I personally don't think the performance overhead will be that significant if we create the client each time. But we should probably keep resource usage in mind in terms of memory, to account for cases where these tests run in a sandboxed container environment. We don't want them suddenly getting OOM'ed out of nowhere, right?

@Fricounet
Contributor Author

@harshanarayana this PR does indeed create some memory overhead. However, I did some profiling and observed the following:

  • it only impacts parallel tests, since serial tests only create 1 Config object at a time
  • each config seems to weigh around 600kB in memory
    • Running 1 test in parallel: (heap profile screenshot)
    • Running 10 tests in parallel: (heap profile screenshot)
    • Running 100 tests in parallel: (heap profile screenshot)
  • this has nothing to do with whether the client is cached or not: it is the fact that a new Config object has to be created for each test/feature that creates the overhead (which is the core reason for this PR in the first place)

Considering this, I think the overhead is not a big issue because it takes heavy parallelization to show a significant impact. I personally don't see a situation where a system able to run 100 parallel tests would be constrained by the extra ~70MB of RAM. And considering that, without the changes brought by this PR, running parallel tests with the current framework is simply not practical because of all the race conditions on the shared config, I don't think we are losing anything here:

  • serial tests don't have a noticeable overhead
  • parallel tests can now be run without race conditions on the Config.

What do you think?

If you want to reproduce the profiling I did, I used the following code:

package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"runtime/pprof"
	"testing"
	"time"

	"sigs.k8s.io/e2e-framework/klient/conf"
	"sigs.k8s.io/e2e-framework/pkg/env"
	"sigs.k8s.io/e2e-framework/pkg/envconf"
	"sigs.k8s.io/e2e-framework/pkg/features"
)

var tenv env.Environment

var nParallelTests = flag.Int("n", 100, "Number of parallel tests to run")

func TestMain(m *testing.M) {
	cfg, err := envconf.NewFromFlags()
	if err != nil {
		panic(err)
	}
	cfg.WithKubeconfigFile(conf.ResolveKubeConfigFile())
	cfg.WithParallelTestEnabled()
	tenv = env.NewWithConfig(cfg)

	os.Exit(tenv.Run(m))
}

func Setup(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
	c.Client().Resources()
	time.Sleep(5 * time.Second)
	return ctx
}

func Assess(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
	return ctx
}

func TestX1(t *testing.T) {
	t.Parallel()
	feats := []features.Feature{}
	for i := 0; i < *nParallelTests; i++ {
		feats = append(feats, generateFeat())

	}
	tenv.TestInParallel(t, feats...)
}

func TestX2(t *testing.T) {
	t.Parallel()
	// Making sure all parallel features in TestX1 have started and created their client
	time.Sleep(2 * time.Second)
	tenv.Test(t, features.New("feature").
		Assess("A", func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
			file, err := os.Create(fmt.Sprintf("heap-%d.prof", *nParallelTests))
			if err != nil {
				t.Fatal(err)
			}
			defer file.Close()
			err = pprof.WriteHeapProfile(file)
			if err != nil {
				t.Fatal(err)
			}
			return ctx
		}).
		Feature())
}

func generateFeat() features.Feature {
	return features.New("feature").
		WithSetup("setup", Setup).
		Assess("A", Assess).
		Feature()
}

and then ran (for example with N=10):

N=10; go test ./ -test.v -race -kubeconfig $KUBECONFIG -args -n $N && go tool pprof -web -focus env heap-$N.prof

@pmalek
Contributor

pmalek commented Apr 22, 2024

Considering this, I think the overhead is not a big issue because it needs heavy parallelization to show a big impact. I personally don't see a situation where a system able to run 100 parallel tests would be restricted by 70MB of RAM. And considering that without the changes brought by this PR, running parallel tests with the current framework is simply not practical because of all the race conditions due to the shared config, I don't think we are losing anything here:

💯

This can always be left as an exercise for those who would like to optimize it later on. This PR will give users the possibility to run tests in parallel without data races; let's focus on that. If there's ever a need (I doubt it) to make this more efficient, we can work on it then.

Thanks for all the tests that you've done @Fricounet 👍

@vladimirvivien
Contributor

I appreciate the thoroughness of the analysis in this PR. Thank you all.

@harshanarayana
Contributor

/retest

@k8s-ci-robot
Contributor

@Fricounet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-e2e-framework-verify
Commit: 95976f7
Details: link
Required: true
Rerun command: /test pull-e2e-framework-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Comment on lines 133 to 139
 func (c *Config) Client() klient.Client {
-	if c.client != nil {
-		return c.client
-	}
-
-	client, err := klient.NewWithKubeConfigFile(c.kubeconfig)
+	client, err := c.NewClient()
 	if err != nil {
-		panic(fmt.Errorf("envconfig: client failed: %w", err).Error())
+		panic(err)
 	}
-	c.client = client
-	return c.client
+	return client
 }
Contributor

⚠️ I actually realised this would prevent injecting a client without a kubeconfig in the Config, as we would always have to go through klient.NewWithKubeConfigFile(c.kubeconfig) instead of relying on the injected one, which AFAICT is a supported use case at the moment!

Contributor

See #413 for an attempt at solving this

Contributor Author

Do you have a real example of this being an issue?
Because currently I can't figure out a situation where this happens. And you can still do GetClient to retrieve the client, right?

Contributor

It's just that this would be a behavior change: currently, someone creating a config with an injected client always gets that same client back when calling Client(). With this change, one would have to call GetClient() explicitly to get it back, while calling Client() would result in an error if, for example, no kubeconfig was configured, which is something totally acceptable if you just injected your own client.
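For illustration only (WithClient is assumed here as the injection point and fakeClient is a placeholder; adjust to the real API):

cfg := envconf.New()
cfg.WithClient(fakeClient) // inject a pre-built client, no kubeconfig configured

// Before this change: Client() returns the injected client.
// After this change:  Client() tries to build a fresh client from the (empty)
//                     kubeconfig and fails, while the injected client is only
//                     reachable through GetClient().
c := cfg.Client()
_ = c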

Contributor Author

OK, it makes sense then. If you can find a way to support this without overcomplicating things, I'm all in.
Otherwise, maybe we could consider making a breaking change? Since we are on version v0.x.y, I feel we can afford breaking changes if it means we get more stable foundations to run e2e tests at scale.

Contributor

@phisco phisco May 13, 2024

See my draft PR; unfortunately, I couldn't get the tests to run yet.

@k8s-ci-robot merged commit 56e4a65 into kubernetes-sigs:main Jul 8, 2024
1 of 2 checks passed
Labels
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/feature - Categorizes issue or PR as related to a new feature.
  • needs-ok-to-test - Indicates a PR that requires an org member to verify it is safe to test.
  • sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not share config across tests
Calling Config's Client() in tests running parallel yields a data race
6 participants