Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Revert "Remove CallContext and its ThreadLocal usage" #1000

Merged
merged 35 commits into from
Feb 22, 2025

Conversation

eric-maynard
Copy link
Contributor

This fully reverts #589, returning CallContext usage to its previous state before CDI

@eric-maynard eric-maynard marked this pull request as ready for review February 20, 2025 07:01
* underlying nature of the persistence layer may differ between different realms.
*/
public interface CallContext extends AutoCloseable {
InheritableThreadLocal<CallContext> CURRENT_CONTEXT = new InheritableThreadLocal<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this revert, as long as we all agree to remove thread locals (again) after merging this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah after doing this revert and debugging all the test issues with ThreadLocal I am very eager to see ThreadLocal gone and injection used consistently.

private final List<TaskHandler> taskHandlers = new CopyOnWriteArrayList<>();

public TaskExecutorImpl(
CallContext callContext,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this basically is coming from QuarkusTaskExecutorImpl which can't possibly be injecting the correct CallContext, let's leave this out of the constructor for now. We need to figure out how to propagate the original call context for the ManifestFileCleanupTaskHandler and TableCleanupTaskHandler in different ways.

The pattern established here is that addTaskHandlerContext apparently is how we preserve a CallContext manually. So we just need to make sure that thing plumbs through.

Copy link
Contributor

@adutra adutra Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes,CallContext should not be injected but cloned. Here is how it was handled before:

public void addTaskHandlerContext(long taskEntityId, CallContext callContext) {
// Unfortunately CallContext is a request-scoped bean and must be cloned now,
// because its usage inside the TaskExecutor thread pool will outlive its
// lifespan, so the original CallContext will eventually be closed while
// the task is still running.
// Note: PolarisCallContext has request-scoped beans as well, and must be cloned.
// FIXME replace with context propagation?
CallContext clone = CallContext.copyOf(callContext);
tryHandleTask(taskEntityId, clone, null, 1);
}

The reason is that CallContext being a request-scoped bean, it becomes tricky to pass it on to a different thread pool, as the instance might survive there longer than the HTTP request itself. Thus it becomes necessary to create a copy of it and pass the copy to the executor. (The copy must be a deep clone: PolarisCallContext must be cloned too, especially because of the metastore session bean.)

Side note 1: Quarkus has context propagation techniques, and in fact, the CallContext bean would be properly propagated to the executor, because the executor is a Quarkus-managed executor:

public ManagedExecutor taskExecutor(TaskHandlerConfiguration config) {
return SmallRyeManagedExecutor.builder()
.injectionPointName("task-executor")
.propagated(ThreadContext.ALL_REMAINING)
.maxAsync(config.maxConcurrentTasks())
.maxQueued(config.maxQueuedTasks())
.build();
}

But my findings showed that this will not work in this case, and I believe it's because CallContext has a close() method that is called when the request processing ends:

public void closeCallContext(@Disposes CallContext callContext) {
callContext.close();
}

Side note2: I also think the CallContext.close() method should be removed and the catalog-closing logic should be refactored. I will log an issue for that as soon as this PR is merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering: is this a clean revert? Why isn't this class back to the exact state it had before the removal of CallContext?

According to git log, it should correspond to commit 0a2f9df, which is the commit before the CallContext removal by b84f462.

So this class should look like this:

https://github.com/apache/polaris/blob/0a2f9df435cde79f65fed49a20a40bd96f5d1f37/service/common/src/main/java/org/apache/polaris/service/task/TaskExecutorImpl.java

In that commit, CallContext wasn't being injected through the constructor. And afaict, it wasn't either in any of the previous revisions of this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's definitely not a clean revert unfortunately. There were very many conflicts with the revert applied. Taking a look at TaskExecutorImpl.java

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah looks like with the latest fix plumbing ctx through handleTask it should work now. In general it's probably best to continue handling CallContext somewhat explicitly because even though right now the "background processing" is able to just be best-effort within the same process, the design of having TaskEntity persistence really is supposed to reflect reliable task execution on server failures.

In which case, we need a formal model for creating a "SystemCallContext" of sorts that may or may not "impersonate" parts of the original call context that resulted in the system event. And we'd need to be able to serialize such CallContext state into the TaskEntity and restore it on server startup to create the CallContext with which to perform the system tasks.

Incidentally, this also means there are intended flows that aren't technically @RequestScoped.

protected void handleTask(long taskEntityId, RealmContext realmContext, int attempt) {
protected void handleTask(long taskEntityId, CallContext ctx, int attempt) {
// set the call context INSIDE the async task
CallContext.setCurrentContext(ctx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, can we passthrough the CallContext to boolean success = handler.handleTask(task); below? Like

boolean success = handler.handleTask(task, callContext);

I guess that requires changing the signature but right now the underlying handler doesn't even use the threadlocal so this setCurrentContext doesn't help it at all. Do we know if anything downstream here actually does require the threadlocal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could also do both to "be safe", but in general any setCurrentContext that isn't paired up with a try {...} finally { CallContext.setCurrentContext(null); } seems dangerous.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that was done before:

protected void handleTask(long taskEntityId, CallContext ctx, int attempt) {
// set the call context INSIDE the async task
CallContext.setCurrentContext(ctx);

(We weren't clearing the context in a finally block though – but I don't think it's necessary since the thread local is being set at the very beginning of every task execution. It wouldn't hurt to add it though.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed per this comment -- good catch!

Copy link
Contributor

@dennishuo dennishuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@dennishuo dennishuo merged commit ccf25df into apache:main Feb 22, 2025
5 checks passed
eric-maynard added a commit to eric-maynard/polaris that referenced this pull request Mar 13, 2025
* Publish proposals and roadmap on the website (apache#1043)

* Allow Polaris CLI to set property values that contain '=' (apache#1003)

Currently the CLI parsing is too pessimistic and throws an error if
it detects '=' in the parsed value of a property. However, the split 1
semantics are already standard behavior for most parsers where the
right-hand-side is allowed to contain '='.

This commonly comes up if the value is itself a comma-separated kv list
or if it the value is a base64-encoded string.

* main: Update postgres Docker tag to v17.4 (apache#1049)

* main: Update dependency org.junit:junit-bom to v5.12.0 (apache#1045)

* Revert "Remove CallContext and its ThreadLocal usage" (apache#1000)

* revert b84f462

* resolve conflicts

* autolint

* autolint

* ...not possible to automatically add a synthetic no-args constructor to an unproxyable bean class.

* fix

* fixed non-quarkus build issues

* fixed non-quarkus build issues

* autolint

* more di issues

* add more producers

* add another producer

* autolint

* change callcontext usage in FileIOFactoryTest

* stablize tests

* autolint

* fix a test

* refactor TableCleanupTaskHandlerTest

* another fix, still npe

* autolint

* only task cleanup is problematic

* autolint

* Stabilize more tests

* autolint

* Avoid propagating old callcontexts across TestServices instances

* testservices update

* autolint

* Revert "autolint"

This reverts commit 66a7f22.

* Revert "testservices update"

This reverts commit 897a5a1.

* changes to taskexecutorimpl

* autolint

---------

Co-authored-by: Dennis Huo <dennis.huo+oss@snowflake.com>

* Enable console color when running Polaris with Gradle (apache#977)

* CLI: add subcommand access for principals (apache#1019)

* Access subcommand access for principals

* Access subcommand access for principals

* Fix getting start spark (apache#1054)

* Use correct link format for adoc and update menu title for proposals (apache#1051)

* Add a debug log with the full exception (apache#1023)

* clean up

* spotless

* clean up

* refactor message

* no CVEM

* change log levels

* revert polaris-java dep change

* add testImplementation

* uppercase

* Remove all tightly coupled EntityCache dependencies in the main persistence stack (apache#1055)

Remove the EntityCacheEntry wrapper since the timestamp fields were never used anyways;
instead the underlying Caffeine cache transparently handles access times, and the
types of entries we cache are simply the existing ResolvedPolarisEntity.

Remove interactions of business logic with explicit "cache entries", instead
operating on ResolvedPolarisEntity. Fix the equals()/hashCode() behavior of
PolarisEntity to be compatible with PolarisBaseEntity as intended.

Improve code comments to explain the (current) relationship between PolarisEntity
and PolarisBaseEntity, and clarify the behaviors in Resolver.java.

Fully remove the PolarisRemoteCache interface and its methods. Add different
methods that aren't cache-specific instead.

* Remove PolarisMetaStoreSession from FileIOFactory/FileIOUtil in favor of CallContext (apache#1057)

This appeared to be some leaky divergence that occurred after CallContext had been
removed, but PolarisMetaStoreSession really is only a low-level implementation detail
that should never be handled by BasePolarisCatalog/FileIOFactory.

This plumbs CallContext explicitly into the FileIOFactory and FileIOUtil methods and
thus removes a large source of CallContext.getCurrentContext calls; now the threadlocal
doesn't have to be set at all in BasePolarisCatalogTest.

* Cleanup Iceberg Manifest list files on Purge (apache#1039)

* Fix small quickstart typo (apache#1061)

* Enable rat check on md files and fix the existing ones (apache#1050)

* main: Update dependency boto3 to v1.37.0 (apache#1052)

* main: Update dependency software.amazon.awssdk:bom to v2.30.27 (apache#1053)

* Integration tests: improve support for temporary folders (apache#987)

3 integration tests currently require a temporary folder for various purposes. The folders are generally hard-coded to some local filesystem location, which makes those tests impossible to execute against a remote Polaris instance.

This PR aims at fixing this situation by leveraging a new optional environment variable that points to a location that is accessible by both the server and the client running the tests. This would typically be an S3 bucket, for example, but could also be a shared volume mounted on both machines at the same mount point. If this environment variable is not present, then local execution is assumed and a temporary directory managed by JUnit is used instead.

* build-logic: exclude generated code from errorprone (apache#1035)

* build-logic: let javadoc depend on jandex (apache#1034)

* Add Prashant in the collaborators list (apache#1064)

* main: Update dependency org.slf4j:slf4j-api to v2.0.17 (apache#1065)

---------

Co-authored-by: JB Onofré <jbonofre@apache.org>
Co-authored-by: Dennis Huo <7410123+dennishuo@users.noreply.github.com>
Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com>
Co-authored-by: Dennis Huo <dennis.huo+oss@snowflake.com>
Co-authored-by: Alexandre Dutra <adutra@users.noreply.github.com>
Co-authored-by: MonkeyCanCode <yongzheng0809@gmail.com>
Co-authored-by: Andrew Guterman <andrew.guterman1@gmail.com>
Co-authored-by: Prashant Singh <35593236+singhpk234@users.noreply.github.com>
Co-authored-by: Robert Stupp <snazy@snazy.de>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants