
add blobaccess for Spanner/GCS #194

Open
wants to merge 6 commits into base: master

Conversation

@srago commented Mar 10, 2024

A blobaccess implementation for Spanner/GCS. Spanner holds metadata and inlined small objects for blobs. Larger blobs are written to GCS instead of being inlined.

@EdSchouten (Member) left a comment:

It looks like you're using Spanner as a flat key-value data store, where the CAS and AC are represented as two completely disjoint tables. As far as I know, Spanner is supposed to be a relational database. Why aren't we using any of its relational capabilities?

@@ -194,6 +194,9 @@ message BlobAccessConfiguration {

// Refer to a BlobAccess object declared through 'with_labels'.
string label = 27;

// Actually a hybrid of spanner and GCS.
@EdSchouten (Member):

If we start off with naming something A, while needing to put a label above it that says: "Actually a B", then we should have simply named it B. Please call this backend SpannerGCSBlobAccess.

@srago (Author):

Will do.

}
}

message SpannerBlobAccessConfiguration {
@EdSchouten (Member):

Even though we currently don't have a lot of meaningful client options for GCP, could you please add pkg/proto/configuration/cloud/gcp/gcp.proto's ClientOptionsConfiguration here as well? Just grep for gcp.NewClientOptionsFromConfiguration for an example on how to use it.
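
As an illustration, a minimal sketch of how those options might be threaded through to the GCS client. Hedged: the ClientOptions field name and the exact signature of gcp.NewClientOptionsFromConfiguration (assumed here to return ([]option.ClientOption, error)) are assumptions, not confirmed by this PR:

// Hypothetical wiring; pb is the configuration proto package for this backend.
func newGCSClient(ctx context.Context, configuration *pb.SpannerBlobAccessConfiguration) (*storage.Client, error) {
	// Assumed signature: returns ([]option.ClientOption, error).
	clientOptions, err := gcp.NewClientOptionsFromConfiguration(configuration.ClientOptions)
	if err != nil {
		return nil, err
	}
	// Forward the options when constructing the GCS client.
	return storage.NewClient(ctx, clientOptions...)
}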

Comment on lines 318 to 329
message StorageType {
enum Value {
// Invalid value.
UNKNOWN = 0;

// Action Cache.
ACTION_CACHE = 1;

// Content-Addressable Store.
CASTORE = 2;
}
}
@EdSchouten (Member):

Having such an enumeration should not be necessary. The storage type is already implied. Please take a look at BlobAccessCreator.

@srago (Author):

I'll look into that interface. It didn't exist in the older code base we started our project with.

StorageType.Value storage_type = 3;

// Expiration time in days. The default is 7 days.
uint64 expiration_days = 4;
@EdSchouten (Member):

Please use google.protobuf.Duration.

@srago (Author):

It's a little more general than we need, but I'll look into it.

@EdSchouten (Member):

Sure. I can imagine that nobody is interested in setting expirations in nanoseconds. But the advantage of using types like these is that they are well typed. Furthermore, it becomes self-documenting, meaning we don't need to use suffixes like _days.

@srago (Author):

You do realize that the GCS and Spanner expiration times are measured in days, right?

@EdSchouten (Member):

Yes. And the last time I checked, days are a duration.

@srago (Author):

Of course. But specifying the number of days as a Duration means that users have to convert the days into seconds to configure the expiration time. Then the driver needs to convert them back into days and have some policy for handling invalid values. If the user specifies 3 hours, does the driver fail that or does it round up to 1 day? I've already started working on these changes, but it feels like the wrong approach to me.

@EdSchouten (Member) commented Mar 18, 2024:

The fact that our existing config file parser/protojson only accepts the '1s' syntax for seconds is a limitation of those systems specifically. It should not be a reason against using google.protobuf.Duration. For all we know, the user wants to generate a config file using Python, and there they'd use datetime.timedelta(days=14) in combination with google.protobuf.duration_pb2.Duration.FromTimedelta().

The Pkl configuration language actually has a native data type for durations. There you can just write 14.d to get a literal that corresponds to 14 days. Converting that to "1209600s" to appease the protojson parser would happen automatically as part of rendering.

Rounding it up to full days seems like the most sensible choice to me.
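
A minimal sketch of that conversion, assuming the field becomes a google.protobuf.Duration (the helper name is hypothetical):

import (
	"math"
	"time"

	"google.golang.org/protobuf/types/known/durationpb"
)

// expirationDays converts the configured duration into whole days, rounding
// partial days up, so that e.g. "3h" becomes 1 day rather than an error.
func expirationDays(expiration *durationpb.Duration) int64 {
	d := expiration.AsDuration()
	return int64(math.Ceil(float64(d) / float64(24*time.Hour)))
}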

Comment on lines 13 to 16
// a time incurs too much overhead in spanner, so we perform bulk operations. The downside of
// this approach is that we can lose reference updates if the servers reboot while updates are
// still queued. Worst case, these objects will be evicted and need to be rebuilt the next time
// they are needed.
@EdSchouten (Member):

I don't think this tradeoff is acceptable.

@srago (Author):

Performance is unacceptable without this (databases make poor file systems). Potentially rebuilding objects that get evicted earlier than they otherwise would is an acceptable tradeoff, because it still results in correct operation. However, I'm open to ideas on how we can mitigate this.
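
As an illustration of the batching in question, a minimal sketch (all names hypothetical; the in-memory queue and its locking are elided):

import (
	"context"

	"cloud.google.com/go/spanner"
)

// flushPendingUpdates applies all queued reference updates in one round trip,
// instead of one Apply() per update. Updates still queued when the server dies
// are lost; affected objects expire early and get rebuilt, which is the
// tradeoff discussed above.
func flushPendingUpdates(ctx context.Context, client *spanner.Client, pending []*spanner.Mutation) error {
	if len(pending) == 0 {
		return nil
	}
	_, err := client.Apply(ctx, pending)
	return err
}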

Comment on lines 378 to 385
func (ba *spannerBlobAccess) keyToBlobSize(key string) (int64, error) {
s := strings.Split(key, "-")
sz, err := strconv.ParseInt(s[1], 10, 64)
if err != nil {
return -1, err
}
return sz, nil
}
@EdSchouten (Member):

How does this hold for the Action Cache? In the case of the Action Cache, the action digest pertains to the size of the Action message, not of the ActionResult being stored.

@srago (Author):

Oh, nice catch. It's basically dumb luck that this doesn't cause problems, because action cache entries and action messages are small, so we end up using Spanner. The only place I validate the buffer size is in the CAS.

@srago (Author):

I rewrote the code to use the knowledge of where the blobs live. On the write side, we use the size to determine where the blob should be stored. On the read side, we use the existence of InlineData in the Spanner record to determine where the blob lives. I have an MR out to (optionally) write small blobs to GCS also so that they can be accessed from GCS-FUSE. As soon as I get verification that it works, I can send a PR upstream.
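
A minimal sketch of that read path (table, column, and struct field names are hypothetical; only InlineData is taken from the discussion above):

import (
	"context"
	"io"

	"cloud.google.com/go/spanner"
)

// readBlob serves small blobs straight from Spanner and falls back to GCS
// whenever the row carries no inline data.
func (ba *spannerBlobAccess) readBlob(ctx context.Context, key string) ([]byte, error) {
	row, err := ba.spannerClient.Single().ReadRow(ctx, "Blobs", spanner.Key{key}, []string{"InlineData"})
	if err != nil {
		return nil, err
	}
	var inlineData []byte // remains nil when the column is NULL
	if err := row.Column(0, &inlineData); err != nil {
		return nil, err
	}
	if inlineData != nil {
		return inlineData, nil
	}
	// Large blob: fetch from GCS under the same key.
	r, err := ba.storageClient.Bucket(ba.bucketName).Object(key).NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}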

storageClient, err := storage.NewClient(ctx)
if err != nil {
spannerClient.Close()
log.Printf("Can't create GCS client: %v", err)
@EdSchouten (Member):

Either log errors or propagate them, but not both.

@srago (Author):

That only works if the caller receiving the error logs/prints it. In this case, it's enabling us to debug start-up problems by simply looking at the k8s logs.

@EdSchouten (Member):

You can also install things like middleware for gRPC to do the logging. That way we don't need to plaster our codebase with logging statements.
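
For example, a minimal unary server interceptor (a sketch, not code from this PR) that logs failed RPCs in one place:

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// loggingUnaryInterceptor logs every RPC that returns an error, so individual
// call sites don't need their own log statements.
func loggingUnaryInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	resp, err := handler(ctx, req)
	if err != nil {
		log.Printf("%s failed: %v", info.FullMethod, err)
	}
	return resp, err
}

// Installed once when the server is constructed:
// server := grpc.NewServer(grpc.UnaryInterceptor(loggingUnaryInterceptor))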

Comment on lines 531 to 552
func (ba *spannerBlobAccess) GetCapabilities(ctx context.Context, instanceName digest.InstanceName) (*remoteexecution.ServerCapabilities, error) {
if ba.storageType == pb.StorageType_ACTION_CACHE {
return &remoteexecution.ServerCapabilities{
CacheCapabilities: &remoteexecution.CacheCapabilities{
ActionCacheUpdateCapabilities: &remoteexecution.ActionCacheUpdateCapabilities{
UpdateEnabled: true,
},
SymlinkAbsolutePathStrategy: remoteexecution.SymlinkAbsolutePathStrategy_ALLOWED,
},
}, nil
} else if ba.storageType == pb.StorageType_CASTORE {
return &remoteexecution.ServerCapabilities{
CacheCapabilities: &remoteexecution.CacheCapabilities{
DigestFunctions: digest.SupportedDigestFunctions,
},
}, nil
} else {
// Code in pkg/blobaccess/configuration/new_blob_access.go should prevent this, but
// since that lives in a separate place, let's be sure.
panic("Invalid Spanner storage Type configured")
}
}
@EdSchouten (Member):

Any reason this can't leverage capabilities.Provider?

@srago (Author) commented Mar 10, 2024:

I was getting weird results in the capabilities result message, so I rolled my own. I'll try to get you an example.

@srago (Author):

The anomaly occurs regardless of whether I use capabilities.Provider or roll my own code:

Capabilities for gs.com received:
cache_capabilities:{digest_functions:MD5 digest_functions:SHA1 digest_functions:SHA256 digest_functions:8 digest_functions:SHA384 digest_functions:SHA512 action_cache_update_capabilities:{} symlink_absolute_path_strategy:ALLOWED} deprecated_api_version:{major:2} low_api_version:{major:2} high_api_version:{major:2 minor:3}

Capabilities for gs.com/unreviewed received:
cache_capabilities:{digest_functions:MD5 digest_functions:SHA1 digest_functions:SHA256 digest_functions:8 digest_functions:SHA384 digest_functions:SHA512 action_cache_update_capabilities:{update_enabled:true} symlink_absolute_path_strategy:ALLOWED} execution_capabilities:{digest_function:SHA256 exec_enabled:true execution_priority_capabilities:{priorities:{min_priority:-2147483648 max_priority:2147483647}} 5:"\x03\x02\x01\x08\x05\x06"} deprecated_api_version:{major:2} low_api_version:{major:2} high_api_version:{major:2 minor:3}

Notice the 5:"\x03\x02\x01\x08\x05\x06" that occurs in the execution_capabilities in the second case.

@EdSchouten (Member):

Whatever tool you used to dump these messages, make sure it uses an up-to-date version of remote_execution.proto. Once you do, you'll notice that field 5 in ExecutionCapabilities is actually digest_functions:

bazelbuild/remote-apis@64cc5e9

And what Buildbarn does here is expected. Namely, it reports the list of digest functions that are supported by the scheduler.

@srago (Author):

Yep, rebuilding the tool prints the hash functions. I hadn't realized the remote APIs had changed that much.

Comment on lines 698 to 709
statedSize, err := ba.keyToBlobSize(key)
if err != nil {
spannerMalformedKeyCount.Inc()
log.Printf("Put Blob: invalid digest size, key %s: %v", key, err)
b.Discard()
return err
}
if statedSize != size {
log.Printf("Put Blob: MISMATCH between buffer size (%d) and digest size (%d)", size, statedSize)
b.Discard()
return fmt.Errorf("Invalid contents digest %s size %d", digest, size)
}
@EdSchouten (Member):

Do we really need code like this? pkg/blobstore/buffer already has logic for validating data integrity.

@srago (Author):

Removed.

}

start := time.Now()
_, err = ba.spannerClient.Apply(context.Background(), []*spanner.Mutation{insertMut})
@EdSchouten (Member):

Why use context.Background() here?

@srago (Author):

This one should be ctx. I use context.Background when I want an action to proceed unaffected even if the incoming context gets cancelled.
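
Where detaching from cancellation is genuinely intended, a sketch using Go 1.21's context.WithoutCancel keeps request-scoped values while dropping cancellation, which is usually preferable to a bare context.Background():

// The derived context is never cancelled, but still carries ctx's values.
_, err = ba.spannerClient.Apply(context.WithoutCancel(ctx), []*spanner.Mutation{insertMut})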

@srago (Author) commented Mar 10, 2024:

> It looks like you're using Spanner as a flat key-value data store, where the CAS and AC are represented as two completely disjoint tables. As far as I know, Spanner is supposed to be a relational database. Why aren't we using any of its relational capabilities?

We have one table that is shared between the AC and CAS. The driver will behave differently when accessing the table for AC-based operations versus CAS-based operations.

@EdSchouten (Member) commented Mar 10, 2024:

> We have one table that is shared between the AC and CAS. The driver will behave differently when accessing the table for AC-based operations versus CAS-based operations.

Sure, I get that. But what I'm saying is that the references between ActionResult messages and objects stored in the CAS could also be expressed as a relation (foreign key) between two tables in a database. If we did that, then we could most likely eliminate CompletenessCheckingBlobAccess for these kinds of setups. AC entries could be dropped automatically when associated CAS objects expire.

@srago (Author) commented Mar 17, 2024:

> > We have one table that is shared between the AC and CAS. The driver will behave differently when accessing the table for AC-based operations versus CAS-based operations.

> Sure, I get that. But what I'm saying is that the references between ActionResult messages and objects stored in the CAS could also be expressed as a relation (foreign key) between two tables in a database. If we did that, then we could most likely eliminate CompletenessCheckingBlobAccess for these kinds of setups. AC entries could be dropped automatically when associated CAS objects expire.

That isn't immediately obvious to me. We should discuss...

@srago requested a review from EdSchouten on March 20, 2024.