Add legacyColumnMode configuration #394

Merged
merged 8 commits into v2 from full-legacy on Feb 4, 2025
Conversation

oguzhanunlu
Contributor

@oguzhanunlu oguzhanunlu commented Jan 28, 2025

This commit introduces a new feature flag `legacyColumnMode`, which changes loading behavior so that all events are loaded to legacy columns regardless of the `legacyColumns` configuration. Set it to `true` to enable the legacy behavior.
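A configuration sketch, assuming the loader reads HOCON and the new flag sits alongside the existing `legacyColumns` setting (the exact path and surrounding keys are not shown in this thread, so the structure below is illustrative only):

```hocon
{
  # Hypothetical placement: both key names come from this PR,
  # but the surrounding structure is a guess.
  "legacyColumns": []        # per-schema legacy behavior (ignored when the flag below is true)
  "legacyColumnMode": true   # load ALL events to legacy columns
}
```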

ref: PDP-1489

@oguzhanunlu oguzhanunlu self-assigned this Jan 28, 2025
```scala
  map1 <- v2Transform
  map2 <- legacyTransform
} yield event -> (map1 ++ map2)
if (legacyColumnMode) LegacyColumns.transformEvent(badProcessor, event, legacyEntities).map(event -> _)
```
Contributor

This line doesn't look right. I might be wrong... but are you sure you're loading the atomic fields, like `event_id` and `collector_tstamp` etc.?

Contributor Author

I'll add a test for that

This commit introduces a new feature flag `legacyColumnMode`, changing loading behavior such that all events are loaded to legacy columns regardless of `legacyColumns` configuration. Set it to true to enable legacy behavior.
Comment on lines 202 to 203
```scala
def alter1_legacy      = alter1_base(legacyColumns = true, timeout = true, legacyColumnMode = false)
def alter1_full_legacy = alter1_base(legacyColumns = true, timeout = true, legacyColumnMode = true)
```
Contributor

What is the difference between `alter1_legacy` and `alter1_full_legacy`? It looks like you are testing exactly the same thing twice.

Contributor Author

Yeah, it doesn't have a meaning as it is. I'll remove it.

Comment on lines 278 to 279
```scala
def alter2_legacy      = alter2_base(legacyColumns = true, timeout = true, legacyColumnMode = false)
def alter2_full_legacy = alter2_base(legacyColumns = true, timeout = true, legacyColumnMode = true)
```
Contributor

Same as above. `alter2_legacy` and `alter2_full_legacy` test exactly the same thing twice.

Comment on lines 42 to 43
```scala
case class WroteNRowsToBigQuery(n: Int) extends Action
case class WroteRowsToBigQuery(rows: Iterable[Map[String, AnyRef]]) extends Action
```
Contributor

I see why you did this. But it doesn't feel right to have two different `Action` classes representing the same action. And you introduced lots of boolean flags (`recordRows = false`) throughout the specs.

Here's an alternative idea. Don't change the Action classes. Leave it like it was before, like:

```scala
case class WroteRowsToBigQuery(rowCount: Int) extends Action
```

And instead, change the class `MockEnvironment` so it captures the actions AND the event content:

```scala
case class State(actions: Vector[Action], writtenToBQ: Iterable[Map[String, AnyRef]])
case class MockEnvironment(state: Ref[IO, State], environment: Environment[IO])
```

Then any of the specs can (if they want to) check the event content.

You might even find other existing specs that benefit from doing a check of the event content.
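A minimal, dependency-free sketch of this suggestion. The real code keeps the state in cats-effect's `Ref[IO, State]`; a plain mutable field stands in for it here so the example is self-contained, and all names other than `State`, `MockEnvironment`, and `WroteRowsToBigQuery` are hypothetical:

```scala
sealed trait Action
case class WroteRowsToBigQuery(rowCount: Int) extends Action

// Capture the actions AND the written row content in one state value.
case class State(actions: Vector[Action], writtenToBQ: Vector[Map[String, AnyRef]])

// Stand-in for the real MockEnvironment, which wraps the state in Ref[IO, State].
final class MockEnvironment(var state: State) {
  def writeRows(rows: Vector[Map[String, AnyRef]]): Unit =
    state = State(
      state.actions :+ WroteRowsToBigQuery(rows.size),
      state.writtenToBQ ++ rows
    )
}

object Demo {
  def main(args: Array[String]): Unit = {
    val env = new MockEnvironment(State(Vector.empty, Vector.empty))
    env.writeRows(Vector(Map("event_id" -> "abc-123")))
    // Specs can now assert on the recorded actions and, when useful, the row content:
    assert(env.state.actions == Vector(WroteRowsToBigQuery(1)))
    assert(env.state.writtenToBQ.head.get("event_id").contains("abc-123"))
    println("ok")
  }
}
```

The point of the design is that every spec records the same single `Action`, and only the specs that care about event content need to look at `writtenToBQ`.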

@oguzhanunlu oguzhanunlu requested a review from istreeter February 3, 2025 14:17

```scala
import scala.concurrent.duration.{DurationInt, FiniteDuration}

case class MockEnvironment(state: Ref[IO, Vector[MockEnvironment.Action]], environment: Environment[IO])
case class State(actions: Vector[Action], writtenToBQ: Iterable[Map[String, AnyRef]])
```
Contributor

My personal style preference is to define the `State` class within the `MockEnvironment` object, just because I don't like having a top-level class defined in a file whose name does not match the class name.

This is nitpicking though, sorry!

Comment on lines 257 to 258
```scala
(failures must beEmpty) and
  (fields must contain(expected))
```
Contributor

In all of the other "ue" tests, we have an assertion like `fields must haveSize(1)`. Should we have the same here?

Same comment too for the `c6` test you added in this file.

```scala
} yield {
  val rows = state.writtenToBQ
  (rows.size shouldEqual inputs.head.events.size) and
    (rows.head.get("event_id") should beEqualTo(Option(eventID.toString))) and
```
Contributor

specs2 has some nice syntax for working with options:

```scala
rows.head.get("event_id") should beSome(eventID.toString)
```

It also has nice syntax for working with maps:

```scala
rows.head should havePair("event_id" -> eventID.toString)
```

```scala
  TestControl.executeEmbed(io)
}

def e14_base(legacyColumnMode: Boolean) = {
```
Contributor

My criticism of this test is that it is too different from all the other tests in this file, and I don't see why it needs to be different. In all other tests, the thing we test is: what is the result of calling `Processing.stream(environment)` with different inputs?

`Processing.stream()` is the public method, and we require that it has the correct end-to-end behaviour. Whereas the method `Processing.resolveV2NonAtomicFields` is private, and therefore an implementation detail of `Processing.stream`.

The two behaviours you are trying to test here are:

- If `legacyColumnMode` is enabled, do we load those legacy columns into BigQuery?
- If `legacyColumnMode` is not enabled, do we load v2-style columns into BigQuery?

I am fairly sure you can test those two behaviours, while being consistent with the pattern of other tests in this file.
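The principle the reviewer is describing, in miniature: exercise the public entry point under both flag values instead of reaching into private helpers. A toy sketch with hypothetical names (this is not the loader's real API):

```scala
object Pipeline {
  // Private helpers: implementation details, analogous to resolveV2NonAtomicFields.
  private def legacyName(field: String): String = "legacy_" + field
  private def v2Name(field: String): String     = "v2_" + field

  // Public entry point, standing in for Processing.stream(environment).
  def transform(field: String, legacyColumnMode: Boolean): String =
    if (legacyColumnMode) legacyName(field) else v2Name(field)
}

object PipelineDemo {
  def main(args: Array[String]): Unit = {
    // Both behaviours are observable through the public method alone,
    // so no test ever needs to widen the visibility of the helpers.
    assert(Pipeline.transform("geo", legacyColumnMode = true) == "legacy_geo")
    assert(Pipeline.transform("geo", legacyColumnMode = false) == "v2_geo")
    println("ok")
  }
}
```

Testing only through the public surface keeps the private helpers free to change without breaking specs.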

Contributor Author

Thanks for the feedback! I wanted to test the change in isolation, fearing that testing the whole pipeline could miss it in the future, if not today. I guess it was unnecessary. I can certainly stay consistent with the other tests.

@oguzhanunlu oguzhanunlu requested a review from istreeter February 4, 2025 13:23
```scala
  _ <- Processing.stream(control.environment).compile.drain
  state <- control.state.get
} yield state.actions should beEqualTo(
  Vector(
```
Contributor

This is a nice improvement compared to when I reviewed it yesterday.

I think there is one tiny space for improvement... in this spec, this is currently the important line covering the new feature you added:

```scala
Action.AlterTableAddedColumns(Vector(expectedColumnName)),
```

But that's it. In terms of testing the feature you added, currently you are just testing that it tried to alter the table with the expected column.

There is an opportunity to test more: you could also check `state.writtenToBQ`, which will tell you whether it transformed the data in the expected way.

(Separately, I would like to amend some of the older tests to also check `state.writtenToBQ`, but that is beyond the scope of this PR.)

@@ -621,19 +632,117 @@ class ProcessingSpec extends Specification with CatsEffect {

```scala
def e12 = e12Base(legacyColumns = false)
def e12Legacy = e12Base(legacyColumns = true)

def e13 = {
```
Contributor

The title of this test, taken from above, is:

> Use legacy columns for all fields when `legacyColumnMode` is enabled $e13

Do you think that is a good description of what is actually tested in the implementation? It looks like you are just testing that various atomic fields get set. It's a nice check... but the loader should pass this test whether or not `legacyColumnMode` is enabled.

Contributor Author

Yeah, I'll get rid of `e13`, as `e14` and `e15` test the feature.

@@ -174,6 +175,13 @@ object Processing {

```scala
  )
}

private[processing] def resolveV2NonAtomicFields[F[_]: Async: RegistryLookup](
```
Contributor

I think this can be fully `private`, not `private[processing]`.

@istreeter istreeter self-requested a review February 4, 2025 13:45
@istreeter
Contributor

I hit "approve" by accident.

Comment on lines +695 to +697
```scala
)) and
  (state.writtenToBQ.head should haveKey(expectedColumnName))
}
```
Contributor

Excellent. Really nice improvement to the machinery of this test suite.

@oguzhanunlu oguzhanunlu merged commit 09d7255 into v2 Feb 4, 2025
2 checks passed
@oguzhanunlu oguzhanunlu deleted the full-legacy branch February 4, 2025 15:15
istreeter added a commit that referenced this pull request Feb 6, 2025
A few of the tests in `ProcessingSpec` had recently been disabled by
replacing the real test with `TestControl.executeEmbed(io.timeout(10.seconds))`.
This PR fixes and re-enables the test, so we have better coverage of
when legacy modes are enabled.

Also, I make more use of the new ability to test `writtenToBQ` which was
added in #394. This means we have more test coverage that the expected
fields get loaded.
istreeter added a commit that referenced this pull request Feb 7, 2025
benjben pushed a commit that referenced this pull request Feb 14, 2025
benjben pushed a commit that referenced this pull request Feb 14, 2025
benjben pushed a commit that referenced this pull request Feb 14, 2025
benjben pushed a commit that referenced this pull request Feb 14, 2025
benjben pushed a commit that referenced this pull request Feb 17, 2025
benjben pushed a commit that referenced this pull request Feb 17, 2025
benjben pushed a commit that referenced this pull request Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379 #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader
benjben pushed a commit that referenced this pull request Feb 17, 2025