refactor: improve manifest scanning organization and concurrency #252

iskakaushik · 2025-01-12T21:09:04Z

This refactor breaks the manifest scanning logic into more focused components, in preparation for adding incremental scanners that can read changelogs and diffs between two snapshots.

Key changes include:

- Add manifestEntries type to safely collect data and delete entries concurrently
- Split manifest handling into separate fetchPartitionSpecFilteredManifests and collectManifestEntries functions for better separation of concerns
- Replace manual goroutine management with errgroup for more robust concurrency
- Add documentation comments explaining the manifest scanning process

This is a step toward adding a ManifestGroup abstraction similar to the Java implementation that can be shared among different scanner types.
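For readers unfamiliar with the pattern, here is a minimal sketch of a mutex-guarded collector along the lines of the manifestEntries type described above. The method names addDataEntry and addPositionalDeleteEntry match the diff under review; everything else (the entry stand-in type, the field layout, the collect helper) is an assumption made for illustration and is not taken from the PR. To stay within the standard library, the sketch waits on a sync.WaitGroup rather than the errgroup the PR uses.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a hypothetical stand-in for a manifest entry; the real type
// comes from the iceberg package.
type entry struct {
	path string
}

// manifestEntries collects data and positional-delete entries from many
// goroutines. The single mutex guards both slices (assumed layout).
type manifestEntries struct {
	mu                      sync.Mutex
	dataEntries             []entry
	positionalDeleteEntries []entry
}

func (m *manifestEntries) addDataEntry(e entry) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.dataEntries = append(m.dataEntries, e)
}

func (m *manifestEntries) addPositionalDeleteEntry(e entry) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.positionalDeleteEntries = append(m.positionalDeleteEntries, e)
}

// collect is a hypothetical driver: it starts one goroutine per simulated
// manifest, and each goroutine appends to the shared collector.
func collect(numManifests int) *manifestEntries {
	entries := &manifestEntries{}
	var wg sync.WaitGroup
	for i := 0; i < numManifests; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			entries.addDataEntry(entry{path: fmt.Sprintf("data-%d.parquet", i)})
			// Every second simulated manifest also carries a delete file.
			if i%2 == 0 {
				entries.addPositionalDeleteEntry(entry{path: fmt.Sprintf("delete-%d.parquet", i)})
			}
		}(i)
	}
	wg.Wait()
	return entries
}

func main() {
	entries := collect(100)
	fmt.Println(len(entries.dataEntries), len(entries.positionalDeleteEntries))
}
```

Without the mutex, the concurrent append calls would race on the slice headers. Keeping the lock inside the add methods means callers cannot forget to take it.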


zeroshade · 2025-01-13T21:55:05Z

table/scanner.go

+		g.Go(func() error {
+			partEval := partitionEvaluators.Get(int(mf.PartitionSpecID()))
+			manifestEntries, err := openManifest(scan.io, mf, partEval, metricsEval)
+			if err != nil {
+				return err
 			}

-			for _, e := range entries {
+			for _, e := range manifestEntries {
 				df := e.DataFile()
 				switch df.ContentType() {
 				case iceberg.EntryContentData:
-					dataEntries = append(dataEntries, e)
+					entries.addDataEntry(e)
 				case iceberg.EntryContentPosDeletes:
-					positionalDeleteEntries = append(positionalDeleteEntries, e)
+					entries.addPositionalDeleteEntry(e)
 				case iceberg.EntryContentEqDeletes:
-					return nil, fmt.Errorf("iceberg-go does not yet support equality deletes")
+					return fmt.Errorf("iceberg-go does not yet support equality deletes")
 				default:
-					return nil, fmt.Errorf("%w: unknown DataFileContent type (%s): %s",
+					return fmt.Errorf("%w: unknown DataFileContent type (%s): %s",
 						ErrInvalidMetadata, df.ContentType(), e)
 				}
 			}
-		}
+			return nil
+		})
+	}


So we're switching from fanning out through a channel, with goroutines reading from that channel, to spawning a goroutine for each manifest.

Is there any particular benefit/reason for that change beyond the simplified code?

iskakaushik · 2025-01-13

No, mostly simplified code. For context, I am working on adding a couple more scanners and trying to build an abstraction that would make that easy.

zeroshade · 2025-01-13

Fair enough. This seems reasonable to me and is unlikely to cause any issues, so I think we can move forward with this refactor. It might be worthwhile to add some benchmarks that track planning performance across various numbers of manifests and manifest entries, so that we can keep an eye on it in the future.

Not something that we need for this particular change, but definitely something to look into.
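As a rough illustration of what such tracking could compare, the sketch below times the two concurrency shapes discussed in this thread: a fixed set of workers reading manifest IDs from a channel, and one goroutine per manifest. All names here (scanManifest, planWithWorkerPool, planPerManifest, timeIt) and the worker count are hypothetical and do not come from iceberg-go, and scanManifest only simulates work. In the repository itself, this would more naturally be written as testing.B benchmarks run with go test -bench.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// scanManifest is a hypothetical stand-in for opening one manifest; it
// returns the number of entries it "read".
func scanManifest(id, entriesPerManifest int) int {
	n := 0
	for i := 0; i < entriesPerManifest; i++ {
		n++
	}
	return n
}

// planWithWorkerPool fans manifest IDs out over a channel to a fixed
// number of worker goroutines.
func planWithWorkerPool(numManifests, entriesPerManifest, workers int) int64 {
	var total atomic.Int64
	ids := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids {
				total.Add(int64(scanManifest(id, entriesPerManifest)))
			}
		}()
	}
	for id := 0; id < numManifests; id++ {
		ids <- id
	}
	close(ids)
	wg.Wait()
	return total.Load()
}

// planPerManifest starts one goroutine for each manifest.
func planPerManifest(numManifests, entriesPerManifest int) int64 {
	var total atomic.Int64
	var wg sync.WaitGroup
	for id := 0; id < numManifests; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			total.Add(int64(scanManifest(id, entriesPerManifest)))
		}(id)
	}
	wg.Wait()
	return total.Load()
}

// timeIt returns the average wall-clock duration of one call to plan.
func timeIt(iterations int, plan func()) time.Duration {
	start := time.Now()
	for i := 0; i < iterations; i++ {
		plan()
	}
	return time.Since(start) / time.Duration(iterations)
}

func main() {
	const entriesPerManifest = 100
	for _, n := range []int{10, 100, 1000} {
		pool := timeIt(50, func() { planWithWorkerPool(n, entriesPerManifest, 8) })
		perManifest := timeIt(50, func() { planPerManifest(n, entriesPerManifest) })
		fmt.Printf("manifests=%d worker-pool=%v per-manifest=%v\n", n, pool, perManifest)
	}
}
```

The numbers from a simulation like this say nothing about real planning cost, which is dominated by I/O when opening manifests. The point is only the shape of the comparison across manifest counts.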

zeroshade

LGTM Thanks!

zeroshade reviewed Jan 13, 2025


zeroshade approved these changes Jan 13, 2025


zeroshade merged commit 0b6596c into apache:main Jan 13, 2025
10 checks passed
