Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Remove line packing from tiles, slightly improve tile culling #845

Merged
merged 3 commits into from
Mar 11, 2025

Conversation

tomcur
Copy link
Member

@tomcur tomcur commented Mar 10, 2025

This reduces the tile memory footprint and drops the requirement for tile x-coordinates to be signed. We may be able to drop the TileIndex indirection soon.

Coarse winding calculation is performed in tile generation, like it more-or-less was before. In principle it requires just a single bit in the tile packing. Calculating the coarse winding could be performed in the strip generation as well, but I suspect it to be slightly more performant this way. I haven't accurately measured that.

This is one step towards what Raph mentioned. Quoting:

pack x, y, line id, and winding number delta into a single u32 and sort that. This is a very tight packing with u32 and may overflow in some cases (very large number of lines per path), but I think should be very efficient. Then the determination of p0 and p1 within the tile happens after sorting. A major motivation for this change is that it can be done on GPU in hybrid modes, but the approach to parallelism also works with SIMD, I think.

Tiles above, to the right and below the viewport are now culled. Tiles to the left of the viewport are clamped but not culled yet. Currently strip generation has some code special-casing those tiles.

On my machine, paris-30k measures with #835 as the "before" and after the changes here as the following. (With 20 iterations, appears to be accurate to about ~0.2ms, but we should think about better benchmarking.)

Tile generation
    old: 24.394854ms
    new: 17.620834ms
Tile sorting
    old: 8.922622ms
    new: 8.146455ms
Strip generation
    old: 33.187645ms
    new: 31.825052ms

Comment on lines +81 to +87
let line = lines[tile.line_idx as usize];
let tile_left_x = tile.x as f32 * Tile::WIDTH as f32;
let tile_top_y = tile.y as f32 * Tile::HEIGHT as f32;
let p0_x = line.p0.x - tile_left_x;
let p0_y = line.p0.y - tile_top_y;
let p1_x = line.p1.x - tile_left_x;
let p1_y = line.p1.y - tile_top_y;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With these calculations placed at the front of the loop, strip generation measures at ~1-2% faster for paris-30k on my machine.

Comment on lines +21 to +25
/// Whether the line crosses the top edge of the tile.
///
/// Lines making this crossing increment or decrement the coarse tile winding, depending on the
/// line direction.
pub winding: bool,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can become a single bit in a tight packing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, we had a delta which can be -1, 0, 1. Am I correct in the assumption that winding == true corresponds to either -1/1, the exact delta being determined after the sorting?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, false corresponds to 0, and true corresponds to -1 if the line is oriented down and 1 if the line is oriented up. In this PR the sign is resolved during strip generation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, that's a neat trick, I love it. I was thinking along similar lines for the MSAA winding number accumulation in the sparse strip design (see doc).


/// A tile represents an aligned area on the pixmap, used to subdivide the viewport into sub-areas
/// (currently 4x4) and analyze line intersections inside each such area.
///
/// Keep in mind that it is possible to have multiple tiles with the same index,
/// namely if we have multiple lines crossing the same 4x4 area!
#[derive(Debug, Clone)]
#[derive(Debug, Clone, Copy)]
pub struct Tile {
/// The index of the tile in the x direction.
pub x: i32,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should become u16, probably as part of this PR, which I'd prefer doing as a separate commit (many non-interesting changes will follow from that). With this PR currently being based on top of another PR, I'll hold off on that for a bit to keep things clear.

Copy link
Contributor

@LaurenzV LaurenzV Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end goal is that a Tile can be represented by just u32, right? This would mean that we need to first check whether any of the components can overflow, and if so use a "long code path" that doesn't contain this tight packing, right? In practice this would hopefully never be necessary unless there are a lot of lines in a single path or x/y are very big, but for correctness it would still be necessary to have.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is best answered by @raphlinus: I had the same question.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you'll overflow the u32 packing if the line count exceeds the space available, which I think will be rare, but there would need to be a fallback as it's not impossible.

I guess a major empirical question is whether sorting of u32's is actually much faster than u64's. If not, then it's probably better to avoid the complexity of having two code paths. This question is potentially going to depend on a lot of things, like the architecture it's running on and whether we try to do alternative sorting strategies like radix sort.

@tomcur tomcur requested review from raphlinus and LaurenzV March 10, 2025 15:05
@tomcur tomcur changed the title Remove line packing from tiles, slightly improve culling Remove line packing from tiles, slightly improve tile culling Mar 10, 2025
Copy link
Contributor

@LaurenzV LaurenzV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the proper review will wait until the previous PR has been merged for easiniess, but just a couple of (perhaps dumb) questions from my side. Otherwise, the overall direction seems sensible to me!


/// A tile represents an aligned area on the pixmap, used to subdivide the viewport into sub-areas
/// (currently 4x4) and analyze line intersections inside each such area.
///
/// Keep in mind that it is possible to have multiple tiles with the same index,
/// namely if we have multiple lines crossing the same 4x4 area!
#[derive(Debug, Clone)]
#[derive(Debug, Clone, Copy)]
pub struct Tile {
/// The index of the tile in the x direction.
pub x: i32,
Copy link
Contributor

@LaurenzV LaurenzV Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end goal is that a Tile can be represented by just u32, right? This would mean that we need to first check whether any of the components can overflow, and if so use a "long code path" that doesn't contain this tight packing, right? In practice this would hopefully never be necessary unless there are a lot of lines in a single path or x/y are very big, but for correctness it would still be necessary to have.

Comment on lines +21 to +25
/// Whether the line crosses the top edge of the tile.
///
/// Lines making this crossing increment or decrement the coarse tile winding, depending on the
/// line direction.
pub winding: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, we had a delta which can be -1, 0, 1. Am I correct in the assumption that winding == true corresponds to either -1/1, the exact delta being determined after the sorting?

Comment on lines -140 to -178
let nudge_point = |p: Point| -> Point {
// Lines that cross vertical tile boundaries need special treatment during
// anti-aliasing. This case is detected via tile-relative x == 0. However,
// lines can naturally start or end at a multiple of the 4x4 grid, too, but
// these don't constitute crossings. We nudge these points ever so slightly,
// by ensuring that xfrac0 and xfrac1 are always at least 1/8192 of a pixel.
// By doing so, whenever we encounter a point
// at a tile relative 0, we can treat it as an edge crossing. This is somewhat
// of a hack and in theory we should rather solve the underlying issue in the
// strip generation code, but it works for now.
if p.x.fract() == 0.0 {
Point {
x: p.x + SCALED_X_NUDGE_FACTOR,
y: p.y,
}
} else {
p
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this is not necessary anymore due to your new strip generation algorithm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I've attempted to encode as many of the edge-cases into IEEE 754 float semantics as I could (mostly around inf and nan), to have as few branches and instructions as possible. It could perhaps be improved more, but I haven't found that yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment on lines +139 to +141
// TODO: Tiles are clamped to the left edge of the viewport, but lines fully to the left of the
// viewport are not culled yet. These lines impact winding, and would need forwarding of
// winding to the strip generation stage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above you mentioned x can be stored as u16, so how do you distinguish between lines at 0 and lines strictly to the left?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds logic to strip generation to check for left-of-viewport lines. With some plumbing of winding/coverage from earlier pipeline stages into strip generation, those checks are best moved back to tile generation in a follow-up PR. I believe Raph would like to cull some geometry even earlier than tile generation in future changes, which might be able to use some of that same plumbing.

(p1_y, p1_x, p0_y, p0_x)
};

// For ease of logic, special-case purely vertical tiles.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are purely horizontal lines not worth special-casing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are worth special-casing. Horizontal geometry is parallel to the winding scan direction, and so doesn't impact winding. We should evaluate removing them entirely. That does most likely entail adding a sparse alpha mask draw command to the wide tiles.


last_packed = Point::new(packed.x + flip, packed.y);
} else {
let x_slope = (p1_x - p0_x) / (p1_y - p0_y);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this not be NaN for horizontal lines?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but it can be inf. The min and max below take care of that. Admittedly, I haven't verified that too closely since horizontal lines don't impact geometry and can be elided, which could be done shortly after this lands in a follow-up PR.

@tomcur
Copy link
Member Author

tomcur commented Mar 10, 2025

Spot-checking some pixels, the changed tests all appear to just be the rounding of pixel coverage falling differently. I haven't checked too closely, but it's possible the area coverage calculation is by coincidence slightly more symmetrical now, check e.g. the left and right edges of eo_filling_missing_anti_aliasing.png before and after these changes.

Comment on lines 212 to 223
let line = line_buf[tile.line_idx as usize];

// TODO: how to handle line intersections now lines are not explicitly segmented by tile
// generation anymore?
let p0 = Point {
x: line.p0.x - x as f32,
y: line.p0.y - y as f32,
};
let p1 = Point {
x: line.p1.x - x as f32,
y: line.p1.y - y as f32,
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit hacky at the moment to get tests to pass. Lines are no longer segmented per tile explicitly, so this is drawing the same line multiple tiles. @LaurenzV do you have some ideas about this? Segmenting here manually might be a bit misleading, as with the code in this PR that doesn't actually happen internally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with removing the tile intersection points if the underlying algorithm is different now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the intersections, but am opening a follow-up soon after this.

Copy link
Contributor

@raphlinus raphlinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, and the measured time changes justify the change. It's also consistent with changes I proposed for the sorting.

There's definitely more work to be done to get to the end state, but this seems like a good step.

tomcur and others added 2 commits March 11, 2025 11:02
This reduces the tile memory footprint and drops the requirement for
tile x-coordinates to be signed. We may be able to drop the `TileIndex`
indirection soon.

Coarse winding calculation is performed in tile generation, like it
more-or-less was before. In principle it requires just a single bit in
the tile packing. Calculating the coarse winding could be performed in
the strip generation as well, but I suspect it to be slightly more
performant this way. I haven't accurately measured that.

This is along the same lines as what
[Raph mentioned](https://xi.zulipchat.com/#narrow/channel/197075-gpu/topic/CPU.20sparse.20strip.20rendering.20to.20pixels/near/500019583).
Quoting:

> pack x, y, line id, and winding number delta into a single u32 and
> sort that. This is a very tight packing with u32 and may overflow in
> some cases (very large number of lines per path), but I think should be
> very efficient. Then the determination of p0 and p1 within the tile
> happens after sorting. A major motivation for this change is that it can
> be done on GPU in hybrid modes, but the approach to parallelism also
> works with SIMD, I think.

Tiles above, to the right and below the viewport are now culled. Tiles
to the left of the viewport are clamped but not culled yet. Currently
strip generation has some code special-casing those tiles.

On my machine, paris-30k measures before and after the changes as the
following. (With 20 iterations, probably accurate to about ~0.2ms, but
we should think about better benchmarking.)

```
Tile generation
    old: 24.394854ms
    new: 17.620834ms
Tile sorting
    old: 8.922622ms
    new: 8.146455ms
Strip generation
    old: 33.187645ms
    new: 31.825052ms
```
@tomcur tomcur added this pull request to the merge queue Mar 11, 2025
@tomcur
Copy link
Member Author

tomcur commented Mar 11, 2025

To reduce some effort, tile x-coord data type clean up and packing the tile more tightly are best done together in a separate PR, I think.

Merged via the queue into main with commit 4a75ece Mar 11, 2025
17 checks passed
@tomcur tomcur deleted the unpack-lines branch March 11, 2025 11:04
tomcur added a commit that referenced this pull request Mar 11, 2025
#845 removed line-tile
intersections from the debug rendering, as these are no longer present
in the internal representation.

However, that also removed a way to see whether lines are directed
upwards or downwards, which is important to understand their effect on
winding.

This proposes coloring the lines according to their orientation: lines
oriented upwards add to winding and are colored green. Lines oriented
downwards subtract from winding and are colored red. Horizontal lines
don't affect winding, and will with future changes likely not generate
any tiles at all. These are colored grey.
github-merge-queue bot pushed a commit that referenced this pull request Mar 11, 2025
#845 removed line-tile
intersections from the debug rendering, as these are no longer present
in the internal representation.

However, that also removed a way to see whether lines are directed
upwards or downwards, which is important to understand their effect on
winding.

This proposes coloring the lines according to their orientation: lines
oriented upwards add to winding and are colored green. Lines oriented
downwards subtract from winding and are colored red. Horizontal lines
don't affect winding, and will with future changes likely not generate
any tiles at all. These are colored grey.

An example of what that looks like:

```shell
cargo run --bin debug -- --path "M 5 5 L 40 23 L 20 43.5 L 1 43.5 Z"  --stages ls,ta
```


![debug](https://github.com/user-attachments/assets/ee4600a8-eb03-4cef-9796-a86cafec9526)
let h = (ymax - ymin).max(0.);
accumulated_winding[y_idx as usize] += sign * h;

for x_idx in 0..Tile::HEIGHT {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomcur should this have said Tile::WIDTH?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you! Opened #848.

github-merge-queue bot pushed a commit that referenced this pull request Mar 12, 2025
Missed in #845, spotted by
@dominikh
([comment](#845 (comment))).

Co-authored-by: Dominik Honnef <dominik@honnef.co>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants