Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

vello_hybrid implementation #831

Open
wants to merge 59 commits into
base: main
Choose a base branch
from

Conversation

grebmeg
Copy link
Contributor

@grebmeg grebmeg commented Mar 5, 2025

Based on #818: This PR adds a new hybrid sparse strip CPU/GPU renderer (vello_hybrid). The CPU handles path processing and initial geometry setup, while the GPU accelerates compositing and blending operations. The architecture consists of RenderContext for rendering operations, Renderer for GPU resource management, and RenderData container for processed geometry. The implementation supports both windowed and headless rendering, includes various examples, and moves some common functionality to shared modules.

# Render svg to file
cargo run -p vello_hybrid --example render_to_file examples/assets/Ghostscript_Tiger.svg target/Ghostscript_Tiger_VH.png

Ghostscript_Tiger_VH

# Render svg to window
cargo run -p vello_hybrid --example simple
cargo run -p vello_hybrid --example svg examples/assets/Ghostscript_Tiger.svg
image

raphlinus and others added 29 commits March 10, 2025 18:42
This brings in the cpu-sparse prototype from the piet-next branch of the piet repo. No substantive changes, but cpu-sparse is renamed vello_hybrid and piet-next is renamed vello_api.

Quite a bit of editing to satisfy the lint monster.

There was a half-written SIMD implementation of flattening, that's removed. It should be finished and re-added, as it's a good speedup.
Renders a simple scene to the GPU, first by doing coarse rasterization the same as cpu-sparse, then doing a single draw call.
Adds a clip method to the (CPU) render context, plus a considerable amount of mechanism in coarse and fine rasterization to support clipping.

The coarse rasterization logic contains a similar set of optimizations as Vello. In particular, all-zero tiles have drawing suppressed, and all-one tiles pass drawing commands through with no additional work to clip.

Not extensively validated, but it does render a simple scene with clipping correctly.
This reverts commit 248d08d536eb818f21020c753465469741e73d6c.
@grebmeg grebmeg force-pushed the vello-sparse-strips-hybrid-clean-up branch from 25e9650 to fba44cd Compare March 11, 2025 10:51
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is running into gfx-rs/wgpu#6997. I've put a note on the mentioned backport into the wgpu maintainer meeting notes, and it doesn't actually break anything (it only means we don't get a clean shutdown)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! 🙏 I’m wondering which platform you encountered the issue on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue occurs on Wayland.

Comment on lines 215 to 222
let strip = GpuStrip {
x: tile_x as u16,
y: tile_y as u16,
width: WIDE_TILE_WIDTH as u16,
dense_width: 0,
col: 0,
rgba: bg,
};
Copy link
Member

@tomcur tomcur Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter at this stage, but to be robust for the rare case where we're also running on big-endian hosts, these should be u16::from_ne_bytes((tile_x as u16).to_le_bytes()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bringing that up! Would you prefer me to add it now, or wait until the issue arises so that we can fix it along with any other potential related cases?

Comment on lines 252 to 254
width: cmd_strip.width as u16,
dense_width: cmd_strip.width as u16,
col: cmd_strip.alpha_ix as u32,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current commands, width and dense_width are always either the same, or dense_width is 0. If that stays this way, the absence of an alpha mask could be encoded as, say, col: u32::MAX. On the other hand, your current encoding would allow merging alpha fills and sparse fills into one command, meanings paths would be drawn with at most two vertex instances per row per wide tile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, your current encoding would allow merging alpha fills and sparse fills into one command, meanings paths would be drawn with at most two vertex instances per row per wide tile

Could you clarify this part? Are you referring to a more advanced batching logic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My current preference would be to have a:

#[cfg(all(not(vello_big_endian_unchecked), target_endian = "big"))]
compile_fail!("Vello currently does not support big endian targets. Enable the `vello_big_endian_unchecked` cfg flag to try and run it anyway.");

This is however better as a follow-up.

Copy link
Member

@tomcur tomcur Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify this part? Are you referring to a more advanced batching logic?

It could be done in batching, but the logic may fit better in command generation (after strip rendering). The idea is that if you're encoding width and dense_width for draw commands, you can fold CmdFill and CmdAlphaFill into a single command. That would reduce the number of vertex instances required to often be two per path per wide tile, reducing uploads to the GPU. You could go one further and encode a dense_offset as well, often requiring just one vertex instance per wide tile for drawing paths.

I've added a note to the renderer office hours agenda.

width: fill.width as u16,
dense_width: 0,
col: 0,
rgba: color.to_rgba8().to_u32(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should be the following.

Suggested change
rgba: color.to_rgba8().to_u32(),
rgba: color.premultiply().to_rgba8().to_u32(),

The same for the AlphaFill command below and the bg above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Fixed in 4485f2e

@tomcur
Copy link
Member

tomcur commented Mar 11, 2025

Ghostscript_Tiger_VH

Looking at the pixel coverage of the whiskers (and pixel RGBA values of those), I believe the image is missing unpremultiplication.

Copy link
Contributor Author

@grebmeg grebmeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the pixel coverage of the whiskers (and pixel RGBA values of those), I believe the image is missing unpremultiplication.

Fixed in cecfb41


// Create initial texture for alpha values
// It will be recreated if needed in prepare
let initial_alpha_texture_width = 64;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense.

Fixed in fa1d947

// Create initial texture for alpha values
// It will be recreated if needed in prepare
let initial_alpha_texture_width = 64;
let initial_alpha_texture_height = 64;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could pass the count of processed strips when initializing Renderer, but I’m a bit hesitant to tie Renderer::new directly to the data. Since we have prepare method, it might be a better place to consolidate all resource management logic. What do you think?

Something like in that 8f06569 commit

Comment on lines 387 to 391
self.queue.write_buffer(
&self.strips_buffer,
0,
bytemuck::cast_slice(&render_data.strips),
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree, we should explore that in a separate PR. I suspect it will provide a greater performance boost as strips_buffer grows larger. However, I’d like to gather some concrete data, so my initial focus will be on introducing performance measurements and then we can explore that. How does it sound?

Added TODO in 2c3de83

Comment on lines 411 to 414
assert!(
alpha_texture_height <= max_texture_dimension_2d,
"Alpha texture height exceeds WebGL2 limit"
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! Yeah, this approach increases storage by a factor of 4. Now, I’m using the Rgba32Uint texture format, which efficiently packs all u32 alpha values. I checked the WebGL2 spec (search for RGBA32UI), and it looks like this format is supported.

Fixed in beebbc0

/// Stroke a path with the current paint and stroke settings.
pub fn stroke_path(&mut self, path: &BezPath) {
flatten::stroke(path, &self.stroke, self.transform, &mut self.line_buf);
self.render_path(Fill::NonZero, self.paint.clone());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right, and you're correct that this approach has implications for batching. The current architecture processes and finalizes each path independently. In a more optimized renderer, we might batch similar paint operations to reduce state changes and improve GPU utilization. However, this would require more complex state management and potentially more memory usage to track all paths before rendering. Also, for the web environment, we are focusing on the simplest solution, which is single-threaded.

Comment on lines 215 to 222
let strip = GpuStrip {
x: tile_x as u16,
y: tile_y as u16,
width: WIDE_TILE_WIDTH as u16,
dense_width: 0,
col: 0,
rgba: bg,
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bringing that up! Would you prefer me to add it now, or wait until the issue arises so that we can fix it along with any other potential related cases?

width: fill.width as u16,
dense_width: 0,
col: 0,
rgba: color.to_rgba8().to_u32(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Fixed in 4485f2e

Comment on lines 252 to 254
width: cmd_strip.width as u16,
dense_width: cmd_strip.width as u16,
col: cmd_strip.alpha_ix as u32,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, your current encoding would allow merging alpha fills and sparse fills into one command, meanings paths would be drawn with at most two vertex instances per row per wide tile

Could you clarify this part? Are you referring to a more advanced batching logic?

// Fallback, should never happen
default: { return rgba.x; }
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have a final newline. You can normally configure your editor to add these automatically.

Unfortunately, I'm not aware of a good formatter for WGSL files, so this can't really be checked on CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, my bad! I missed that.

You can normally configure your editor to add these automatically.

Yeah, I recently switched to vscode for some projects and haven’t configured everything yet.

Fixed in 93038ab

@grebmeg grebmeg requested review from DJMcNab, taj-p and tomcur March 12, 2025 10:05
Copy link
Member

@DJMcNab DJMcNab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've mentioned a blocking concern in #gpu > Vello Hybrid (of incorrect rendering after the first frame). Once we get that resolved, I'm happy for this to land from my perspective.

The GPU resource management story is something that makes me quite sad (because the existing state in Vello is not great, and that's been brought in here); there are quite a few changes I want to make here. However, those are better as follow-ups.

}

/// Fill a rectangle with the current paint and fill rule.
pub fn fill_rect(&mut self, rect: &Rect) {
Copy link

@taj-p taj-p Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we re-expect kurbo so that consumers don't need to pull it in themselves? e.g.

pub use peniko::kurbo;

}

/// Prepare the GPU buffers for rendering
pub fn prepare(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn prepare(
fn prepare(

Currently this is only called in render_to_texture. Maybe we should keep it private until we think it should be exposed?


/// Options for the renderer
#[derive(Debug)]
pub struct RendererOptions {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of providing a default implementation for RendererOptions?

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants