Optimize surrogate decoding. #894

Status: Open, 2 commits to merge into base branch main

lrhn (Member) commented May 26, 2025

Use `char ^ 0xD800 <= 0x3FF` to check whether a char code is a lead surrogate. That avoids a later `& 0x3FF` to get rid of the top bits. The same applies to tail surrogates.
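
As a rough illustration of the check described above (a sketch, not the code in this PR; the name `combineSurrogates` is made up for the example): XOR-ing with the base of the surrogate range both tests membership and leaves only the 10 payload bits, so no separate mask is needed afterwards.

```dart
/// Hypothetical helper, not part of package:characters.
/// Returns the code point encoded by a lead/tail surrogate pair,
/// or null if the two code units do not form a valid pair.
int? combineSurrogates(int lead, int tail) {
  // For a lead surrogate (0xD800..0xDBFF) the XOR clears the fixed top
  // bits and leaves only the low 10 payload bits (0..0x3FF). Any other
  // 16-bit code unit keeps some higher bit set, so the result is > 0x3FF.
  final leadBits = lead ^ 0xD800;
  // Same idea for tail surrogates (0xDC00..0xDFFF).
  final tailBits = tail ^ 0xDC00;
  if (leadBits <= 0x3FF && tailBits <= 0x3FF) {
    // The XOR results already are the payload bits; no `& 0x3FF` needed.
    return 0x10000 + (leadBits << 10) + tailBits;
  }
  return null;
}
```

In Dart, `^` binds tighter than `<=`, so the unparenthesized form `char ^ 0xD800 <= 0x3FF` means `(char ^ 0xD800) <= 0x3FF`.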

This ensures that the `high` function gets values without high bits, which makes it smaller (it tries to get inlined, so being a little smaller counts).

Also optimize that function to reduce dependency depth and try to hit `base + (something < small)` expressions that can be optimized into a single x64 address computation.
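
To make the `base + (something < small)` shape concrete, here is a toy two-level table lookup, loosely modelled on the lines quoted in the review comments. The table layout, the 256-entry chunk size, and the names `lookupPair`, `start` and `data` are assumptions for illustration, not the package's actual tables.

```dart
/// Toy two-level lookup for a surrogate pair (hypothetical layout).
///
/// [lead] and [tail] are the 10-bit payloads of the surrogates, i.e. the
/// values left after `^ 0xD800` and `^ 0xDC00`, so both are in 0..0x3FF.
/// [start] maps a chunk number to the offset of that chunk in [data];
/// each chunk covers 256 consecutive code points.
int lookupPair(List<int> start, List<int> data, int lead, int tail) {
  // Chunk number: the top 12 of the 20 payload bits. No masking is needed
  // because neither input carries bits above bit 9.
  final chunk = (lead << 2) + (tail >> 8);
  // Position inside the chunk: always below 256.
  final low = tail & 255;
  // The final index has the form `base + small`. `low` does not depend
  // on the first table load, so it need not wait for that load.
  return data[start[chunk] + low];
}
```

With unmasked inputs, `lead << 2` would carry stray high bits into the chunk number, which is why the caller-side XOR check matters for this form.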

Gives a ~7% performance increase on backwards traversal and a 30% increase on forward traversal, based on `tool/benchmark.dart` compiled with `dart compile exe`.
On web there is actually a small decrease in performance for forward iteration and a small increase for backwards iteration; Wasm follows web in performance here.

(Also found a bug in the generator, which hasn't worked since it was last committed.)

Interestingly, the change makes little-to-no difference on the benchmark/benchmark.dart benchmark.
(Maybe even makes it a little slower.)

Use `char ^ 0xD800 <= 0x3FF` to check if a char code is a lead
surrogate. That avoids doing a later `& 0x3FF` to get rid of the
top bits. Similar for tail surrogate.

This ensures that the `high` function gets values without high
bits.
Also optimize that function to reduce dependency depth and
try to hit `base + (something < small)` expressions that can
be optimized into a single x64 address computation.

Gives a ~7% increase on backwards traversal and 38% increase for
forward traversal, based on tool/benchmark.dart compiled with
`dart compile exe`.
lrhn force-pushed the characters-opt branch from fb93992 to 0c4d7b3 on May 26, 2025 13:55

Package publishing

| Package | Version | Status | Publish tag (post-merge) |
| --- | --- | --- | --- |
| package:args | 2.7.0 | already published at pub.dev | |
| package:async | 2.13.1-wip | WIP (no publish necessary) | |
| package:characters | 1.4.1 | ready to publish | characters-v1.4.1 |
| package:collection | 1.20.0-wip | WIP (no publish necessary) | |
| package:convert | 3.1.3-wip | WIP (no publish necessary) | |
| package:crypto | 3.0.7-wip | WIP (no publish necessary) | |
| package:fixnum | 1.2.0-wip | WIP (no publish necessary) | |
| package:lints | 6.0.1-wip | WIP (no publish necessary) | |
| package:logging | 1.3.1-wip | WIP (no publish necessary) | |
| package:os_detect | 2.0.4-wip | WIP (no publish necessary) | |
| package:path | 1.9.2-wip | WIP (no publish necessary) | |
| package:platform | 3.1.7-wip | WIP (no publish necessary) | |
| package:typed_data | 1.4.1-wip | WIP (no publish necessary) | |

Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation.

github-actions bot commented May 26, 2025

PR Health

Breaking changes ✔️
| Package | Change | Current Version | New Version | Needed Version | Looking good? |
| --- | --- | --- | --- | --- | --- |
| characters | None | 1.4.0 | 1.4.1 | 1.4.0 | ✔️ |
Changelog Entry ✔️
Package Changed Files

Changes to files need to be accounted for in their respective changelogs.

Coverage ⚠️
| File | Coverage |
| --- | --- |
| pkgs/characters/lib/src/characters_impl.dart | 💚 90 % ⬆️ 0 % |
| pkgs/characters/lib/src/grapheme_clusters/breaks.dart | 💚 97 % ⬆️ 0 % |
| pkgs/characters/lib/src/grapheme_clusters/table.dart | 💚 100 % |
| pkgs/characters/tool/bin/generate_tables.dart | 💔 Not covered |
| pkgs/characters/tool/src/string_literal_writer.dart | 💔 Not covered |

This check for test coverage is informational (issues shown here will not fail the PR).

This check can be disabled by tagging the PR with skip-coverage-check.

API leaks ✔️

The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.

Package Leaked API symbols
License Headers ✔️
// Copyright (c) 2025, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files
no missing headers

All source files should start with a license header.

mosuem (Member) left a comment

LGTM - Although I don't have much knowledge of this package, and don't understand its intricacies. But what I understand makes sense to me.

var index = chunkStart + (tail & 255);
return _data.codeUnitAt(index);
var offset = (tail >> 8) + (lead << 2);
tail &= 255;

Why do the assignment instead of the original chunkStart + (tail & 255)?

var chunkStart = _start.codeUnitAt(offset >> 8);
var index = chunkStart + (tail & 255);
return _data.codeUnitAt(index);
var offset = (tail >> 8) + (lead << 2);

IIUC, this assumes that tail and lead don't need to be masked with 0x3ff. Should this be asserted here?
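
A minimal sketch of what such an assertion could look like, assuming the inputs are expected to be the already-XOR-ed 10-bit payloads. The function name `chunkNumber` is hypothetical and only wraps the quoted expression for illustration.

```dart
/// Hypothetical wrapper showing where such assertions could go.
int chunkNumber(int lead, int tail) {
  // Only checked when assertions are enabled (for example with
  // `dart run --enable-asserts`); production builds ignore them.
  assert(lead >= 0 && lead <= 0x3FF, 'lead must be a 10-bit value');
  assert(tail >= 0 && tail <= 0x3FF, 'tail must be a 10-bit value');
  return (tail >> 8) + (lead << 2);
}
```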
