-
Notifications
You must be signed in to change notification settings - Fork 928
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[FEA] Faster path for calculating total output symbols in FST #17114
Comments
After discussions with @elstehle, here's a brief overview of two possible approaches considered:
Though approach 2 has fewer global memory reads and writes, re-computing the number of translated characters per thread can be expensive. The plan is to start with approach 1 and implement |
For approach 2,
Instead, can we take approach similar to cub, where we can pass |
The output iterators don't have to be pointers but can be arbitrary iterators. It will be difficult to agree on a sentinel value (like I don't have a clear favorite. I think the advantage of the callback is that I expect it to be easier to implement, because we don't have to figure out the right way to properly re-enter the Here's an example function object implementation that we had in mind:
Another question that came up while I was thinking about all this: Do we still want to keep the code path for when we are running the FST in "one go" (i.e., continue using decoupled look-back to merge the output offset computation and writing the translated output)? In previous discussions I always assumed the answer was "yes", because I assume it to be more performant. I would like to hear how you are thinking about this? |
IMO, I think we can retain the decoupled look-back approach and introduce a 'low-memory' policy that adopts this new approach i.e. writing output offsets to global memory. |
Is your feature request related to a problem? Please describe.
To create a faster path for calculating total output symbols in FST, without redundantly computing some values, such as the state-transition vectors and write offsets in the FST.
Also, if the impact is significant, provide a way to optimize the FST kernel not to redundantly compute number of total output symbols, if it was passed with discard iterator.
Describe the solution you'd like
A reduce kernel to compute total number of output symbols.
Describe alternatives you've considered
used make_discard_iterator in #16978
The compiler should optimize and remove redundant parts, But still some redundant parts are still computed.
@elstehle shows it has 10% runtime impact. Impact on just FST alone would be much more.
Additional context
#16978 (comment) Benchmark shows addtional 10% impact on overall JSON reader runtime.
The text was updated successfully, but these errors were encountered: