Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Simd json encode #120

Open
wants to merge 46 commits into
base: master
Choose a base branch
from
Open

Simd json encode #120

wants to merge 46 commits into from

Conversation

nielsdos
Copy link
Owner

@nielsdos nielsdos commented Feb 3, 2025

TODO:

  • more performant resolver (too much call overhead, use template-style code) SEMI-DONE, it's not worth going for the extra few percentages because that will create a lot of code bloat.
  • verifications
  • bitset stuff: does the widening worsen performance? (A: no)
  • benchmarks

@mvorisek
Copy link

mvorisek commented Feb 4, 2025

I have no C toolchain setup so here is my idea in human words:

do {
#if simd support
    if (len >= sizeof(__m128i)) {
        calc mask;
        if (mask is "all not-to-be-escaped") {
            pos += sizeof(__m128i);
            continue;
        }
        mask_length = sizeof(__m128i);
    } else {
        set mask to "all maybe to-be-escaped"
        mask_length = len;
    }
    
    for (i = 0; i < mask_length; i++) {
        if (mask[i] == "not-to-be-escaped") {
            pos++; // OR i += -1 + , pos += "not-to-be-escaped" length calculated from mask if faster
            continue;
        }
        
        if (pos) {
            add unescaped;
        }

#endif
        handle single "maybe to-be-escaped" character as before;
#if simd support
    }
#endif
} while (len);

#if simd support
if (pos) {
    add unescaped;
}
#endif

The keypoint is to set/use mask even for non-SIMD (no SIMD support or too short input) to be able to process the "maybe to-be-escaped" characters at one place as before. That should generate shorter code, optimal performance for "never-to-be-escaped" chracters and hopefully introduce minimal overhead for other characters.

@nielsdos
Copy link
Owner Author

nielsdos commented Feb 4, 2025

A couple of points regarding your suggestion:

  • I already use the mask to know where to put the escape characters. I do this by looping over all set bits, appending in bulk from the input what we didn't append yet, and then escaping the new character. This code is here:
    do {
    /* Note that we shift the input forward, so we have to shift the mask as well,
    * beyond the to-be-escaped character */
    int len = zend_ulong_ntz(mask);
    mask >>= len + 1;
    smart_str_appendl(buf, s, len + pos);
    pos += len;
    us = (unsigned char)s[pos];
    s += pos + 1; /* skip 'us' too */
    pos = 0;
    bool handled = php_json_printable_ascii_escape(buf, us, options);
    ZEND_ASSERT(handled == true);
    } while (mask != 0);
  • The suggestion to use the mask even for short inputs will still make a slowdown. In my testing the overhead of looping over the bits and performing the check is higher than just doing a "dumb" byte per byte loop.
  • The remaining performance "issue" is regarding the case where we have to escape (almost) everything. The performance for not needing to escape is pretty good.
  • My code would've been closer to your pseudocode if we didn't need to cater for non-printable or UTF-8 characters. This adds an extra unavoidable complexity to the control flow of the algorithm.

@nielsdos
Copy link
Owner Author

nielsdos commented Feb 5, 2025

VTune shows some DSB stalls, that I have tried to improve by changing code layout. I also see some bad speculation (machine check) and data stalls in php_json_printable_ascii_escape (as expected) that I want to investigate.

@nielsdos
Copy link
Owner Author

nielsdos commented Feb 6, 2025

The overhead of the worst case with escapes is now only around 6%, so relatively small.
This is mostly due to tweaking of the algorithm, and tightening the code layout.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants