Pass opline as argument to opcode handlers in CALL VM #17952

Draft · wants to merge 15 commits into base: master
Conversation

@arnaud-lb (Member) commented Feb 28, 2025

Related:

This extracts the part of #17849 that passes opline as opcode handler argument in the Call VM. This should reduce the size of #17849, and also unify the Hybrid and Call VMs slightly.

Currently we have two VMs:

  • Hybrid: Used when compiling with GCC, execute_data and opline are global register variables
  • Call: Used when compiling with something else, execute_data is passed as opcode handler arg, and opline is loaded/stored from/to execute_data->opline.

The Call VM looks like this:

while (1) {
    ret = execute_data->opline->handler(execute_data);
    if (UNEXPECTED(ret != 0)) {
        if (ret > 0) { // returned by ZEND_VM_ENTER() / ZEND_VM_LEAVE()
            execute_data = EG(current_execute_data);
        } else {       // returned by ZEND_VM_RETURN()
            return;
        }
    }
}

// example op handler
int ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data) {
    // load opline
    const zend_op *opline = execute_data->opline;

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    execute_data->opline++;
    return 0; // ZEND_VM_CONTINUE()
}

Opcode handlers return a positive value to signal that the loop must load a new execute_data from EG(current_execute_data), typically when entering or leaving a function.

Changes in this PR

  • Pass opline as opcode handler argument
  • Return opline from opcode handlers
  • ZEND_VM_ENTER / ZEND_VM_LEAVE return opline | (1ULL<<63) to signal that execute_data must be reloaded from EG(current_execute_data)

This gives us:

while (1) {
    opline = opline->handler(execute_data, opline);
    if (UNEXPECTED((intptr_t) opline <= 0)) {
        if (opline != NULL) { // returned by ZEND_VM_ENTER() / ZEND_VM_LEAVE()
            opline = (const zend_op *)((uintptr_t)opline & ~(1ULL<<63));
            execute_data = EG(current_execute_data);
        } else {              // returned by ZEND_VM_RETURN()
            return;
        }
    }
}

// example op handler
const zend_op * ZEND_INIT_FCALL_SPEC_CONST_HANDLER(zend_execute_data *execute_data, const zend_op *opline) {
    // opline already loaded

    // instruction execution

    // dispatch
    // ZEND_VM_NEXT_OPCODE():
    return ++opline;
}

In addition to making the changes of #17849 smaller and unifying the VMs slightly, this improves the performance of the Call VM:

bench.php is 23% faster:

; hyperfine -L php no-opline,opline-arg '/tmp/{php}/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php'
Benchmark 1: /tmp/no-opline/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php
  Time (mean ± σ):     549.4 ms ±   4.7 ms    [User: 536.3 ms, System: 11.8 ms]
  Range (min … max):   542.2 ms … 558.6 ms    10 runs
 
Benchmark 2: /tmp/opline-arg/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php
  Time (mean ± σ):     445.5 ms ±   2.7 ms    [User: 432.5 ms, System: 12.0 ms]
  Range (min … max):   441.3 ms … 449.0 ms    10 runs
 
Summary
  /tmp/opline-arg/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php ran
    1.23 ± 0.01 times faster than /tmp/no-opline/sapi/cli/php -n -d zend_extension=opcache.so -d opcache.enable_cli=1 --repeat 3 Zend/bench.php

Symfony Demo is 2.8% faster:

base:        mean:  0.5373;  stddev:  0.0004;  diff:  -0.00%
opline-arg:  mean:  0.5222;  stddev:  0.0008;  diff:  -2.82%

JIT

When using the Hybrid VM, JIT stores execute_data/opline in two fixed callee-saved registers and rarely touches EX(opline), just like the VM.

Since the registers are callee-saved, the JIT'ed code doesn't have to save them before calling other functions, and can assume they always contain execute_data/opline. The code also avoids saving/restoring them in prologue/epilogue, as execute_ex takes care of that (JIT'ed code is called exclusively from there).

When using the Call VM, we can do that too, except that we can't rely on execute_ex to save the registers for us, as it may use these registers itself. So we have to save/restore the two registers in the JIT'ed code prologue/epilogue.

TODO

  • Test x86
  • Test aarch64

Commit messages (excerpts from the 15 commits):

  • So that execute_ex/opline are always passed via (the same) registers
  • Instead of returning -opline. This simplifies the JIT, as we can simply return -1 for VM_ENTER/VM_LEAVE. However, this implies that EX(opline) must be in sync.
  • This reverts commit 9aaceaf6a2480836ae59d9657d37b10bbe04268e.
  • This simplifies the JIT compared to returning -opline, and doesn't require EX(opline) to be in sync.
  • Saving them in execute_ex is not safe when not assigning them to global regs, as the compiler may allocate them.
@arnaud-lb arnaud-lb requested a review from dstogov February 28, 2025 13:26
@dstogov (Member) left a comment


This is really interesting. This will work with MSVC as well, right?

On bench.php callgrind shows significant improvement without JIT, and slight improvement with tracing JIT.

As I understood it, the main improvement comes from the elimination of EX(opline) loading in each instruction handler (elimination of USE_OPLINE) and the improvement of prologues and epilogues of short handlers. Right? Maybe I missed something?

I'm really surprised by the effect. I understand the improvement coming from caching EX(opline) in a preserved CPU register, but in my opinion, passing and returning it across all handlers should smooth out the effect of the load elimination. It seems I was wrong.

I think this should be finalized, carefully reviewed and merged.

Comment on lines -373 to +379
-# define ZEND_OPCODE_HANDLER_ARGS zend_execute_data *execute_data
-# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU execute_data
-# define ZEND_OPCODE_HANDLER_ARGS_DC , ZEND_OPCODE_HANDLER_ARGS
-# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU_CC , ZEND_OPCODE_HANDLER_ARGS_PASSTHRU
+# define ZEND_OPCODE_HANDLER_ARGS zend_execute_data *execute_data, const zend_op *opline
+# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU execute_data, opline
+# define ZEND_OPCODE_HANDLER_ARGS_DC ZEND_OPCODE_HANDLER_ARGS,
+# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU_CC ZEND_OPCODE_HANDLER_ARGS_PASSTHRU,
Member
Originally PHP used _D, _C, _DC and _CC suffixes for macros that implement implementation-defined arguments. I can't remember what they mean. Probably C = comma and D = declaration.
@derickr do you remember?

Probably, we shouldn't use the same convention for something else.

Member Author
I agree. I'm not sure this change (commit 7d7be6d) makes a difference anymore, so I may revert it.

Comment on lines +440 to +441
# define ZEND_VM_ENTER_BIT (1ULL<<(UINTPTR_WIDTH-1))
# define ZEND_VM_ENTER_EX() return (zend_op*)((uintptr_t)opline | ZEND_VM_ENTER_BIT)
Member
The assumption that the high bit of an address must be zero may be wrong.

Member Author

I think it's right at least on x86_64, but I will check other platforms. I hope I can keep this scheme as it allows this check to be a single instruction:

out($f, $m[1]."if (UNEXPECTED((intptr_t)OPLINE <= 0))".$m[3]."\n");

#ifdef ZEND_VM_FP_GLOBAL_REG
execute_data = vm_stack_data.orig_execute_data;
# ifdef ZEND_VM_IP_GLOBAL_REG
opline = vm_stack_data.orig_opline;
# endif
return;
#else
if (EXPECTED(ret > 0)) {
if (EXPECTED(opline != NULL && (uintptr_t)opline != ZEND_VM_ENTER_BIT)) {
Member
This check must be more expensive...
The second part looks incorrect. I suppose it should check only a single bit.

-static int ZEND_FASTCALL zend_runtime_jit(void)
+static ZEND_OPCODE_HANDLER_RET ZEND_FASTCALL zend_runtime_jit(ZEND_OPCODE_HANDLER_ARGS)
Member
Oops. It looks like the previous prototype was wrong.

#if GCC_GLOBAL_REGS
return; // ZEND_VM_CONTINUE
#else
return op_array->opcodes; // ZEND_VM_CONTINUE
Member
We may skip leading RECV instructions. Actually, you should return the opline passed as argument or EX(opline).

Comment on lines -1146 to -1149
#ifndef HAVE_GCC_GLOBAL_REGS
opline = EX(opline);
#endif

Member

Will this work with other VM types (GOTO and SWITCH)?

Member Author

I will check!

@arnaud-lb (Member Author)

This is really interesting. This will work with MSVC as well, right?

I assume so, but I still need to test on Windows/MSVC, x86, aarch64.

On bench.php callgrind shows significant improvement without JIT, and slight improvement with tracing JIT.

As I understood it, the main improvement comes from the elimination of EX(opline) loading in each instruction handler (elimination of USE_OPLINE) and the improvement of prologues and epilogues of short handlers. Right? Maybe I missed something?

I'm really surprised by the effect. I understand the improvement coming from caching EX(opline) in a preserved CPU register, but in my opinion, passing and returning it across all handlers should smooth out the effect of the load elimination. It seems I was wrong.

Yes, it is my intuition that eliminating the load/stores of EX(opline) is what yields the improvements.

A comparison of the annotated code of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER (the hottest handler in bench.php) between branches shows that we have fewer instructions in the current branch, and fewer memory accesses:

Base branch:

Dump of assembler code for function ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER:
   0x0000000000aa5a10 <+0>:	mov    (%rdi),%rax          // opline = EX(opline)
   0x0000000000aa5a13 <+3>:	movslq 0x8(%rax),%rcx
   0x0000000000aa5a17 <+7>:	movslq 0xc(%rax),%rdx
   0x0000000000aa5a1b <+11>:	movslq 0x10(%rax),%rsi
   0x0000000000aa5a1f <+15>:	movsd  (%rdi,%rcx,1),%xmm0
   0x0000000000aa5a24 <+20>:	addsd  (%rdi,%rdx,1),%xmm0
   0x0000000000aa5a29 <+25>:	movsd  %xmm0,(%rdi,%rsi,1)
   0x0000000000aa5a2e <+30>:	movl   $0x5,0x8(%rdi,%rsi,1)
   0x0000000000aa5a36 <+38>:	add    $0x20,%rax          // opline++
   0x0000000000aa5a3a <+42>:	mov    %rax,(%rdi)         // EX(opline) = opline
   0x0000000000aa5a3d <+45>:	xor    %eax,%eax           // ret = 0
   0x0000000000aa5a3f <+47>:	ret

Current branch:

Dump of assembler code for function ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER:
   0x0000000000aa68a0 <+0>:	movslq 0x8(%rsi),%rax
   0x0000000000aa68a4 <+4>:	movslq 0xc(%rsi),%rcx
   0x0000000000aa68a8 <+8>:	movslq 0x10(%rsi),%rdx
   0x0000000000aa68ac <+12>:	movsd  (%rdi,%rax,1),%xmm0
   0x0000000000aa68b1 <+17>:	addsd  (%rdi,%rcx,1),%xmm0
   0x0000000000aa68b6 <+22>:	movsd  %xmm0,(%rdi,%rdx,1)
   0x0000000000aa68bb <+27>:	movl   $0x5,0x8(%rdi,%rdx,1)
   0x0000000000aa68c3 <+35>:	lea    0x20(%rsi),%rax    // ret = ++opline
   0x0000000000aa68c7 <+39>:	ret

It's possibly slightly slower than using global fixed registers, but in comparison to using EX(opline), it's almost equivalent. In the fast path, opline will just be held in either %rsi or %rax and will not be spilled. It will need to be moved back and forth between the two registers when returning from and calling op handlers, but this is less expensive than loading/storing EX(opline) and uses fewer instructions. We occasionally need to preserve %rsi or %rax before function calls, but this tends to happen only in slow paths, and the save can be in another register, not necessarily on the stack.

In the case of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER, opline remains in %rsi, and is only moved to %rax before ret: lea 0x20(%rsi),%rax.


dstogov commented Mar 3, 2025

Yes, it is my intuition that eliminating the load/stores of EX(opline) is what yields the improvements.

Right. I missed the store. And this should make the biggest impact on real-life performance.

A comparison of the annotated code of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER (the hottest handler in bench.php).

These specialized instructions are rare in real-life workloads. Your tests showed a relatively small improvement on Symfony. Note that finalizing may decrease this even more. On the other hand, some related optimizations may be discovered.

It's possibly slightly slower than using global fixed registers, but in comparison to using EX(opline), it's almost equivalent. In the fast path, opline will just be held in either %rsi or %rax and will not be spilled. It will need to be moved back and forth between the two registers when returning from and calling op handlers, but this is less expensive than loading/storing EX(opline) and uses fewer instructions. We occasionally need to preserve %rsi or %rax before function calls, but this tends to happen only in slow paths, and the save can be in another register, not necessarily on the stack.

In the case of ZEND_ADD_DOUBLE_SPEC_TMPVARCV_TMPVARCV_HANDLER, opline remains in %rsi, and is only moved to %rax before ret: lea 0x20(%rsi),%rax.

I see, and this makes sense.
