PHP refcount as an optimization
The PHP refcount mechanism can be used for optimizations, and it has been used that way for years. When code needs to change one of the two by-value types – string or array – it first checks the number of references. If the refcount equals one, only the current code holds a reference and nobody else can observe the change, so the engine can modify the string or array data directly in place without allocating a copy first.
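A minimal userland sketch of that copy-on-write behaviour (the refcounts in the comments are illustrative; ordinary PHP code never sees them directly):

```php
$a = [1, 2, 3];   // one zend_array, refcount 1
$b = $a;          // no copy yet: $a and $b share the same zend_array, refcount 2
$b[] = 4;         // write with refcount > 1 => the array is separated (copied) first;
                  // $a stays [1, 2, 3], only $b becomes [1, 2, 3, 4]

$c = [5, 6];      // refcount 1
$c[] = 7;         // refcount 1 => modified directly in place, no copy allocated
```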
But the refcount can also be used directly.
For example, the function `unpack` for parsing binary data must return an array (for some reason indexed from one) even if it parsed only one int. As a result, the simple `unpack('v', $str)[1]` takes 75 ns, which is shamefully slow. Essentially, the useful work is just three instructions: a `cmp` + `jb` to check that there are enough bytes in the string, and one `mov`. That should not take 75 ns but about one clock cycle.
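For concreteness, this is the call in question – `'v'` means an unsigned 16-bit little-endian integer, and the result is an array indexed from one:

```php
$str = "\x34\x12";               // two bytes, little endian
var_dump(unpack('v', $str));     // array(1) { [1] => int(4660) }, i.e. 0x1234
var_dump(unpack('v', $str)[1]);  // int(4660)
```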
However, I can slightly optimize the case when I'm parsing just one integer. If I take advantage of the fact that the array is a by-value type, I can implement it like this (pseudo-code; in reality it would be a native function):
```php
function unpack($str) {
    // Cached array reused as the return value across calls.
    // refcount() and chomp2Bytes() are pseudo-functions; the real thing would be native code.
    static $arr1 = [1 => null];
    if (refcount($arr1) > 1) {
        // Someone kept a reference to the previously returned array – allocate a fresh one.
        $arr1 = [1 => null];
    }
    $arr1[1] = chomp2Bytes($str);  // read two bytes from $str and interpret them as an int
    return $arr1;
}
```
The changed `unpack` function would internally always hold one reference to the array and repeatedly use it as the return value.
One of two situations may happen:
- The caller code changes the returned array. We don't mind, because the array is a by-value type and the engine can change such values in place only when their refcount is 1 (see the macro `SEPARATE_ARRAY`). Since `unpack` always holds one reference and the caller code holds the other, the array must be copied first and only the copy gets changed. The version held by `unpack` stays unmodified.
- The caller code saves the returned array in a variable. We detect this situation directly in the next call to `unpack`, because suddenly the refcount is higher than 1, and we simply allocate a new array. But this happens very rarely, at least for the simple cases where I extract only one int and don't use the syntax for named results à la `unpack('Vid/Vtimestamp/vflags')`.
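From the caller's side the two cases would look roughly like this, assuming the modified `unpack` above (`$packet` and `$packet2` are just placeholder strings):

```php
// Case 1: the result is consumed immediately and thrown away.
// The cached array's refcount drops back to 1, so the next call can reuse it.
$flags = unpack('v', $packet)[1];

// Case 2: the caller keeps the returned array in a variable.
// Now both the caller and unpack() hold a reference (refcount > 1),
// so the next call detects this and allocates a fresh array.
$header = unpack('v', $packet);
$other  = unpack('v', $packet2);
```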
With this optimization, `unpack('v', $str)[1]` takes 50 ns (time measured in the interpreter, JIT turned off). One third better, but still shamefully slow.
The main problem with the `unpack` function is its complex semantics: it has to parse a pattern string (in this case just `'v'`, but `unpack` must be able to handle complicated patterns like `unpack('Vid/Vdate/Ccount/V*children', $str)`) and then, in a gigantic switch, jump to those few instructions that do the actual useful work of interpreting some number of bytes as an int or a float of a given length, signedness and endianness. This logic, however, takes a good 25 ns.
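To illustrate the pattern syntax that makes all this parsing necessary – names after the format codes become string keys, and the `*` repeater consumes the rest of the data:

```php
$data = pack('VVC', 42, 1700000000, 3) . pack('V3', 7, 8, 9);
var_dump(unpack('Vid/Vdate/Ccount/V*children', $data));
// ["id" => 42, "date" => 1700000000, "count" => 3,
//  "children1" => 7, "children2" => 8, "children3" => 9]
```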
That, in the end, is the reason I'm writing this as an article and not as a PHP patch. No matter how hard I try, I can't get to numbers I wouldn't have to be ashamed of.
To give you an idea of how the `unpack` function stands compared to the native alternative (test code here, PHP JIT disabled, CPU i5-4570):
|                       | ns/iter |
|-----------------------|---------|
| empty loop            | 7.87    |
| noop()                | 15.20   |
| unpackInt32LE()       | 16.26   |
| unpack('V', $str)[1]  | 72.18   |
The first line is an empty loop, to estimate the unavoidable overhead of the interpreter. `noop()` calls a native function that does nothing; this measures the overhead of the loop plus a native function call. `unpackInt32LE` is a native function that takes 4 bytes from the string and interprets them as a little-endian integer.
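The idea of the measurement is simply a tight loop timed with a high-resolution clock, roughly along these lines:

```php
const N = 10_000_000;
$str = pack('V', 123456);

$t0 = hrtime(true);                  // nanoseconds
for ($i = 0; $i < N; $i++) {
    $x = unpack('V', $str)[1];       // or an empty body for the baseline
}
$t1 = hrtime(true);

printf("%.2f ns/iter\n", ($t1 - $t0) / N);
```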
It is the following function (in the D language), exported to PHP via my (so far private) library for easy creation of PHP extensions.
```d
uint unpackInt32LE(scope const(ubyte)[] str)
{
    pragma(inline, true);
    if (str.length < 4)
        throw new Exception("not enough bytes");
    // Reinterpret the first 4 bytes as a 32-bit integer (little endian on x86).
    return *(cast(uint*) str.ptr);
}
```
The last line, `unpack('V', $str)[1]`, includes the time of the loop iteration, the native function call, and the internal machinations of `unpack`.
The int extraction itself (plus parsing the function arguments) takes about one nanosecond in this case, the loop overhead plus the native function call is 15× larger, and `unpack` itself is still 5× slower than that. There's no point in trying to optimize such an inefficiently designed function when the native option is so much faster, and even plain FFI surpasses the speed of `unpack` without any problems.
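For completeness, a rough sketch of what the FFI route could look like – the shared library name and the exact C signature here are assumptions for illustration, not something that ships with this article:

```php
// Hypothetical shared library exposing the little-endian reader with a C ABI.
$ffi = FFI::cdef(
    "uint32_t unpackInt32LE(const char *buf, size_t len);",
    __DIR__ . "/libunpack.so"
);

$str   = pack('V', 123456);
$value = $ffi->unpackInt32LE($str, strlen($str));  // PHP strings are passed as const char*
var_dump($value);                                  // int(123456)
```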