Don't believe that our hand-coded assembler optimisation was worthwhile?
We always value feedback on our presentations, but we were more than a little surprised to hear that some readers were doubtful about the potential of the performance increase gained by our hand optimisation of the bit-shifting code in our Candlelit Console kernel patchset.
We were always confident of the benefits, but since people have shown an interest, we decided to take some time out to do some benchmarks, and see just what was going on.
In the original article we mentioned:
`The bit shifting calculation above is actually much easier in X86 assembler than it is in C.'
=> Find the original article here
Actually, the main point that we intended to make here was that the split register concept which is fairly unique to X86 CPUs is not really taken advantage of by C compilers, but that it can nevertheless be made use of in cases were we are dealing with several 8-bit values packed into a 32-bit word, and when you happen to want to modify the element stored in the lower bits. This happened to be the blue channel in our 32-bit RGB value.
The `best' code that the C compiler produced with a short test program that did bit shifts on random numbers, seemed to be this:
movl %eax, %ecx # Copy the value returned from arc4random_uniform from eax to ecx andl $16776960, %ecx # ecx & 0xffff00 shrl %eax # eax >> 1 andl $127, %eax # eax & 0x7f leal (%rax,%rcx), %esi # esi = (rax + rcx)
It's difficult to believe that this could possibly be faster than a single opcode to shift the lower eight bits of the accumulator, even given the complex pipeling and caching that goes on within modern CPUs.
We'll test the two approaches in two different ways. First, a synthetic benchmark to test the raw speed of the corresponding instructions outside of a wider program. Next, we'll benchmark the bit-shifting within the actual rasops code where we originally put it for testing.
Note that in the final version of the Candlelit Console kernel patch, the bit-shifting was moved from the putchar32 function, where it was called for every single character output, into the device colour map initialisation code, which was then called every time our modified rasops_eraserows function was called. Whilst in the putchar function, the overhead could plausibly be significant, in this new location any overhead would be almost undetectable on any reasonably modern hardware, so any performance gain from using assembler is somewhat moot anyway, but it's the principle that we're discussing here.
Building a purely synthetic benchmark is fairly straightforward. We just need to loop a fixed number of times and shift the lower eight bits right by one bit on each iteration of the loop.
.section ".note", "a" .p2align 2 .long 0x08 .long 0x04 .long 0x01 .ascii "OpenBSD\0" .long 0x00 .p2align 2 .section .text .globl _start _start: /* Clear the accumulator */ xor %rax, %rax /* %rbx is our loop counter */ mov $0xFFFFFFFF, %rbx /* Now loop doing right shifts on the lower 8 bits of the accumulator: */ _loop: shrb %al dec %rbx cmp $0, %rbx jnz _loop /* Exit */ mov $0x01, %rax syscall
After assembling and linking the above program, we can time how long it takes to run to completion using the shell:
# as -o benchmark1.o benchmark1.S # ld --Bstatic --no-pie benchmark1.o # time ./a.out 0m00.95s real 0m00.95s user 0m00.00s system
Hand written assembler: 0m00.95s real
Now let's repeat the benchmark using code that follows what we saw the C compiler emit:
.section ".note", "a" .p2align 2 .long 0x08 .long 0x04 .long 0x01 .ascii "OpenBSD\0" .long 0x00 .p2align 2 .section .text .globl _start _start: /* Clear the accumulator */ xor %rax, %rax /* %rbx is our loop counter */ mov $0xFFFFFFFF, %rbx /* Now loop doing right shifts on the lower 8 bits of the accumulator: */ _loop: movl %eax, %ecx andl $0xffff00, %ecx shrl %eax andl $0x7f, %eax leal (%rax, %rcx), %esi movl %esi, %eax dec %rbx cmp $0, %rbx jnz _loop /* Exit */ mov $0x01, %rax syscall
Do we really expect this to be faster?
# as -o benchmark2.o benchmark2.S # ld --Bstatic --no-pie benchmark2.o # time ./a.out 0m03.02s real 0m03.03s user 0m00.00s system
Assembler based on C compiler output: 0m03.02s real
The difference is quite obvious, the single op-code version of the bit-shifting code is three times as fast on the test machine. Testing on two more X86 machines, the speed ratio was consistently almost identical at about three to one.
Considering these results, we conclude that the hand assembler optimisation easily wins the synthetic benchmark on all of the test hardware.
Assembler: 1, C compiler: 0
OK, synthetic benchmarks are, afterall, synthetic. Maybe real-world performance would be different. Maybe the %al opcode will stall a pipeline, maybe our extract of assembler code in the second example wasn't typical of what the C compiler would produce in real usage, or perhaps the code just doesn't even get called often enough in putchar32 to even matter.
Fair enough, let's test the theory.
To do this, we'll use /bin/cat to display a very long text file on the terminal. Each character that is drawn and causes a call to putchar32, will invoke our bit-shifting code in the place where we originally put it for the first test of the colour shifting effect in the Candlelit Console patchset.
We'll perform the tests in single user mode, repeat them a total of five times, and discard the timings from the first run because the text file would have been read from disk rather than from the buffer cache.
First, let's test performance without the bit-shifting code. Times listed are in seconds:
Kernel without bitshifting code: Run # real user system notes 1 5.40 0 5.34 values discarded 2 5.27 0 5.28 3 5.27 0 5.28 4 5.32 0 5.33 5 5.37 0 5.38 Low: 5.27 0 5.28 High: 5.37 0 5.38 Mean average: 5.3075 0 5.3175
Now, with the inline assembler code added to putchar32:
Kernel with asm bitshifting code Run # real user system notes 1 5.54 0 5.49 values discarded 2 5.32 0 5.33 3 5.27 0 5.27 4 5.28 0 5.28 5 5.33 0 5.33 Low: 5.27 0 5.27 High: 5.33 0 5.33 Mean average: 5.3000 0 5.3025
Surprisingly, the benchmark runs very slightly faster with the extra opcode in place! The conclusion that we can probably draw from this result is that the difference is so small that it is within the margin of error that we are measuring on subsequent runs of the code. In other words, it's `in the noise'.
Now, we'll test the C version of the code:
Kernel with C bitshifting code Run # real user system notes 1 5.40 0 5.33 values discarded 2 5.30 0 5.31 3 5.51 0 5.51 4 5.43 0 5.43 5 5.52 0 5.54 Low: 5.30 0 5.31 High: 5.52 0 5.54 Mean average: 5.4400 0 5.4475
The C implementation genuinely seems to be slower by about 2.5%.
Our quick benchmarks seem to show that there really is performance to be gained by using hand-written in-line assembler in this particular case.
It's important to note, though, that moving the code from the putchar function to the device colour map initialisation function effectively makes this a moot point, as the bit-shifting is then no longer being done on a character by character basis, and so performance of the frequently called putchar function should remain unaffected.
In general, micro-optimisations are best avoided if they detract in any significant way from the readability, clarity, or quality of the source code. In the case of the Candlelit Console patchset, a different approach eliminates the need for the in-line assembly, and we get the best of both worlds, (performance and clearer source code). We left the assembly code in the final version of the patch mainly as an example for readers who are learning about kernel programming by reading it.
Nevertheless, there are times when absolute performance or code compactness is required, and in those moments, there is often no substitute for hand written assembler by experienced programmers. You know where to find us!
Assembler: 2, C compiler: 0
=> Home page of the Exotic Silicon gemini capsule. | Your use of this gemini capsule is subject to the terms and conditions of use.
Copyright 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023 Exotic Silicon. All rights reserved.
text/gemini; charset=utf-8
This content has been proxied by September (ba2dc).