Small Code Size on RP2040 Forth

Every since watching this nifty video on boot sector games, I've been obsessed with the idea of optimizing for code size, i.e., making the compiled code as few bytes as possible.

=> 2022-06-28 - Boot Sector Games

The short summary of the video is that there are some games that have been written which fit entirely within the 512 KB boot sector of a floppy disk, and run without an operating system first being loaded. Not great games, but working games.

Something which challenged me recently was trying to reduce the compiled code size of a small program I wrote in Mecrisp Stellaris Forth. The program was supposed to read the function bits of the RP2040 GPIO CTRL registers, and output the current function that is set for each GPIO pin, in a human readable format. There are nine function options for each pin, and there are 29 pins. The exact function in a slot varies, however, for each pin, e.g., function 1 for GPIO 0 is SPIO RX, but for GPIO 2 it is SPIO SCK, and for GPIO 8 it is SPI1 RX.

=> Screenshot of GPIO functions

Aiming for small code size, it was out of the question to simply have a table of strings, with a full description for each slot, which would require over 1KB of space just for the string table. There is quite a bit of pattern and partial repetition, however, so that it is possible to build the string for a given slot using logic and comparisions, based on the bits in the pin number. This saves a lot of code space in terms of string data, but that benefit is offset somewhat by extra comparisons and bitwise operations that must be performed.

In my initial solution, I tried to generate long, easy to read strings, such as "UART0_RX", but in the end the compiled code size came out to somewhere around 900-1000 bytes, which seemed much too large for a program with such a humble purpose. The compromise I settled on, to avoid abondoning the "human-readable" goal completely, was to have abbreviated strings limited to three characters each. So, "UART0_RX" becomes "U0R". That is not self-explanatory, but it is much easier to associate this with "UART0_RX" than if you were just looking at the raw number code from the register.

Here is the program I came up with:

\ ******************************************************************
\ gpio_diag.fs is part of the mf-rp2040 project
\ SPDX-FileCopyrightText: 2022 Christopher Howard 
\ SPDX-License-Identifier: GPL-3.0-or-later
\
\ Currently provides just the gpio_fns word.
\
\ public words:
\   gpio_fns
\ ******************************************************************

gpio_diag
marker gpio_diag

: pr1/0 if ." 1" else ." 0" then ;

$40014004 constant IO_BANK0_GPIO0_CTRL

\ ******************************************************************
\ This must be a separate word because a do loop cannot jump around
\ this much code (error "jump too far")
\
\ Legend:
\   SXR: SPI RX
\   SXT: SPI TX
\   SXC: SPI CSn
\   SXS: SPI SCK
\   UXR: UART RX
\   UXT: UART TX
\   UXC: UART CTS
\   UXS: UART RTS
\   IXC: I2C SCL
\   IXD: I2C SDA
\   PXA: PWM A
\   PXB: PWM B
\   SIO: SIO
\   PIX : PIO
\   CIX: CLOCK GPIN
\   COX: CLOCK GPOUT
\   NFC: NO FUNCTION
\   UOD: USB OVCUR DET
\   UVD: USB VBUS DET
\   UVE: USB VBUS EN
\ ******************************************************************
: gpio_fns' ( pin fn# -- )
    case 1 of ." S" dup %1000 and pr1/0
            %11 and case 0 of ." R"
                endof 1 of ." C" endof %10 of ." S" endof ." T"
            endcase endof
        2 of ." U" dup 3 > over 12 < and over dup 19 > swap 28 <
            and or pr1/0 %11 and case 0 of ." T" endof
                1 of ." R" endof %10 of ." C" endof ." S"
            endcase endof
        3 of ." I" dup %10 and pr1/0
            %1 and if ." C" else ." D" then endof
        4 of ." P" dup 1 rshift %111 and $30 + emit
            %1 and if ." B" else ." A" then endof
        5 of drop ." SIO" endof
        6 of drop ." PI0" endof
        7 of drop ." PI1" endof
        8 of ." C"
            case 20 of ." I0" endof
                21 of ." O0" endof
                22 of ." I1" endof
                23 of ." O1" endof
                24 of ." O2" endof
                25 of ." O3" endof
                ." NFC" endcase endof
        9 of ." U" 3 mod case
                0 of ." OD" endof
                1 of ." VD" endof
                ." VE" endcase endof endcase ;

\ ******************************************************************
\ Print out the function currently set for each pin
\ ******************************************************************
: gpio_fns ( -- )
    30 0 do
        i . i 8 * IO_BANK0_GPIO0_CTRL + @ %11111 and i swap
        gpio_fns' cr loop ;

behead pr1/0
behead IO_BANK0_GPIO0_CTRL
behead gpio_fns'

The two main words are GPIO_FNS' and GPIO_FNS, which compile to 582 bytes and 80 bytes, with a few more bytes for the other words, as well as the overhead for each entry in the dictionary. That was not a satisfying result, as I feel like this sort of program shouldn't be needing more than 200 or 300 bytes. But I couldn't think of any practical ways to reduce the code size further. I asked on #mecrisp, but nobody there had any ideas either, although somebody asked if Mecrisp Forth had some kind of BETWEEN word which could replace this code:

3 > over 12 < and over dup 19 > swap 28 < and or

But sadly, no such word was available. I could of course define such a word, but since (at present) I need to use it in only one place, it would actually increase the code size to factor it out, due to the dictionary entry overhead.

Something I was really wondering about was if there was some other way to print out the characters which would be more code-size efficient. However, it seemed that using the ." and " words took up less space than all the other methods I knew of. To explain how that works, see this disassembly:

Here is a word which prints out just the letter A, using the ." and " words:

: foo ." A" ;  ok.
see foo 
20020392: B500  push { lr }
20020394: F7E2  bl  20002878  -->  .' A'
20020396: FA70  
20020398: 4101  
2002039A: BD00  pop { pc }
 ok.

Not counting the push and pop, that is four bytes for a branch instruction (no doubt the code which handles the following string). Then one byte for the length of the string (01h) and then the character itself (41h). Here is the same word defined using a simple emit:

: foo $41 emit ; Redefine foo.  ok.
see foo 
200203A6: 3F04  subs r7 #4
200203A8: 603E  str r6 [ r7 #0 ]
200203AA: 2641  movs r6 #41
200203AC: B500  push { lr }
200203AE: F7E2  bl  20002480  --> emit
200203B0: F867  
200203B2: BD00  pop { pc }
 ok.

Not counting the push or pop, we have four bytes for the branch instruction to the EMIT word, and there are six more bytes worth of instructions for dropping $41 on the top of the stack. So, the first approach will be better in terms of code size, even for a single-character string.

I should think that this would be a sensible place for the Mecrisp compiler to do an optimization: say, just put the number in some other scratch register, and call a special version of EMIT which uses that value. But I imagine that optimization is easier to talk about than to code into the compiler.

Another optimization I was wondering about, was if it would be possible to replace the case statement with something that has a smaller footprint. The case statement works by comparing Top Of Stack (TOS) with a case value. If it doesn't match, branch to the next comparison. If it does, keep running code, and eventually there is a branch instruction which takes you to the end of the whole construction. For example:

20020598: 2E06  cmp r6 #6
2002059A: D106  bne 200205AA
2002059C: CF40  ldmia r7 { r6 }
2002059E: CF40  ldmia r7 { r6 }
200205A0: F7E2  bl  20002878  -->  .' PI0'
200205A2: F96A  
200205A4: 5003  
200205A6: 3049  
200205A8: E066  b 20020678

However, in my situation, the case numbers are all in a whole number sequence. So, I should be able to just store an array of vectors to the different code blocks, then use the appropriate vector. I believe this would lower code size overall, since I would basically be adding only a single address for each two CMP and BNE instructions which I removed.

That sounds good, but I don't know how to accomplish that without either rewriting the whole GPIO_FNS' word in assembly, or coming up with a fancy Forth compile-time word which somehow builds the array and all the other branching code. Both ideas sound daunting, but perhaps this could be my next mini-project...

Comments

Alaskalinuxuser, 2022-07-05

I just looked up bootchess, it is only 487kb. Amazing how small a program can be!

Proxy Information
Original URL
gemini://librehacker.com/gemlog/tech/20220702-0.gmi
Status Code
Success (20)
Meta
text/gemini
Capsule Response Time
814.318513 milliseconds
Gemini-to-HTML Time
1.118665 milliseconds

This content has been proxied by September (ba2dc).