I recently had the unfortunate experience of having to reverse engineer a large Erlang application. So, when I saw haxrob's entries [1] to BGGP5[2], I decided to write some Erlang of my own. This blog post will start out with an improvement to haxrob's submission which saves 24 bytes, and then it will move on to BEAM bytecode golfing.
-module(y). -export([s/0]). s()->ssl:start(),inets:start(),{_,{_,_,B}}=httpc:request("https://binary.golf/5/5"),io:put_chars(B). ----- output ----- $ erlc y.erl $ erl -noshell -s y s Another #BGGP5 download!! @binarygolf https://binary.golf
-module(q). -export([q/0]). q()->ssl:start(),inets:start(),q:q(httpc:request("http://binary.golf/5/5")). ----- output ----- $ erlc q.erl $ erl -noshell -s q q {"init terminating in do_boot",{undef,[{q,q,[{ok,{{"HTTP/1.1",200,"OK"},[{"cache-control","max-age=600"},{"connection","keep-alive"},{"via","1.1 varnish"},{"accept-ranges","bytes"},{"age","0"},{"server","GitHub.com"},{"vary","Accept-Encoding"},{"content-length","58"},{"content-type","application/octet-stream"},{"last-modified","Fri, 21 Jun 2024 13:56:50 GMT"},{"access-control-allow-origin","*"},{"x-proxy-cache","MISS"},{"x-cache","HIT"},{"x-cache-hits","0"}],"Another #BGGP5 download!! @binarygolf https://binary.golf\n"}}],[]},{init,start_em,1,[]},{init,do_boot,3,[]}]}} init terminating in do_boot ({undef,[{q,q,[{_}],[]},{init,start_em,1,[]},{init,do_boot,3,[]}]})
Both versions start out the same. The -module attribute is required for all Erlang modules, and the -export attribute is what allows us to actually call the function. Both modules also define a single-letter function name, and start by calling ssl:start() and inets:start(). Both modules also use httpc:request to download the web page. haxrob's submission extracts the body of the request and display it using io:putchars, and while this is a short way to generate a clean output, it costs quite a few bytes. As it turns out, if you try to call an undefined external function, the error message will nicely format the arguments. That means displaying the message is as simple as wrapping the httpc:request() call in an invalid call to a nonexistent function, such as q:q/1 in my submission. Finally, httpc follows redirects so I was able to save a byte by replacing "https" with "http" in the url.
The file produced by erlc is a BEAM bytecode file. The format of this file is based on "EA IFF 85", which is a generic TLV container format consisting of multiple sections or "chunks". Each chunk is identified by a 32-bit tag and can be up to UINT32_MAX bytes long. There is an additional "FOR1"/"FORM" header at the beginning of the file, which contains the length of the entire file.
The .beam file produced by erlc contains a signifcant amount of unneeded information used for debugging and metadata. The BEAM format[3][4] seemed simple enough, so I decided to start by trying to remove unneeded sections. After some experimentation, I found that the program only required 6 sections to function. This worked quite well, and I was immediately able to reduce the file from 740 to 384 bytes.
import gzip, sys def align4(n): return (n+3)&0xfffffffc def repack(secs): def wdword(n, length=4): return n.to_bytes(length=length, byteorder='big') body=b'BEAM' for i in secs: body+=i[0]+wdword(i[1])+i[2] return b'FOR1'+wdword(len(body))+body fd=open(sys.argv[1],"rb") # BEAM modified FORM header assert b'FOR1' == fd.read(4) fd.read(4) assert b'BEAM' == fd.read(4) secs=[] while True: tag=fd.read(4) if tag==b'': break length=int.from_bytes(fd.read(4), byteorder='big') body=fd.read(align4(length)) assert len(body)==align4(length) if tag in [b'AtU8', b"Code", b"ImpT", b"ExpT", b"StrT", b"LitT"]: print(f"Keeping tag {tag.decode()} [{length}]") secs.append((tag, length, body)) else: print(f"Removing tag {tag.decode()} [{length}]") rp=repack(secs) print(f"Original size: {fd.tell()}") print(f"New size: {len(rp)}") if len(sys.argv)>=3: open(sys.argv[2],"wb").write(rp) ----- $ python3 beamstrip.py q.beam q.beam.stripped Keeping tag AtU8 [71] Keeping tag Code [86] Keeping tag StrT [0] Keeping tag ImpT [76] Keeping tag ExpT [40] Keeping tag LitT [46] Removing tag Meta [29] Removing tag LocT [4] Removing tag Attr [40] Removing tag CInf [83] Removing tag Dbgi [88] Removing tag Line [21] Removing tag Type [26] Original size: 740 New size: 384
While looking at the stripped file, I noticed some extra functions had been added by the compiler. The module_info/0 and module_info/1 functions are wrappers around erlang:get_module_info, and are not needed to run the program. I was able to ask the compiler to create an Erlang assembly file using the -S flag
$ erlc -S q.erl $ cat q.S {module, q}. %% version = 0 {exports, [{module_info,0},{module_info,1},{q,0}]}. {attributes, []}. {labels, 7}. {function, q, 0, 2}. {label,1}. {line,[{location,"q.erl",3}]}. {func_info,{atom,q},{atom,q},0}. {label,2}. {allocate,0,0}. {call_ext,0,{extfunc,ssl,start,0}}. {call_ext,0,{extfunc,inets,start,0}}. {move,{literal,"http://binary.golf/5/5"},{x,0}}. {call_ext,1,{extfunc,httpc,request,1}}. {call_ext_last,1,{extfunc,q,q,1},0}. {function, module_info, 0, 4}. {label,3}. {line,[]}. {func_info,{atom,q},{atom,module_info},0}. {label,4}. {move,{atom,q},{x,0}}. {call_ext_only,1,{extfunc,erlang,get_module_info,1}}. {function, module_info, 1, 6}. {label,5}. {line,[]}. {func_info,{atom,q},{atom,module_info},1}. {label,6}. {move,{x,0},{x,1}}. {move,{atom,q},{x,0}}. {call_ext_only,2,{extfunc,erlang,get_module_info,2}}.
I removed all code from the first module_info declaration down, and removed these functions from the exports table at the top. I also experimented with removing other instructions, but was only able to remove the line instruction directly below the first label. I also adjusted the labels attribute, although this doesn't seem to be required. This resulted in the follwing assembly file:
{module, q}. %% version = 0 {exports, [{q,0}]}. {attributes, []}. {labels, 3}. {function, q, 0, 2}. {label,1}. {func_info,{atom,q},{atom,q},0}. {label,2}. {allocate,0,0}. {call_ext,0,{extfunc,ssl,start,0}}. {call_ext,0,{extfunc,inets,start,0}}. {move,{literal,"http://binary.golf/5/5"},{x,0}}. {call_ext,1,{extfunc,httpc,request,1}}. {call_ext_last,1,{extfunc,q,q,1},0}.
This compiles to a 592 byte file, which goes down to 264 bytes after stripping unused sections:
$ erlc q.S $ python3 beamstrip.py q.beam q.beam Keeping tag AtU8 [36] Keeping tag Code [49] Keeping tag StrT [0] Keeping tag ImpT [52] Keeping tag ExpT [16] Keeping tag LitT [46] Removing tag LocT [4] Removing tag Attr [40] Removing tag CInf [97] Removing tag Dbgi [88] Removing tag Line [20] Removing tag Type [26] Original size: 592 New size: 264
At this point, I had optimization ideas that didn't seem possible with compiler-enforced validation. Most of the sections are fairly easy to generate. Each section begins with a 4-byte tag, followed by a 4-byte length, and 4-byte-aligned payload.
The Atom chunk contains the names of each atom used by the program. Atoms are special strings such as module names and function names. When referencing a function such as ssl:start/0, the Atom section will contain "ssl" and "start", and these strings will be referenced by an index into the atom table. Each atom is a length-prefixed string up to 256 bytes, and the total count of atoms is stored at the start of the chunk. The Atom chunk was historically identified by the "Atom" tag, but this seems to have been superceded by the "AtU8" tag, indicating a UTF-8 Atom table. Lastly, the first entry in the Atom table must be the module name corresponding to the beam file.
This section contains certain string literals. You may assume that this would be where the "https://binary.golf/5/5" string would be stored, but in fact that's actually stored in the literal table (section 3.5). The String Table must be present, but can be empty in this case. An empty string table is simply b"StrT\0\0\0\0"
.
Erlang function signatures are in the form [module]:[function]/arity
. The module and function name are stored in the Atom table, and the arity is an integer corresponding to the number of arguments the function expects. The Import Table begins with the count of imports, followed by a sequence of (module, function, arity) fields (32-bit per field, or 12 bytes per import). The module and function names are indices into the atom table, starting at 1 for the first element.
The Export Table is similar to the Import Table, but instead of storing the module name, the label of the function entrypoint is stored. The section starts with a count of exports, followed by a sequence of (function, arity, label) fields (32-bit per field, or 12 bytes per export).
The Literal Table consists of an Erlang External Term Format (ETF) encoded list of literals. Unlike the string table or atom table, ETF encodes type information as well as value. The encoded data is compressed with Zlib, and the uncompressed size is stored alongside the compressed data. This already encoded only a single literal (the URL), and I found no way to shrink it further, so I simply copied the raw body of the literal table from my previous .beam files.
The Code section is much more complicated than other sections, and I found The BEAM Book [6] to be very helpful for understanding the instruction format. The code section begins with a header containing the following 32-bit fields:
This header is immediately followed by the bytecode.
The first optimization focused on the imports table. I realized I could remove the import of q:q/1
by calling httpc:request/1 with the non-string HTTP response returned by httpc:request/1. The error message contained the response headers and body just like the previous error message. This also allowed me to remove the q
atom from the atoms table by renaming the module and main function to match other atoms. I went with renaming the main function to start:inets/0
.
Next, I worked on optimizing the bytecode, and I was able to save a byte by replacing the call_ext_last at the end of the function with a call_ext, and removing the allocate instruction. The compiler enforced the consistency of the memory allocation, but the BEAM VM doesn't seem to care.
These optimizations bring the file size down to 248 bytes.
After doing all of this work, I found out that .beam files support compression. Gzipping the file saves another 41 bytes, bringing the final size down to 207 bytes.
import gzip def wdword(n, length=4): return n.to_bytes(length=length, byteorder='big') def mk_atom(atoms): atb=wdword(len(atoms)) for i in atoms: atb+=wdword(len(i), 1) atb+=i atom=b'AtU8'+wdword(len(atb))+atb return atom def mk_code(instrs): codehdr=b'' codehdr+=wdword(16) # infosz?? codehdr+=wdword(0) # version codehdr+=wdword(169) # opcode_max, could be set lower codehdr+=wdword(3) # labels codehdr+=wdword(1) # fcount return b'Code'+wdword(len(codehdr)+len(instrs))+codehdr+instrs def mk_impt(imps, atoms): def parse_imp_sig(imp): mod, meth=imp.split(":") meth, arity=meth.split("/") return mod.encode(), meth.encode(), int(arity) data=wdword(len(imps)) for i in imps: mod, meth, arity=parse_imp_sig(i) data+=wdword(atoms.index(mod)+1) data+=wdword(atoms.index(meth)+1) data+=wdword(arity) impt=b'ImpT'+wdword(len(data))+data return impt def mk_expt(exps, atoms): data=wdword(len(exps)) for i in exps: data+=wdword(atoms.index(i[0])+1) data+=wdword(i[1]) data+=wdword(i[2]) expt=b'ExpT'+wdword(len(data))+data return expt # all args must be constant values in range(0,16) def inst(op, *args): x=bytearray() x.append(op) for i in args: assert 0<=i<16 x.append(i<<4) return bytes(x) def call(name, funcs): n=int(name.split("/")[1]) fn=funcs.index(name) return inst(7, n, fn) def func_info(mod, func, arity, atoms): m=atoms.index(mod)+1 f=atoms.index(func)+1 data=bytearray() data.append(2) data.append((m<<4)|2) data.append((f<<4)|2) data.append(arity<<4) return bytes(data) def bytecode(atoms, imports, exports): b=b'' # main() b+=inst(1, 1) # label 1 b+=func_info(b'start', b'inets', 0, atoms) b+=inst(1, 2) # label 2 b+=call('ssl:start/0', imports) b+=call('inets:start/0', imports) mov =b'\x40' # MOV mov+=b'\x47' # LITERAL mov+=b'\x00' # literal idx mov+=b'\x03' # reg X0 b+=mov # mov(lit[0], X0) b+=call('httpc:request/1', imports) b+=call('httpc:request/1', imports) b+=b'\x03' # int_code_end return b def pad4(data): if len(data)&3 ==0: return data return data + b'\0'*(4-(len(data)&3)) def create(): # module: start # function: inets atoms=[b'start', b'ssl', b'inets', b'httpc', b'request'] imports=["ssl:start/0", "inets:start/0", "httpc:request/1"] exports=[(b"inets", 0, 2)] # compressed literal table, borrowed from q.beam litt=bytes.fromhex("789c63606060646060906ace6610cb282929b0d2d74fcacc4b2caad44bcfcf49d337d53705007842089b") data=b'BEAM' data+=pad4(mk_atom(atoms)) data+=pad4(mk_code(bytecode(atoms, imports, exports))) data+=b'StrT'+b'\0\0\0\0' # empty string table data+=pad4(mk_impt(imports, atoms)) data+=pad4(mk_expt(exports, atoms)) data+=pad4(b'LitT'+wdword(len(litt)+4)+wdword(0x22)+litt) return b'FOR1'+wdword(len(data))+data with open("start.beam", "wb") as fd: fd.write(gzip.compress(create()))
$ sha256sum start.beam ace72aaa0f9c546e4249e974520e71df55973b8ffbff347a193b237a090236fa start.beam $ erl -noshell -s start inets Description: "Server authenticity is not verified since certificate path validation is not enabled" Reason: "The option {verify, verify_peer} and one of the options 'cacertfile' or 'cacerts' are required to enable this." {"init terminating in do_boot",{badarg,[{unicode,characters_to_list,[{ok,{{"HTTP/1.1",200,"OK"},[{"cache-control","max-age=600"},{"connection","keep-alive"},{"via","1.1 varnish"},{"accept-ranges","bytes"},{"age","0"},{"server","GitHub.com"},{"vary","Accept-Encoding"},{"content-length","58"},{"content-type","application/octet-stream"},{"last-modified","Fri, 21 Jun 2024 13:56:50 GMT"},{"access-control-allow-origin","*"},{"x-proxy-cache","MISS"},{"x-cache","MISS"},{"x-cache-hits","0"}],"Another #BGGP5 download!! @binarygolf https://binary.golf\n"}}],[{file,"unicode.erl"},{line,895},{error_info,#{module=>erl_stdlib_errors}}]},{httpc,normalize_and_parse_url,1,[{file,"httpc.erl"},{line,742}]},{httpc,do_request,5,[{file,"httpc.erl"},{line,257}]},{start,inets,0,[]},{init,start_em,1,[]},{init,do_boot,3,[]}]}} init terminating in do_boot ({badarg,[{unicode,characters_to_list,[{_}],[{_},{_},{_}]},{httpc,normalize_and_parse_url,1,[{_},{_}]},{httpc,do_request,5,[{_},{_}]},{start,inets,0,[]},{init,start_em,1,[]},{init,do_boot,3,[]}]}) Crash dump is being written to: erl_crash.dump...done
Some headers have been removed from the output for privacy.
464f5231 | "FOR1" FORM header 000000f0 | Filesize 4245414d | "BEAM" magic -- Atom table -- 41745538 | "AtU8" atom table tag 00000022 | size of atom table 00000005 | number of atoms 05 7374617274 | "start" 03 73736c | "ssl" 05 696e657473 | "inets" 05 6874747063 | "httpc" 07 72657175657374 | "request" 0000 | padding -- Code section -- 436f6465 | "Code" tag 0000002d | Size of code section 00000010 | Remaining size of Code header 00000000 | Bytecode version 0 000000a9 | highest bytecode opcode 00000003 | Number of labels 00000001 | Number of functions 0110 | {label,1} 02123200 | {func_info,{atom,start},{atom,inets},0} 0120 | {label,2} 070000 | {call_ext,0,{extfunc,ssl,start,0}} 070010 | {call_ext,0,{extfunc,inets,start,0}} 40470003 | {mov,{literal,"https://binary.golf/5/5"},{x,0}} 071020 | {call_ext,1,{extfunc,httpc,request,1}} 071020 | {call_ext,1,{extfunc,httpc,request,1}} 03 | {int_code_end} 000000 | padding -- String Table -- 53747254 | "StrT" String Table tag 00000000 | String Table size -- Import Table -- 496d7054 | "ImpT Import Table tag 0000 0028 | Size of import table 0000 0003 | Number of imports [ Module | Function | Arity ] 00000002 00000001 00000000 | ssl:start/0 00000003 00000001 00000000 | inets:start/0 00000004 00000005 00000001 | httpc:request/0 -- Export Table -- 45787054 | "ExpT" Export Table tag 00000010 | Export Table size 00000001 | Number of exports [ Function | Arity | Label ] 00000003 00000000 00000002 | start:inets/0 @ label 2 (module name implied) -- Literal Table -- 4c697454 | "LitT" Literal Table tag 0000002f | Size of literal table 00000022 |Uncompressed size of literal table 789c63606060646060906ace6610 # cb282929b0d2d74fcacc4b2caad4 #-| Compressed literal table 4bcfcf49d337d53705007842089b # 0000 | padding
=> [1] haxrob's .erl BGGP5 Entry This content has been proxied by September (ba2dc).
=> [2] Binary Golf Grand Prix 5
=> [3] BEAM File Format
=> [4] File Format for BEAM R5 and Later
=> [5] Erlang BEAM VM Specification
=> [6] The Beam BookProxy Information
text/gemini;lang=en-US