Ancestors

Written by Rob Napier on 2024-10-19 at 12:28

Yesterday I finally solved the third-hardest discrete bug I’ve worked on in my career. It took about 3.5 weeks of work. At the end, I spent 10 consecutive days doing nothing but investigating this bug, all day, every day. I then took a day off to pick up my folks in Pennsylvania because they wrecked their car (everyone is fine, just car damage). Came back, and yesterday I finally found it.

1/

=> More informations about this toot | More toots from cocoaphony@mastodon.social

Written by Rob Napier on 2024-10-19 at 12:32

The bug was in Android code that worked fine on arm64 and x86, but crashed on x86_64. Heisenbug: adding logs would change its behavior. All the relevant bits were deep in C++ code, 3 layers of modules from the Java, so a debugger is basically impossible. Just logs.

Different modules have to be built on different machines with different build systems and assembled by hacking .so files into the APK by hand. And the results had to be tested on a Windows box for x86_64.

2/

Toot

Written by Rob Napier on 2024-10-19 at 12:36

Logs indicated that on x86_64, even incredibly simple function calls would corrupt their parameters. Local, constant strings would log as garbage, and sometimes just logging them would crash the app.

I figured it was a compiler setting mismatch. Maybe something like Microsoft parameter passing conventions, though clearly it wasn’t that. Checked for weird 64/32-bit mismatches. Maybe an NDK mismatch. Found a mismatch on the version of Android targeted, but that wasn’t it.

3/

Descendants

Written by Rob Napier on 2024-10-19 at 12:44

Finally realized the Heisenbug nature. Code that worked stopped working when I added logs.

Tried calling __android_log_print directly rather than using the LOGE macro and the corruption went away.

It was the logger the entire time.

The default Log() function that came with the 3rdparty code passed a va_list to printf(). It had been modified to also pass the same va_list to android_log_print. Reusing a va_list after another function has consumed it, without making a va_copy first, is undefined behavior. On x86_64 it led to crashes.

Quick fix.
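
A minimal sketch of what such a fix might look like, assuming a logger that forwards one set of variadic arguments to two consumers. The names log_both, console_buf and android_buf are hypothetical and not taken from the code discussed in this thread; the two buffers stand in for the console and the Android log so the example runs anywhere. A plausible explanation for the architecture dependence is that on x86_64 a va_list is an array type that decays to a pointer when passed, so the callee advances the caller's state, while on ABIs where it is passed by value the reuse can appear to work.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical sinks standing in for the console and the Android log. */
static char console_buf[256];
static char android_buf[256];

/* Hypothetical logger: formats one message into two sinks. */
static void log_both(const char *fmt, ...)
{
    va_list args;
    va_list args_copy;

    va_start(args, fmt);
    va_copy(args_copy, args);   /* copy before args is consumed */

    /* First consumer. After this call the value of args is indeterminate,
       so the only thing left to do with it is va_end. */
    vsnprintf(console_buf, sizeof console_buf, fmt, args);
    va_end(args);

    /* Second consumer gets its own untouched copy. On Android this call
       would be something like __android_log_vprint(prio, tag, fmt, args_copy). */
    vsnprintf(android_buf, sizeof android_buf, fmt, args_copy);
    va_end(args_copy);
}
```

Calling log_both("%s=%d", "code", 42) fills both buffers with the same formatted text. Passing args to the second vsnprintf instead of args_copy would reproduce the kind of undefined behavior described above.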

/fin

Written by Miguel de Icaza ᯅ🍉 on 2024-10-19 at 12:48

@cocoaphony Jesus fucking Christ.

=> More informations about this toot | More toots from Migueldeicaza@mastodon.social

Written by Martin De Wulf on 2024-10-19 at 12:51

@cocoaphony good catch!

=> More informations about this toot | More toots from madewulf@mastodon.social

Written by John Gordon on 2024-10-19 at 12:56

@cocoaphony Great description! What did it tell you about the testing of that 3rd party code?

=> More informations about this toot | More toots from jgordon@appdot.net

Written by Rob Napier on 2024-10-19 at 13:04

@jgordon well, the 3rdparty code itself was fine. The bug was in a patch we wrote to make it log on Android. And it does work, but it’s undefined behavior. And that undefined behavior doesn’t work on one processor that’s very rarely used in phones. So it originally was reported as an Android Auto bug. Took a long time to realize it was the architecture not the OS.

Written by Sean Heber on 2024-10-19 at 14:57

@cocoaphony woooow!

=> More informations about this toot | More toots from bigzaphod@mastodon.social

Written by Craig Hockenberry on 2024-10-19 at 15:23

@cocoaphony DID YOU TRY UNPLUGGING IT AND PLUGGING IT AGAIN CONGRATS

=> More informations about this toot | More toots from chockenberry@mastodon.social

Proxy Information
Original URL
gemini://mastogem.picasoft.net/thread/113334116927988175