A Bug Hunting Story: How Album Artwork Caused a System-Wide Loss of Audio
• Chris Liscio
Capo was not playing audio for some users on Intel Macs running Sonoma. After spending almost two weeks (and about $850), I discovered that macOS Sonoma had a rather nasty bug that was triggered by loading JPEG images.
On October 3, 2023, a long-time customer emailed support to inform us that—after upgrading to macOS Sonoma—Capo was no longer playing audio. Even worse: once Capo stopped playing audio, nothing else would produce sound on his Mac.
Unfortunately, I could not replicate this on my own machine. While I was requesting details about the user’s configuration, two more reports arrived. All the users had two things in common: Intel CPUs and macOS Sonoma. Unfortunately, my newest Intel-based test system was cut off last year when I couldn’t install macOS Ventura on it. So I rushed to the local buy & sell website and acquired a second-hand MacBook Pro 16” with an i9 CPU and 16GB RAM.
It didn’t take long to reproduce the bug, and it was nasty: I press play, and the playhead stays still. Pause, play again, and the playhead moves. Still, no sound. Strangely, the problem was limited to AAC-compressed m4a audio files, but only some of them. I failed to identify what was special about those files that refused to play: they all had the same audio format as the ones that worked correctly (stereo, 44.1kHz, etc.), and digging through the output of afinfo didn’t reveal anything concerning. Since this was an audio bug, that’s where my scrutiny of the files ended.
I started by dumping the sample values coming out of Capo’s audio engine, and noticed it was spitting out NaNs. That explains why everything was quiet, and—on some machines—took down the system’s audio. I traced the source of the invalid samples to my audio slowing engine: a third-party library that has been trouble-free for more than 10 years. When I bypassed its processing, the values were good. But the library’s not at fault: it works fine with other audio files, and consumes raw samples long after they’re decoded.
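As a rough illustration of that first step, a check like the one below can flag the first NaN in a buffer of rendered samples. This is a minimal sketch and not Capo's code: the function name first_nan_index and the assumption that samples are 32-bit floats are both hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the first NaN in `samples`,
   or -1 if every value is an ordinary number. */
long first_nan_index(const float *samples, size_t count) {
    assert(samples != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (isnan(samples[i])) {
            return (long)i;
        }
    }
    return -1;
}
```

Calling a check like this on the buffer before and after each processing stage is one way to see which stage first emits invalid samples.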
A couple of days later, I had my first breakthrough. I discovered that the audio slowing library—just before processing audio—was reporting its playback rate parameter value as NaN, and not 1.0. This sounded an awful lot like memory corruption, and I had to find out who was writing the bad value.
I wanted to set a watchpoint in the debugger to find the source of corruption, but I’m working with a binary, closed-source library. To find the address where the playback rate was stored in memory, I had to dig through the library’s disassembly. I placed a watchpoint at this address, and was disappointed to learn that it triggered only when my code called the library’s routine to set the playback rate. Nobody else was trampling on the library’s memory.
When the user presses play, Capo’s audio engine sets the current playback rate value in the library before starting playback, and that’s what triggered the watchpoint. My code called the library to store a value of 1.0 before playback started. But after I set a value of 1.0, shouldn’t the library report that its playback rate is now 1.0? Like, even if the value was corrupted magically by cosmic rays, would it not be the case that setting the value to 1.0 would kick things back into working order?
It turns out that simply asking the library to store a value of 1.0 caused the memory corruption. I set the playback rate to 1.0, and then the library reports its value as NaN. How could this be possible? Isn’t this just a matter of a CPU shifting values between registers and RAM?
When I commented out the line of code that set the playback rate, audio playback seemed to work again. However, I also noticed something odd: when I asked the library to report its playback rate, it was still NaN. HOW?!
The slowing library’s API declares its parameter values as long double, which is an 80-bit floating-point type on Intel CPUs. It turns out that this is an important detail. Specifically, 80-bit floating-point values are loaded into and stored from the x87 registers on Intel CPUs, and when the Floating Point Unit (FPU) is left in a bad state, those values get corrupted in transit between RAM and the CPU. So even though I could see a value of 1.0 stored in memory, those bytes were read into an x87 register as a NaN. And when I tried to save a value of 1.0 into memory, it was written correctly to the register, but it got stored into memory as a NaN.
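One way to confirm that the bytes in memory are fine, independently of what the FPU reports, is to compare them against a reference encoding without performing any floating-point load. The sketch below is an illustration under assumptions, not the code used in this investigation: memory_holds_one and LONG_DOUBLE_VALUE_BYTES are hypothetical names, and the 10-byte figure assumes the little-endian x87 extended format, where the remaining bytes of a long double are padding.

```c
#include <float.h>
#include <stdbool.h>
#include <string.h>

/* Number of bytes that carry value bits. With the x87 extended format
   (64-bit significand) only the first 10 bytes matter; the rest is padding
   with unspecified contents. Other formats are compared in full. */
#if LDBL_MANT_DIG == 64
#define LONG_DOUBLE_VALUE_BYTES 10
#else
#define LONG_DOUBLE_VALUE_BYTES sizeof(long double)
#endif

/* Reference encoding of 1.0, emitted by the compiler as constant data. */
static const long double kReferenceOne = 1.0L;

/* Hypothetical helper: true if the bytes at `p` encode 1.0.
   Only memcmp is used, so the comparison itself is a plain byte compare. */
bool memory_holds_one(const long double *p) {
    return memcmp(p, &kReferenceOne, LONG_DOUBLE_VALUE_BYTES) == 0;
}
```

If a check like this passes while reading the same location as a long double yields NaN, the memory is intact and the corruption is happening on the way into the register.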
OK, so now I knew what was wrong. But how on earth is the CPU getting into this state? Whose code failed to clean up after itself, and why is this only happening when I load certain m4a audio files?!
My goal now was to figure out when the FPU was left in a bad state, and hopefully I could identify the culprit from there. To do this, I had to find two points in time—one where the FPU appeared to be fine, and one where the FPU was reading garbage.
When the audio engine gets initialized, the slowing engine reports its playback speed correctly as 1.0 (Point A). Immediately after the user presses play, the slowing engine reports its playback speed as a NaN (Point B). Unfortunately, those two points in time—in terms of a program running on a modern CPU—were practically light years apart.
Again, I was convinced: this was an audio file format bug. Therefore, I focused my investigation on the other components in Capo that interacted with m4a audio files, and most of these routines live in my Music Information Retrieval (MIR) library. That’s where things like chord recognition, beat tracking, etc. load data from the audio file to perform their processing.
To check on the state of the FPU, I set a breakpoint in Xcode at some point of interest, and when the debugger pauses I copy some memory locations that get dumped at launch. Then, I call the function in the slowing library that prints its current playback speed value. If it reported a 1.0, the FPU was fine, and if the value reported NaN, it was hosed.
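I relied on the slowing library's own getter for this check. For readers without such a library at hand, a stand-in probe might look like the sketch below. It assumes that long double is the 80-bit x87 type (as it is on Intel Macs), and the names fpu_probe_ok and fpu_checkpoint are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* True if 1.0 survives a trip through memory as a long double. */
bool fpu_probe_ok(void) {
    volatile long double slot = 1.0L;  /* volatile: keep the store and the load */
    long double loaded = slot;
    return loaded == 1.0L;             /* any comparison against NaN is false */
}

/* Logs the probe result under a label and returns it. */
bool fpu_checkpoint(const char *label) {
    bool ok = fpu_probe_ok();
    fprintf(stderr, "[fpu] %s: %s\n", label, ok ? "OK" : "BAD");
    return ok;
}
```

A call such as fpu_checkpoint("after beat detector") can then be dropped at each point of interest, or invoked from the debugger while paused at a breakpoint.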
I started by trying to move Point B earlier—to find the earliest time that the FPU was in a bad state. I placed a breakpoint after all of Capo’s MIR processing was complete, and the playback rate was NaN. I then tried to move Point A later by placing a breakpoint before the MIR processing started, and the FPU was in a good state. So maybe I was on to something. I continued moving Point A (“the FPU is fine”) forwards through the chain of processing—after the beat detector was done, the key detector, the chord detector—to continue shrinking the distance between Point A and Point B. No matter how close I moved the breakpoint at Point A towards the end of the MIR processing, the FPU was still fine. None of my interactions with audio files seemed to trigger this bug.
By the end of this investigation, my code and breakpoints looked something (roughly) like this:
- (void)signalCompletionToDelegate {
    // Point A breakpoint, FPU is OK
    dispatch_sync(dispatch_get_main_queue(), ^{
        // Point B breakpoint, FPU is BAD!!
        // (…rest of the code that calls the delegate)
    });
}
Basically, I narrowed the interval between points A and B down to two lines of code. Unfortunately, there is quite a lot that executes in the app between those two lines: the dispatched block runs at a later time. Still, at least I shrank the duration of the interval from light years to centuries. Progress!
To determine what was happening between those two lines of code, I placed a pair of signposts in place of the two comments above. That allowed me to examine call stack samples within that region using the Time Profiler in Instruments. In all honesty, I have no clear recollection of exactly how I got from here to identifying the culprit, but a great deal of luck was involved.
First, I knew there was something “floating pointy” that I wanted to look for in the call stack samples within the signposted interval. I went into this with a hunch that maybe something buried in Accelerate might be related. Once I noticed one of the vPlanar* conversion functions appeared among the samples, things started to heat up.
Those are image-related routines! What the heck do images have to do with audiooooOHMYGOD!?!
What I failed to notice during my initial investigation is that those working m4a audio files had no album artwork! After some poking around, I commented out the line of code that returned an NSImage to be loaded into the album artwork view at the bottom-left of Capo’s main window. At this point, audio playback worked perfectly!
Somehow, decoding the album artwork that was extracted from these m4a audio files left the FPU in a bad state. Program execution continued fine for a while, and things didn’t go bad until the FPU was needed again. The net effect of this was that Capo’s audio engine fell apart, even though that code was worlds apart from anything related to album artwork.
Of course, I wasn’t going to “fix” this by disabling album artwork for those users with Intel Macs running macOS Sonoma and later. Not only was that the wrong solution, it was also incomplete: what if images get loaded elsewhere on the user’s Mac while they’re using Capo, and that leaves the FPU in a bad state?
To be safe, I chose the (rather gross) solution of issuing the EMMS instruction myself before I set the playback rate. However, I do so only after confirming that the FPU is in a bad state. The technique I use is similar to that of the sample code I used to demonstrate this bug to Apple in FB13282515.
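For context, the MMX registers share storage with the x87 registers, and EMMS marks every x87 register as empty again so that later floating-point loads behave normally. The sketch below shows the general shape of a check-then-reset guard. It is not the code shipped in Capo or the sample attached to the feedback report: the function names are hypothetical, and it assumes a compiler that defines __MMX__ and provides _mm_empty() in mmintrin.h, as GCC and Clang do on x86-64.

```c
#include <stdbool.h>

#if defined(__MMX__)
#include <mmintrin.h>  /* _mm_empty() compiles to the EMMS instruction */
#endif

/* True if 1.0 can be stored and reloaded as a long double without change. */
bool x87_round_trip_ok(void) {
    volatile long double slot = 1.0L;  /* volatile: keep the store and the load */
    long double loaded = slot;
    return loaded == 1.0L;             /* false when the load produced NaN */
}

/* Hypothetical guard to call before passing long double values to a library.
   Returns true if a reset was issued. */
bool reset_x87_if_corrupted(void) {
    if (x87_round_trip_ok()) {
        return false;   /* healthy: leave the FPU alone */
    }
#if defined(__MMX__)
    _mm_empty();        /* EMMS: mark every x87 register as empty */
    return true;
#else
    return false;       /* no EMMS available on this target */
#endif
}
```

Checking first keeps the guard inert on healthy systems, so the reset only runs on machines that actually exhibit the problem.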
So that was a wild ride: I started with a serious audio-related bug in Capo, and learned that loading JPEG images on Intel Macs running Sonoma may leave the FPU in a bad state. This journey definitely earns a spot on my list of all-time difficult bugs to track down.
Thanks to Mike Ash for taking a quick look at this before I posted it, and offering some helpful suggestions to improve clarity.