Another confirmed case of bit rot came to my attention on the dpreview forum:
https://www.dpreview.com/forums/post/63994949
Previous related discussion: @koshy's A710 and the refresh ROM discussion.
A dpreview user had a used, seemingly functional SX260HS 100b, except that the jogdial did not work to adjust Tv, Av etc. They initially weren't certain if this was a setting or normal behavior, but there is no other Canon UI to adjust these values.
Interestingly, they tried CHDK and found the jogdial did work to navigate the CHDK menu, and reasonably concluded this meant the hardware was OK.
My initial attempts to diagnose this were:
1) Check that the MMIO values seemed normal. I thought maybe failing hardware would jitter or give larger jumps in values in a way that the Canon firmware didn't like but still worked in CHDK. However, the values seemed totally normal, incrementing cleanly by 1 for each dial step and not changing when the dial was not in use.
2) Check if CHDK wheel_* functions worked to send jogdial input to the Canon firmware. If they worked, it would be a potential workaround, even if the underlying problem were not understood. The user reported these did not work. This turned out to be a red herring because the wheel_ functions were broken in the CHDK port.
This didn't really make sense to me, but suggested a firmware problem, so I got them to send me a firmware dump. Comparing it to the 100b dump from the dumps repository using HxD, I found a single bit difference at offset 0x4e2d4: 0x0e in the reference dump was 0x2e in the user's dump.
This offset is low enough to be clearly in the code that shouldn't vary between cameras with the same Canon firmware, and the string RotaryEncoder is nearby.
The next differences are a long run of values starting at 0xae00a8, which is high enough to probably be one of the normally varying areas, and confirms the rest of the main code is probably fine.
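For anyone wanting to repeat this kind of comparison without a hex editor, here is a minimal Python sketch. The function names and the max_gap value are just illustrative choices, and the demo runs on synthetic data rather than real dumps.

```python
def diff_dumps(ref: bytes, other: bytes):
    """Yield (offset, ref_byte, other_byte, bits_flipped) for each differing byte.

    Only the common length is compared; zip() stops at the shorter input.
    """
    for offset, (a, b) in enumerate(zip(ref, other)):
        if a != b:
            yield offset, a, b, bin(a ^ b).count("1")


def group_runs(offsets, max_gap=16):
    """Merge nearby differing offsets into (start, end, count) runs.

    max_gap is an arbitrary illustrative threshold, not a known property
    of any firmware layout.
    """
    runs = []
    for off in offsets:
        if runs and off - runs[-1][1] <= max_gap:
            start, _, count = runs[-1]
            runs[-1] = (start, off, count + 1)
        else:
            runs.append((off, off, 1))
    return runs


# Synthetic demo: one flipped bit in otherwise identical data.
reference = bytearray(0x100)
reference[0x54] = 0x0E
suspect = bytearray(reference)
suspect[0x54] = 0x2E

for offset, a, b, bits in diff_dumps(bytes(reference), bytes(suspect)):
    print(f"offset {offset:#x}: {a:#04x} -> {b:#04x} ({bits} bit(s) differ)")
```

With real dumps, read each file with open(path, "rb").read() and pass the results to diff_dumps. An isolated one-bit difference in a low offset is the kind of result to look at closely, while long runs reported by group_runs are more likely to be areas that normally vary per camera.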
Disassembling from ff04e2d0
reference dump:
ldr r2, =0xff04de44
mov r1, #0xe
mov r0, r3
bl sub_0068a7bc ; RegisterInterruptHandler
user's dump:
ldr r2, =0xff04de44
mov r1, #0x2e
mov r0, r3
bl sub_0068a7bc ; RegisterInterruptHandler
In other words, an interrupt handler related to the dial gets registered for the wrong (probably non-existent) interrupt 0x2e, so it never sees any interrupts. CHDK is unaffected, because we just poll the MMIO. (side note, I had no idea the stock firmware jogdial stuff was interrupt driven. That would provide an alternate way to block or intercept jogdial stuff)
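The arithmetic tying the hex editor offset to the disassembly can be checked with a few lines of Python. This sketch assumes the dump is mapped at 0xff000000 (inferred from offset 0x4e2d4 lining up with the listing at ff04e2d0) and that the instructions are 4 byte ARM instructions; the helper names are made up for illustration.

```python
ROM_BASE = 0xFF000000  # assumption: inferred from the offset/address pair in this post
ARM_INSN_SIZE = 4      # assumption: 32-bit ARM instructions, not Thumb


def offset_to_address(offset, base=ROM_BASE):
    """Convert a file offset in the dump to a ROM address."""
    return base + offset


def flipped_bits(a, b):
    """Return the bit positions (0 = least significant) that differ between two bytes."""
    x = a ^ b
    return [i for i in range(8) if x & (1 << i)]


# The differing byte sits at the start of the second instruction in the listing.
print(hex(offset_to_address(0x4E2D4)))   # 0xff04e2d4
print(hex(0xFF04E2D0 + ARM_INSN_SIZE))   # 0xff04e2d4
print(flipped_bits(0x0E, 0x2E))          # [5]
```

So the changed byte is the immediate of the mov r1 instruction, and the two values differ only in bit 5 (0x20).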
That code is in the function ff04e23c, which is the first call from task_RotaryEncoder, clearly an initialization function. The RotaryEncoder string seen nearby is the task name.
Since we already hook this task for jogdial support, it's trivial* to work around by including the correct code of that sub in the jogdial task hook. Patch is attached for future reference (the only real change is in code_gen.txt)
Currently, the user is opting to just use the CHDK patch, not attempting to fix the ROM.
Fixing the ROM seems like it should be straightforward, but I see two potential concerns:
1) My understanding is that writing flash involves an erase/write cycle of a larger block. (edit: see srsa_4c comment below) If any of the code in the affected block were to be accessed while that was in progress, it seems like Bad Things could happen. If the firmware locks out interrupts and task switching for the duration, it should be safe as long as it's not near the actual flash writing code.
2) If the flash itself is actually failing rather than just faded, one can imagine trying to write it might make things worse, like the whole block getting corrupted, or the camera detecting a hardware error in the write process.
Some final observations:
Bit rot is real. If a camera is doing something weird, comparing firmware dumps is a good idea.
SX260 was released in 2012, while A710 was 2006, so even without knowing exactly when each was manufactured or when the failure occurred, this is likely a lower time to failure than we saw before. 8 years doesn't seem great.
This was a remarkably "lucky" failure. I initially dismissed firmware corruption as unlikely, because I expected it to be much more likely to cause a crash than cleanly disabling a specific feature. In fact, I think if 0x200 or higher had flipped, this would have triggered an assert due to RegisterInterruptHandler failing, and if it had been low enough to conflict with another used interrupt, that would also potentially be catastrophic.
A camera that crashed on boot would most likely be thrown out and never come to our attention. Conversely, if you encounter a camera that doesn't appear to boot, it's probably worth throwing in a diskboot.
A convenient way to check known-constant parts of the firmware would have value. This could be automated to some extent using the python library I made for the Ghidra memory map plugin.
Caveat: We don't actually know that the dumps in the repository aren't corrupted, only that they were functional enough to make a dump!
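As a rough idea of what an automated check of known-constant regions could look like, here is a sketch in plain Python. It does not use the Ghidra memory map library mentioned above; the region names and ranges are hypothetical placeholders, and the data is synthetic.

```python
import hashlib


def region_digest(data: bytes, start: int, end: int) -> str:
    """SHA-256 of data[start:end], as a hex string."""
    return hashlib.sha256(data[start:end]).hexdigest()


def check_regions(dump: bytes, reference: bytes, regions):
    """Return the names of regions whose contents differ from the reference.

    regions is a list of (name, start, end) tuples with end exclusive.
    """
    bad = []
    for name, start, end in regions:
        if region_digest(dump, start, end) != region_digest(reference, start, end):
            bad.append(name)
    return bad


# Hypothetical region table; real ranges would have to come from
# knowledge of which parts of a given firmware are constant.
REGIONS = [("code_low", 0x000, 0x400), ("code_high", 0x400, 0x800)]

# Synthetic demo: flip one bit inside the first region.
reference = bytes(range(256)) * 16
corrupted = bytearray(reference)
corrupted[0x2D4] ^= 0x20

print(check_regions(bytes(corrupted), reference, REGIONS))  # ['code_low']
```

Using digests rather than comparing slices directly means known-good values could be stored and shared without shipping a full reference dump, though per the caveat above a digest is only as trustworthy as the dump it was computed from.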
* Although the workaround is trivial, I still managed to mess it up once: I neglected to set the correct length in the codegen FUNC command, which resulted in the preceding loop using a sub_ instead of loc_ and bypassing the fix completely.