Another confirmed case of firmware bit rot - General Discussion and Assistance - CHDK Forum

Another confirmed case of firmware bit rot


Offline reyalp

  • ******
  • 12586
Another confirmed case of firmware bit rot
« on: 07 / June / 2020, 18:43:10 »
Another confirmed case of bit rot came to my attention on the dpreview forum: https://www.dpreview.com/forums/post/63994949

Previous related discussions: @koshy's A710 and the refresh ROM discussion.

A dpreview user had a used, seemingly functional SX260HS 100b, except that the jogdial did not work to adjust Tv, Av etc. They initially weren't certain if this was a setting or normal behavior, but there is no other Canon UI to adjust these values.

Interestingly, they tried CHDK, found the jogdial did work to navigate the CHDK menu, and reasonably concluded this meant the hardware was OK.

My initial attempt to diagnose this was
1) Check that the MMIO values seemed normal. I thought maybe failing hardware would jitter or give larger jumps in values in a way that the Canon firmware didn't like but still worked in CHDK. However, the values seemed totally normal, incrementing cleanly by 1 for each dial step and not changing when the dial was not in use.

2) Check if CHDK wheel_* functions worked to send jogdial input to the Canon firmware. If they worked, it would be a potential workaround, even if the underlying problem were not understood. The user reported these did not work. This turned out to be a red herring because the wheel_ functions were broken in the CHDK port.

This didn't really make sense to me, but suggested a firmware problem, so I got them to send me a firmware dump. Comparing it to the 100b from the dumps repository using HxD, I found a single-bit difference at offset 0x4e2d4: 0x0e in the reference dump was 0x2e in the user's dump.

This offset is low enough to be clearly in the code that shouldn't vary in cameras with the same Canon firmware, and the string RotaryEncoder is nearby :o

The next differences are a long run of values starting at 0xae00a8, which is high enough to probably be one of the normally varying areas, and confirms the rest of the main code is probably fine.
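For anyone without a hex editor handy, the comparison can be scripted. This is only a rough sketch of the idea, not what I actually ran (I used HxD): diff_runs and flipped_bits are made-up names, and the byte arrays are synthetic stand-ins for real dumps.

```python
from typing import List, Tuple


def diff_runs(ref: bytes, other: bytes) -> List[Tuple[int, int]]:
    """Return (start_offset, length) for each run of differing bytes."""
    runs = []
    start = None
    limit = min(len(ref), len(other))
    for i in range(limit):
        if ref[i] != other[i]:
            if start is None:
                start = i  # a new run of differences begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, limit - start))
    return runs


def flipped_bits(a: int, b: int) -> int:
    """Number of bit positions in which two byte values differ."""
    return bin(a ^ b).count("1")


# Synthetic stand-ins for two dumps; real ones would be read from files.
ref = bytearray(16)
ref[4] = 0x0E
usr = bytearray(ref)
usr[4] = 0x2E

for start, length in diff_runs(bytes(ref), bytes(usr)):
    print(f"offset {start:#x}, {length} byte(s) differ")
print("bits flipped:", flipped_bits(0x0E, 0x2E))
```

An isolated one-byte run low in the ROM is the suspicious kind; long runs in high areas are more likely to be per-camera data.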

Disassembling from ff04e2d0
reference dump:
Code: [Select]
    ldr     r2, =0xff04de44
    mov     r1, #0xe
    mov     r0, r3
    bl      sub_0068a7bc ; RegisterInterruptHandler
user's dump:
Code: [Select]
    ldr     r2, =0xff04de44
    mov     r1, #0x2e
    mov     r0, r3
    bl      sub_0068a7bc ; RegisterInterruptHandler
In other words, an interrupt handler related to the dial gets registered for the wrong (probably non-existent) interrupt 0x2e, so it never sees any interrupts. CHDK is unaffected, because we just poll the MMIO. (side note, I had no idea the stock firmware jogdial stuff was interrupt driven. That would provide an alternate way to block or intercept jogdial stuff)
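A toy model of why the Canon UI goes deaf while CHDK keeps working. Nothing here is firmware code: ToyInterruptController and simulate are invented for illustration, and the only assumption carried over from the dumps is that the dial hardware raises interrupt 0x0e.

```python
class ToyInterruptController:
    """Dispatches an interrupt number to whatever handler was registered for it."""

    def __init__(self):
        self.handlers = {}

    def register(self, irq, handler):
        self.handlers[irq] = handler

    def raise_irq(self, irq):
        handler = self.handlers.get(irq)
        if handler is not None:
            handler()


def simulate(registered_irq, hardware_irq=0x0E, steps=3):
    """Return (dial steps seen by the interrupt handler, dial steps seen by polling)."""
    position = {"mmio": 0}  # stands in for the dial position register
    seen_by_handler = []
    ctrl = ToyInterruptController()
    ctrl.register(registered_irq, lambda: seen_by_handler.append(position["mmio"]))

    last_polled = position["mmio"]
    polled_steps = 0
    for _ in range(steps):
        position["mmio"] += 1         # the dial moves one click
        ctrl.raise_irq(hardware_irq)  # the hardware signals its own interrupt number
        if position["mmio"] != last_polled:  # a poller just compares register values
            polled_steps += position["mmio"] - last_polled
            last_polled = position["mmio"]
    return len(seen_by_handler), polled_steps


print(simulate(0x0E))  # handler registered on the interrupt the hardware raises
print(simulate(0x2E))  # handler registered on the corrupted number
```

With the corrupted registration the handler count stays at zero while the polled count still follows the dial, which matches what the user saw.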

That code is in the function ff04e23c, which is the first call from task_RotaryEncoder, clearly an initialization function. The RotaryEncoder string seen nearby is the task name.

Since we already hook this task for jogdial support, it's trivial* to work around by including the correct code of that sub in the jogdial task hook. Patch is attached for future reference (the only real change is in code_gen.txt)

Currently, the user is opting to just use the CHDK patch, not attempting to fix the ROM.

Fixing the ROM seems like it should be straightforward, but I see two potential concerns
1) My understanding is that writing flash involves an erase, write cycle of a larger block. (edit: see srsa_4c comment below) If any of the code in the affected block were to be accessed while that was in progress, it seems like Bad Things could happen. If the firmware locks out interrupts and task switching for the duration, it should be safe as long as it's not near the actual flash writing code.
2) If the flash itself is actually failing rather than just faded, one can imagine trying to write it might make things worse, like the whole block getting corrupted, or the camera detecting a hardware error in the write process.

Some final observations:
Bit rot is real. If a camera is doing something weird, comparing firmware dumps is a good idea.

SX260 was released in 2012, while A710 was 2006, so even without knowing exactly when each was manufactured or when the failure occurred, this is likely a lower time to failure than we saw before. 8 years doesn't seem great.

This was a remarkably "lucky" failure. I initially dismissed firmware corruption as unlikely, because I expected it to be much more likely to cause a crash than cleanly disabling a specific feature. In fact, I think if 0x200 or higher had flipped, this would have triggered an assert due to RegisterInterruptHandler failing, and if it had been low enough to conflict with another used interrupt, that would also potentially be catastrophic.

A camera that crashed on boot would most likely be thrown out and never come to our attention. Conversely, if you encounter a camera that doesn't appear to boot, it's probably worth throwing in a diskboot.

A convenient way to check known-constant parts of the firmware would have value. This could be automated to some extent using the python library I made for the Ghidra memory map plugin.
Caveat: We don't actually know that the dumps in the repository aren't corrupted, only that they were functional enough to make a dump!

* Although the workaround is trivial, I still managed to mess it up once: I neglected to set the correct length in the codegen FUNC command, which resulted in the preceding loop using a sub_ instead of loc_ and bypassing the fix completely  :-[
« Last Edit: 14 / June / 2020, 22:24:05 by reyalp »
Don't forget what the H stands for.

*

Offline srsa_4c

  • ******
  • 4228
Re: Another confirmed case of firmware bit rot
« Reply #1 on: 08 / June / 2020, 13:22:59 »
Fixing the ROM seems like it should be straightforward, but I see two potential concerns
1) My understanding is that writing flash involves an erase, write cycle of a larger block.
WriteToRom does not erase the flash. If you're only "refreshing" original content (or the flash area you're writing is already erased, i.e. all bits '1'), you don't need to erase.
Quote
If any of the code in the affected block were to be accessed while that was in progress, it seems like Bad Things could happen. If the firmware locks out interrupts and task switching for the duration, it should be safe as long as it's not near the actual flash writing code.
It looks like flash commands (write enable, write word, etc) are protected with the following:
- interrupts disabled
- dcache disabled
- the routine is copied into and executed from ITCM
There appears to be no protection between flash commands. Flash is written a word (16 bits) at a time.
Quote
2) If the flash itself is actually failing rather than just faded, one can imagine trying to write it might make things worse, like the whole block getting corrupted, or the camera detecting a hardware error in the write process.
If you're not erasing, I would not expect the flash content to get worse.
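To make the "no erase needed" condition concrete, here is a small model, assuming NOR-style behaviour where the erased state is all '1' bits and programming can only clear bits. The function names are made up and this is not how WriteToRom is implemented.

```python
def programmable_without_erase(current: int, target: int) -> bool:
    """True if target can be reached from current by only clearing bits (1 -> 0)."""
    return (current & target) == target


def program_word(current: int, value: int) -> int:
    """Model programming one 16-bit word: bits can be cleared but never set."""
    return current & value & 0xFFFF


# The byte from the first post: flash now reads 0x2e, the reference dump says 0x0e.
print(programmable_without_erase(0x2E, 0x0E))  # the stray bit only needs clearing
print(hex(program_word(0x2E, 0x0E)))

# The opposite direction (a bit that dropped from 1 to 0) would need an erase first.
print(programmable_without_erase(0x0E, 0x2E))
```

Under this model, the flip reported in this thread (a bit reading 1 that should be 0) is the kind that a plain rewrite of the original content can repair.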

Quote
A convenient way to check known-constant parts of the firmware would have value.
Checksum of main firmware (a module could be made to do that)? Of course, we'd need to determine those known constant parts of the ROM.
Quote
Caveat: We don't actually know that the dumps in the repository aren't corrupted, only that they were functional enough to make a dump!
True, but I wouldn't expect lots of corrupt cases.

*

Offline Ant

  • ****
  • 440
Re: Another confirmed case of firmware bit rot
« Reply #2 on: 08 / June / 2020, 14:48:31 »
Of course, we'd need to determine those known constant parts of the ROM.
This information (plus the reference ROM parts) can be extracted from the official firmware update file.

*

Offline Caefix

  • ***
  • 213
  • Sorry, busy deleting test shots...
All lifetime is a loan from eternity.


*

Offline reyalp

  • ******
  • 12586
Re: Another confirmed case of firmware bit rot
« Reply #4 on: 08 / June / 2020, 17:02:00 »
IMO, official firmware updates are rare enough I wouldn't want to put a lot of effort into trying to use them as a general check.

However, we can check the majority of the code and some constant data using information already known from the sig finders. We can generate a list of start address, size, checksum for each firmware, including each known constant region. This should be trivial to check in a module.

CheckSumAll in the firmware appears to take a list from a file, but implementing our own should be pretty easy too.

One complication is multiple firmwares that use the same build. We don't necessarily have a dump for each compatible version, and even if the code is identical some constants like the version string vary.

srsa_4c:
Thanks for the info about WriteToRom. That sounds like it should be pretty safe for fixing bit flips.

*

Offline reyalp

  • ******
  • 12586
Re: Another confirmed case of firmware bit rot
« Reply #5 on: 13 / June / 2020, 22:37:54 »
Here's a preliminary patch to allow checking some regions of ROM.

tools/make-fw-crc.py is a Python script (not Ghidra, just regular python3) which takes a platform, sub, and optionally a dump directory, and uses the stubs library I wrote for the Ghidra scripts to generate a file of
address size crc32
Blocks checked are the main ROM code, initialized data, and code copied to RAM or TCM
example
Code: [Select]
tools/make-fw-crc.py ixus140_elph130 100a -o elph130100a.crc -d d:/chdk/dumps
elph130100a.crc
Code: [Select]
ff000000 4f9478 303c6d1
ff6982d4 2aadc f6971a7
ff684650 13c84 48a9980b

This is far from comprehensive, it misses a lot of constant data, but it's what's easily available from the stubs / memory map code.
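For checking a dump on the PC side against a file in this format, something like the following would do. It is a sketch under assumptions: all three fields are read as hex (as in the example above), and the checksum is assumed to match zlib's CRC-32, which should be confirmed against the module before relying on it. parse_crc_file and check_blocks are made-up names, not part of the patch.

```python
import zlib
from typing import List, Tuple


def parse_crc_file(text: str) -> List[Tuple[int, int, int]]:
    """Parse lines of 'address size crc32', all fields in hex."""
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        addr, size, crc = (int(f, 16) for f in fields)
        blocks.append((addr, size, crc))
    return blocks


def check_blocks(dump: bytes, dump_base: int, blocks) -> List[Tuple[int, bool]]:
    """Return (address, passed) per block; blocks outside the dump count as failed."""
    results = []
    for addr, size, expected in blocks:
        start = addr - dump_base
        in_range = 0 <= start and start + size <= len(dump)
        ok = in_range and (zlib.crc32(dump[start:start + size]) & 0xFFFFFFFF) == expected
        results.append((addr, ok))
    return results


# Synthetic dump and CRC file, just to exercise the code.
base = 0xFF000000
dump = bytes(range(256)) * 4
good = zlib.crc32(dump[0x10:0x30]) & 0xFFFFFFFF
bad = (zlib.crc32(dump[0x100:0x110]) & 0xFFFFFFFF) ^ 1  # deliberately wrong
crc_text = f"{base + 0x10:x} 20 {good:x}\n{base + 0x100:x} 10 {bad:x}\n"

for addr, ok in check_blocks(dump, base, parse_crc_file(crc_text)):
    print(f"{addr:08x} {'pass' if ok else 'FAIL'}")
```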

modules/firmware_crc.c (fwcrc.flt)
Appears in tools menu as "Checksum Canon firmware"
prompts for a file with fselect
checks each of the specified blocks
shows a message box with pass / fail status for each block

It's fairly rough and doesn't do a lot of sanity checking. A maximum of 10 blocks are checked, and it doesn't verify the ranges are sane.

I've successfully tested on sx710, elph130, a540, sx730, g7x, d10, sx160 and elph180

I also verified failure by manually editing the CRC in a file.

My current concept for using this is:
Have a build rule to generate the crc files. These will be manually updated and checked in, like stubs, since they require the dumps to be present. For cases where we don't have a dump (copied firmwares) it can just be skipped.

In the zip builds, copy the crc file for the specific build into the CHDK tree, with a naming convention that uniquely identifies the firmware, maybe like <PID><SUB>.

Then any user can check their firmware, without having every file for every cam in all the zips.

Alternatively, it could be built into the core, but would need some logic for the copied builds.

I thought about eventually making CHDK check automatically. We wouldn't want to do it on every boot, since checking the full ROM takes a noticeable amount of time, but we could use a CFG flag so it's only checked on the first boot.

edit:
work-2 patch updates the python script to include the "zico" blobs for D6 cameras, and output the size and percentage of ROM checked. CHDK code is unchanged
« Last Edit: 14 / June / 2020, 16:14:39 by reyalp »

*

Offline srsa_4c

  • ******
  • 4228
Re: Another confirmed case of firmware bit rot
« Reply #6 on: 14 / June / 2020, 16:05:12 »
My current concept for using this is:
Have a build rule to generate the crc files. These will be manually updated and checked in, like stubs, since they require the dumps to be present. For cases where we don't have a dump (copied firmwares) it can just be skipped.

In the zip builds, copy the crc file for the specific build into the CHDK tree, with a naming convention that uniquely identifies the firmware, maybe like <PID><SUB>.
Sounds good to me, <PID><SUB> would fit in 8 characters which is 8+3 friendly.
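A quick sketch of how such a name could be built and checked against the 8+3 limit. It assumes the PID is written as four hex digits and the sub as its usual four characters; the actual scheme has not been decided, and both the PID value and the function name below are hypothetical.

```python
def crc_file_name(pid: int, sub: str, ext: str = "crc") -> str:
    """Build an 8.3-safe file name from a 16-bit product ID and a firmware sub."""
    base = f"{pid:04X}{sub.upper()}"
    if len(base) > 8 or len(ext) > 3:
        raise ValueError(f"{base}.{ext} does not fit 8+3")
    return f"{base}.{ext.upper()}"


# 0x1234 is a placeholder PID, not a real camera.
print(crc_file_name(0x1234, "100a"))
```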
Quote
I thought about eventually making CHDK check automatically. We wouldn't want to do it on every boot, since checking the full ROM takes a noticeable amount of time, but we could use a CFG flag so it's only checked on the first boot.
I did not try this yet so I don't know the computation time, but we'd want to avoid adding another "help screen". Perhaps adding a temporary top-level menu entry would be nicer than a mandatory check (but figuring out how to do that may not be trivial).

The reason I did not try it yet is because the module crashes on my sx280. Vectors 4 and 0xc in physw so far. Happens on launch or after selecting the file.

edit:
Adding running = 1; at the start of basic_module_init() seems to allow the module to work.
« Last Edit: 14 / June / 2020, 16:12:44 by srsa_4c »

*

Offline reyalp

  • ******
  • 12586
Re: Another confirmed case of firmware bit rot
« Reply #7 on: 14 / June / 2020, 16:34:19 »
edit:
Adding running = 1; at the start of basic_module_init() seems to allow the module to work.
Oops, thanks for catching that. Updated patch attached (no other changes from work-2). Kinda weird it worked on all my cams  :-[

Quote
but we'd want to avoid adding another "help screen". Perhaps adding a temporary top-level menu entry would be nicer than a mandatory check (but figuring out how to do that may not be trivial).
My thought was to not have any UI unless it failed. First startup on a fresh CFG would check (using filename based on port or values compiled in) and only show some UI if it didn't pass. On my cams it seems to take less than a second. What kind of UI to show on failure is not clear though. Users would still be able to invoke manually from the tools menu.

In any case, I think having it purely manual is good for the initial version. Having it be automatic eventually would be nice to give us a better idea how widespread the problem is.
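The first-boot idea could look roughly like this. It is only a sketch of the logic, written in Python for brevity rather than as CHDK code; the cfg dictionary, the fwcrc_checked key and the run_check callback are all placeholders.

```python
def startup_check(cfg: dict, run_check) -> bool:
    """Run the firmware check once per fresh CFG.

    Returns True only when the check ran and failed, i.e. when some UI
    should be shown. A passing check or an already-checked CFG stays silent.
    """
    if cfg.get("fwcrc_checked"):
        return False
    cfg["fwcrc_checked"] = True  # remember that the check has been done
    return not run_check()


cfg = {}
print(startup_check(cfg, lambda: True))   # first boot, firmware OK: no UI
print(startup_check(cfg, lambda: False))  # later boot: not checked again, no UI
print(startup_check({}, lambda: False))   # fresh CFG, check fails: show UI
```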


*

Offline srsa_4c

  • ******
  • 4228
Re: Another confirmed case of firmware bit rot
« Reply #8 on: 14 / June / 2020, 16:51:49 »
My thought was to not have any UI unless it failed. First startup on a fresh CFG would check (using filename based on port or values compiled in) and only show some UI if it didn't pass. On my cams it seems to take less than a second.
I now tried on a DIGIC II cam with 8MB ROM and it seems fast enough - so adding UI does not seem necessary.
Quote
What kind of UI to show on failure is not clear though.
Something like "The camera's firmware appears to be corrupted. Visit the CHDK forum for a fix. Choose OK to make a dump of the firmware."
I guess we won't get hundreds of visitors, so offering help is probably safe.

*

Offline reyalp

  • ******
  • 12586
Re: Another confirmed case of firmware bit rot
« Reply #9 on: 14 / June / 2020, 21:38:03 »
I checked in the initial script and module in r5524. This is work-3 with some additional sanity checks and bug fixes.

I'll work on the build and startup stuff later. One thing I was reminded of is many of the early dumps aren't full ROM. I haven't seen any cases yet where identified memory blocks were outside the dump, but I need to make sure the script does the right thing if that happens.
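One way the script could handle partial dumps is to separate blocks that can be checksummed from blocks that fall outside the dump, rather than treating the latter as failures. This is a sketch, not what make-fw-crc.py does; the function names are made up and the addresses in the example are invented.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start address, size in bytes)


def block_in_dump(addr: int, size: int, dump_base: int, dump_len: int) -> bool:
    """True if the whole block lies inside the address range covered by the dump."""
    return size > 0 and addr >= dump_base and addr + size <= dump_base + dump_len


def split_blocks(blocks: List[Block], dump_base: int, dump_len: int):
    """Split blocks into (checkable, skipped) lists."""
    checkable, skipped = [], []
    for addr, size in blocks:
        target = checkable if block_in_dump(addr, size, dump_base, dump_len) else skipped
        target.append((addr, size))
    return checkable, skipped


# Invented example: a 4 MB partial dump starting at 0xffc00000.
blocks = [(0xFF810000, 0x1000), (0xFFC00000, 0x2000), (0xFFFFF000, 0x2000)]
checkable, skipped = split_blocks(blocks, 0xFFC00000, 0x400000)
print([hex(a) for a, _ in checkable])
print([hex(a) for a, _ in skipped])
```

Here the first block starts below the dump and the last one runs past its end, so only the middle block is checkable.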

For automatic checking, we'll need to know the actual sub being run on, rather than just what it was built for. A pointer to the string is already in bin_compat.h, so it should be simple to expose it with a DEF in stubs or something.

 
