Another confirmed case of firmware bit rot - page 4 - General Discussion and Assistance - CHDK Forum  

Another confirmed case of firmware bit rot

*

Offline Caefix

  • ****
  • 438
  • Sorry, busy deleting test shots...
Re: Another confirmed case of firmware bit rot
« Reply #30 on: 26 / February / 2021, 16:11:04 »
[] [] [] _ _ _ _  ???
Btw, is there an option to continue after an error? Maybe it's just a 'typo in a string'.
All lifetime is a loan from eternity.

*

Offline reyalp

  • ******
  • 13141
Re: Another confirmed case of firmware bit rot
« Reply #31 on: 26 / February / 2021, 17:39:13 »
[] [] [] _ _ _ _  ???
Btw, is there an option to continue after an error? Maybe it's just a 'typo in a string'.
I don't understand. The CHDK CRC check does not have an option NOT to continue. It shows the dialog and you can select to dump the firmware or cancel. You shouldn't see the dialog again unless you reset the CHDK CFG or change the debug->checksum ROM at boot option, although if the camera crashes the CFG might not be saved.

If I understand the user's latest post (https://forum.chdk-treff.de/viewtopic.php?f=3&t=3664&p=32014#p32014), the camera crashes without CHDK.

Anyway, this seems to be another case of firmware corruption (as srsa_4c explains below, probably not just aging). From the dump, the Canon firmware appears to be quite badly damaged. I see multiple differences from our g11 100k dump near each of the following addresses, all in the code area of the ROM that should be identical.
Code: [Select]
ff96baa0
ff9d8aa0
ffaaaaa0
ffb1faa0
The next difference is at fffc0040 which is in the settings / log area that normally varies. (edit for clarity: the differences aren't all at those round addresses, they are nearby)

It may be possible to repair, but with so many differences I suspect this ROM may be on its last legs.

edit:
More details
The word at ff96baa4 changed from ffb838c8 to c61030c0.
This is a pointer referenced by the function at ff96ba04 related to audio (string "AvDataHandleUIIF.c", called from PT_PlaySound, among other things). It could be hit if menu sounds are enabled.

The instruction at ff9d8aa4 changed from
Code: [Select]
        ff9d8aa4 34  53  9f  e5    ldr        r5,[PTR_DAT_ff9d8de0 ]                           = 000089b0
to
Code: [Select]
        ff9d8aa4 20  01  00  a0    andge      r0,r0,r0, lsr #0x2
This function appears to be associated with movie playback.
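
For anyone matching the byte columns in the listings above against the 32-bit values quoted later in the thread: the bytes are shown in memory order, and the core is little-endian, so the first byte is the least significant. A minimal sketch of the conversion (the function name is made up for illustration, it is not from CHDK):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine four bytes, given in memory order (lowest address first), into
 * the 32-bit word a little-endian ARM core would fetch from that address. */
uint32_t le_bytes_to_word(const uint8_t *bytes, size_t len)
{
    assert(bytes != NULL && len == 4);
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}
```

With this, the bytes 34 53 9f e5 give e59f5334 and 20 01 00 a0 give a0000120.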

The word at ffaaaaa4 is a function pointer, changed from ffaa9954 to 56029040

It appears to be associated with "RecMenu"

The instruction at ffb1faa4 changed from
Code: [Select]
        ffb1faa4 c0  0f  9b  e8    ldmia      r11 ,{r6 r7 r8 r9 r10  r11 }=>param_6
to
Code: [Select]
        ffb1faa4 c0  08  1a  00    andeqs     r0,r10 ,r0, asr #0x11
The usage of this code isn't immediately obvious. It calls a function which references "EffectDecode.c"

G11 was released in 2009, making it up to ~12 years old at the time of failure.

edit:
Also notable, the bits in this instance flipped from 1 to 0, where AFAIK the previous ones were the other way.
« Last Edit: 27 / February / 2021, 12:31:32 by reyalp »
Don't forget what the H stands for.

*

Offline srsa_4c

  • ******
  • 4396
Re: Another confirmed case of firmware bit rot
« Reply #32 on: 27 / February / 2021, 07:09:39 »
Anyway, this seems to be another case of bit rot.
I - respectfully - disagree. As you noted later
Quote
the bits in this instance flipped from 1 to 0, where AFAIK the previous ones were the other way.
bits have gone from 1 to 0. We're talking about NOR flash. Erased bits are 1, written bits are 0. When deteriorating, trapped electrons escape the cells, and eventually, the affected cell(s) will read as 1 again.
Note that not one of the recognizable corruptions in this case has bits flipped to 1.
So, in my opinion, what we see here is flash write gone wrong. There are several possible causes *, especially since the camera in question was taken apart. I suspect that RAM was corrupted while the camera tried to save its config at shutdown, or maybe there was a power fluctuation during the write process.
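
The direction of the flips can be checked mechanically. The following is only a sketch with hypothetical helper names: it splits the difference between a reference word and a damaged word into bits that went 1 to 0 and bits that went 0 to 1. Under the NOR flash model described above, a bad write would show only the first kind, while charge loss would show only the second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits that were 1 in the reference dump but read 0 in the damaged dump. */
uint32_t bits_cleared(uint32_t original, uint32_t corrupted)
{
    return original & ~corrupted;
}

/* Bits that were 0 in the reference dump but read 1 in the damaged dump. */
uint32_t bits_set(uint32_t original, uint32_t corrupted)
{
    return ~original & corrupted;
}

/* Count set bits without relying on compiler builtins. */
int popcount32(uint32_t v)
{
    int n = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Returns 1 if no word in the list gained a 1 bit, otherwise 0. */
int only_cleared(const uint32_t *orig, const uint32_t *bad, size_t n)
{
    assert(orig != NULL && bad != NULL);
    for (size_t i = 0; i < n; i++) {
        if (bits_set(orig[i], bad[i]) != 0)
            return 0;
    }
    return 1;
}
```

Fed with the four word pairs from reply #31 (ffb838c8/c61030c0, e59f5334/a0000120, ffaa9954/56029040, e89b0fc0/001a08c0), only_cleared() returns 1, which matches the observation that no bit flipped to 1.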

As for fixing this, we'd need to erase and reprogram 5 erase blocks (64kB each). For that, we'd need to know whether a minimally functional DryOS would access any of those blocks. If any of those blocks is accessed by the firmware while erasing/programming, that would crash the camera and brick it even more.
The 5th corrupted word is at 0xFFB24AA4, affecting a routine that seems connected to movie recording.
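
To illustrate where the count of 5 comes from, here is a small sketch that maps each corrupted address to its erase block. It assumes 64kB blocks aligned on 64kB boundaries, which is a guess about the flash layout and not something confirmed from a datasheet; the names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ERASE_BLOCK_SIZE 0x10000u   /* 64kB, assumed and not verified */

/* Base address of the erase block containing addr. */
uint32_t erase_block_base(uint32_t addr)
{
    return addr & ~(ERASE_BLOCK_SIZE - 1u);
}

/* Count distinct erase blocks touched by a list of addresses.
 * The list must be sorted in ascending order. */
size_t count_erase_blocks(const uint32_t *addrs, size_t n)
{
    size_t count = 0;
    assert(addrs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 ||
            erase_block_base(addrs[i]) != erase_block_base(addrs[i - 1]))
            count++;
    }
    return count;
}
```

For ff96baa4, ff9d8aa4, ffaaaaa4, ffb1faa4 and ffb24aa4 the bases are ff960000, ff9d0000, ffaa0000, ffb10000 and ffb20000, so each corrupted word sits in its own block.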

We could also fix the corrupted words temporarily, by making a special CHDK build, using cache hacks. That build would have to be in form of a diskboot.bin.

edit:
* Now that I read the owner's first post again, the camera was in working condition after lens replacement. The problems started after "installing" CHDK. So, we might be seeing another "failures suspected to be caused by CHDK" case.
« Last Edit: 27 / February / 2021, 09:48:58 by srsa_4c »

*

Offline Caefix

  • ****
  • 438
  • Sorry, busy deleting test shots...
Re: Another confirmed case of firmware bit rot
« Reply #33 on: 27 / February / 2021, 11:52:32 »
 :( The cam crashes with the first keypress (but not <func>), (or the first tune?).
Could it help to turn off sounds during booting, and if so, how?
« Last Edit: 27 / February / 2021, 11:59:29 by Caefix »


*

Offline reyalp

  • ******
  • 13141
Re: Another confirmed case of firmware bit rot
« Reply #34 on: 27 / February / 2021, 14:01:10 »
I - respectfully - disagree. As you noted later
Quote
the bits in this instance flipped from 1 to 0, where AFAIK the previous ones were the other way.
bits have gone from 1 to 0. We're talking about NOR flash. Erased bits are 1, written bits are 0. When deteriorating, trapped electrons escape the cells, and eventually, the affected cell(s) will read as 1 again.
Note that not one of the recognizable corruptions in this case has bits flipped to 1.
Yes, I think you're correct.

Quote
So, in my opinion, what we see here is flash write gone wrong. There are several possible causes *, especially since the camera in question was taken apart. I suspect that RAM was corrupted while the camera tried to save its config at shutdown, or maybe there was a power fluctuation during the write process.

edit:
* Now that I read the owner's first post again, the camera was in working condition after lens replacement. The problems started after "installing" CHDK. So, we might be seeing another "failures suspected to be caused by CHDK" case.
Hmm. I guess it's possible but I think more likely something to do with failing hardware or the repair.

One thing I was going to ask the user if they respond is to do another dump, just to see if the corruption is identical each time. I suspect so, since the symptom is consistent with that one in the sound code triggering a crash when buttons are used.

@Caefix:
I'm not sure why you asked them for a romlog. It's clear the ROM is damaged or there is some hardware problem corrupting data, and in any case the log area is included in the dump albeit in binary form.

*

Offline srsa_4c

  • ******
  • 4396
Re: Another confirmed case of firmware bit rot
« Reply #35 on: 02 / March / 2021, 17:17:49 »
fix the corrupted words temporarily, by making a special CHDK build, using cache hacks.
I'm not sure if the following is needed, but just in case, here is a short howto. It assumes the latest published cache_hacks.h header and the patch in the cache hacks thread.

At an early point of platform boot.c, addresses are for g11 100k:

Code: [Select]
cache_lock();
// the following two lines are one way to make cache_clean_flush_range "safe"
cache_fake(0xFF810910, 0xe1a00000, TYPE_ICACHE);
cache_fake(0xFF810918, 0xe1a00000, TYPE_ICACHE);

followed by more cache_fake() calls for the corrupted words
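
As a rough idea of what those additional calls could be driven by, here is a table of the corrupted words with the known-good values listed elsewhere in this thread. Everything in it is a host-side sketch: the enum, the struct and the recorder callback are hypothetical stand-ins, and on the camera the callback would be replaced by the real cache_fake() with the real TYPE_ICACHE / TYPE_DCACHE constants from cache_hacks.h. Treating the two pointers as data cache patches and the three instructions as instruction cache patches is my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the TYPE_ICACHE / TYPE_DCACHE constants. */
enum patch_type { PATCH_ICACHE, PATCH_DCACHE };

struct rom_patch {
    uint32_t addr;          /* corrupted ROM address */
    uint32_t value;         /* value from the known-good g11 100k dump */
    enum patch_type type;   /* fetched as an instruction or read as data */
};

const struct rom_patch g11_patches[] = {
    { 0xff96baa4u, 0xffb838c8u, PATCH_DCACHE }, /* pointer, audio code */
    { 0xff9d8aa4u, 0xe59f5334u, PATCH_ICACHE }, /* ldr r5, [...] */
    { 0xffaaaaa4u, 0xffaa9954u, PATCH_DCACHE }, /* function pointer */
    { 0xffb1faa4u, 0xe89b0fc0u, PATCH_ICACHE }, /* ldmia r11, {...} */
    { 0xffb24aa4u, 0xe20338ffu, PATCH_ICACHE }, /* and r3, r3, #0xff0000 */
};
const size_t g11_patch_count = sizeof(g11_patches) / sizeof(g11_patches[0]);

typedef void (*patch_fn)(uint32_t addr, uint32_t value, enum patch_type type);

/* Hand every table entry to the callback; returns the number applied. */
size_t apply_patches(const struct rom_patch *p, size_t n, patch_fn apply)
{
    assert(p != NULL && apply != NULL);
    for (size_t i = 0; i < n; i++)
        apply(p[i].addr, p[i].value, p[i].type);
    return n;
}

/* Dry-run recorder used here instead of the real cache_fake(). */
uint32_t recorded_addrs[8];
size_t recorded_count = 0;

void record_patch(uint32_t addr, uint32_t value, enum patch_type type)
{
    (void)value;
    (void)type;
    if (recorded_count < sizeof(recorded_addrs) / sizeof(recorded_addrs[0]))
        recorded_addrs[recorded_count++] = addr;
}
```

Calling apply_patches(g11_patches, g11_patch_count, record_patch) on a PC is a cheap way to check the table before it goes anywhere near a camera.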

*

Offline reyalp

  • ******
  • 13141
Re: Another confirmed case of firmware bit rot
« Reply #36 on: 02 / March / 2021, 19:04:29 »
fix the corrupted words temporarily, by making a special CHDK build, using cache hacks.
I'm not sure if the following is needed, but just in case, here is a short howto. It assumes the latest published cache_hacks.h header and the patch in the cache hacks thread.

At an early point of platform boot.c, addresses are for g11 100k:

Code: [Select]
cache_lock();
// the following two lines are one way to make cache_clean_flush_range "safe"
cache_fake(0xFF810910, 0xe1a00000, TYPE_ICACHE);
cache_fake(0xFF810918, 0xe1a00000, TYPE_ICACHE);

followed by more cache_fake() calls for the corrupted words
Thanks, I found the threads but had not put it all together and tested yet :)

edit:
Finally got some time to test a bit on D10, seems to work :)

Some notes
as far as I understand the function is cache_clean_flush_range(type, address, size)
type == 0 is icache, 1 is dcache
size == -1 is a special case for all, but doesn't appear to be used.
clean doesn't make sense for icache, so the function just jumps into cache_flush_range in that case.

The first patch 0xFF810910 makes the icache check a NOP, which would convert icache calls to dcache calls. If the intent is to NOP it, a BX LR might be a better choice. A more correct choice would be to jump to our own clean function, which skips the locked set.

In practice, it doesn't seem like the camera will try to flush icache in normal operation. AdditionAgentRAM_FW is the only function I see that would potentially call it with icache. Restart does use the cache control instructions directly, but that shouldn't matter.

More concerning is clean_flush_and_disable which is called for dcache on flash write. My impression is flash writes normally happen around shutdown (for settings, file counter or romlog on crash), and I didn't lose any dcache patches after messing around with the camera for a while, but I'm less confident of this. I'll do a bit more testing before I put together a patch for the g11 user.

One could theoretically hook cache_flush_and_enable to re-apply dcache patches, but jumping to an arbitrary address is annoying when you can't patch data to use LDR PC.

BTW, if you do use LDR PC, you can put the data at the same address as the instruction. Then the instruction value is always ldr pc, [pc, #-8] (0xe51ff008)
like
Code: [Select]
    cache_fake(0xff870b14, 0xe51ff008, TYPE_ICACHE);
    cache_fake(0xff870b14, (unsigned)(some_chdk_function), TYPE_DCACHE);
Data patches should probably come after instruction, since you might otherwise pick up values in adjacent words when loading an icache line.
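
For reference, the 0xe51ff008 value can be derived rather than memorized. This is a sketch with made-up function names: it encodes ldr pc, [pc, #offset] for ARM state with the always condition, and computes the address the load reads from, using the fact that PC reads as the instruction address plus 8 in ARM state.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "ldr pc, [pc, #offset]" (ARM state, condition AL).
 * The immediate field is 12 bits, so |offset| must be below 4096. */
uint32_t encode_ldr_pc_pcrel(int32_t offset)
{
    const uint32_t base = 0xe51ff000u;   /* U bit clear: subtract offset */
    assert(offset > -4096 && offset < 4096);
    if (offset >= 0)
        return (base | 0x00800000u) | (uint32_t)offset;  /* U bit set: add */
    return base | (uint32_t)(-offset);
}

/* Address the load reads from. In ARM state the PC operand reads as the
 * address of the instruction plus 8. */
uint32_t ldr_pc_pcrel_source(uint32_t insn_addr, int32_t offset)
{
    return insn_addr + 8u + (uint32_t)offset;
}
```

With offset -8 the encoding is 0xe51ff008 and the source address equals the instruction address, which is why the icache and dcache patches above can share ff870b14.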
« Last Edit: 05 / March / 2021, 00:11:17 by reyalp »

*

Offline reyalp

  • ******
  • 13141
Re: Another confirmed case of firmware bit rot
« Reply #37 on: 07 / March / 2021, 00:30:40 »
Here's my proposed patch for the g11 user. I'll post a build for them in the chdkde forum later, but any comments or review are welcome.

Also attached is the code I tested on D10.

Based on testing, I decided to only patch the one case (data cache, size >= 0x2000) in cache_clean_flush_range. I'm fairly confident the other cases shouldn't be called in normal operation.

The patch checks once per spytask iteration that lockdown is still enabled and the expected values are in place. If any of these fail, it shows a message in the UI. It might be better to use DebugAssert to shut down the camera.
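
The value check could look roughly like the following. This is not the code from the attached patch, just a sketch of the idea with a hypothetical function name: compare what is currently read back at each patched address with the value the patch should supply, and report one bit per mismatch. The lockdown check itself is hardware specific and left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare the values read back from the patched addresses with the values
 * the patches were meant to supply. Bit i of the result is set when entry i
 * no longer matches, so 0 means every patch is still in place. */
uint32_t patch_failure_mask(const uint32_t *observed,
                            const uint32_t *expected, size_t n)
{
    uint32_t mask = 0;
    assert(observed != NULL && expected != NULL && n <= 32);
    for (size_t i = 0; i < n; i++) {
        if (observed[i] != expected[i])
            mask |= (uint32_t)1 << i;
    }
    return mask;
}
```

On the camera, observed[] would be filled by reading the patched addresses once per iteration, and a non-zero mask would be shown to the user.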

On D10, shooting stills and video, switching between rec and play, and changing menu settings did not trigger any failures.

Calling cache_clean_flush_range(0,0x1900,0x3000) (instruction cache, > 0x2000) and cache_clean_flush_range(1,0x1900,-1) (data cache, all) did trigger the expected bits in the message. So did using poke to modify the patched data at ff870d9c. cache_clean_flush_range(1,0x1900,0x3000)  (data cache, > 0x2000) did not.

My final list of corrupted addresses on the g11 is
Code: [Select]
ff96baa4 should be pointer: ffb838c8
ff9d8aa4 should be: 34 53 9f e5 - ldr r5, ...
ffaaaaa4 should be pointer: ffaa9954
ffb1faa4 should be: c0 0f 9b e8 - ldmia r11, {r6 r7 r8 r9 r10 r11}
ffb24aa4 should be: ff 38 03 e2 - and r3, r3, #0xff0000
It's notable that all the addresses end in aa4. The spacing between the addresses doesn't show an obvious pattern: the gaps are multiples of 4k, but not of the presumed 64k sector size.
Code: [Select]
0xff96baa4
0xff9d8aa4  0x6d000
0xffaaaaa4  0xd2000
0xffb1faa4  0x75000
0xffb24aa4   0x5000
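
The two observations about the addresses (same low 12 bits, gaps that are multiples of 4k but not 64k) are easy to re-check if more corrupted words turn up. A small sketch, with helper names invented for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every gap between consecutive addresses (ascending order)
 * is a multiple of unit, otherwise 0. */
int gaps_are_multiples_of(const uint32_t *addrs, size_t n, uint32_t unit)
{
    assert(addrs != NULL && unit != 0);
    for (size_t i = 1; i < n; i++) {
        if ((addrs[i] - addrs[i - 1]) % unit != 0)
            return 0;
    }
    return 1;
}

/* Returns 1 if (addr & mask) == value for every address, otherwise 0. */
int all_share_low_bits(const uint32_t *addrs, size_t n,
                       uint32_t mask, uint32_t value)
{
    assert(addrs != NULL);
    for (size_t i = 0; i < n; i++) {
        if ((addrs[i] & mask) != value)
            return 0;
    }
    return 1;
}
```

For the five addresses above, the gap check passes with unit 0x1000 and fails with 0x10000, and all five share the low bits aa4 under mask 0xfff.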

I also don't see anything really obvious in the changed values.
Code: [Select]
ff96baa4: ffb838c8 => c61030c0
ff9d8aa4: e59f5334 => a0000120
ffaaaaa4: ffaa9954 => 56029040
ffb1faa4: e89b0fc0 => 001a08c0
ffb24aa4: e20338ff => 200130e8
The first and third originals both start with ff, but the corrupted values start with c6 and 56, which shows the same value wasn't written each time.

edit:
We know the value written where the original was 1. Comparing (original, corrupted, known write values)
Code: [Select]
0xffb838c8: 11111111 10111000 00111000 11001000
0xc61030c0: 11000110 00010000 00110000 11000000
0xc6??????: 11000110 0?010??? ??110??? 11??0???

0xe59f5334: 11100101 10011111 01010011 00110100
0xa0000120: 10100000 00000000 00000001 00100000
0x???0????: 101??0?0 0??00000 ?0?0??01 ??10?0??

0xffaa9954: 11111111 10101010 10011001 01010100
0x56029040: 01010110 00000010 10010000 01000000
0x56??????: 01010110 0?0?0?1? 1??10??0 ?1?0?0??

0xe89b0fc0: 11101000 10011011 00001111 11000000
0x001a08c0: 00000000 00011010 00001000 11000000
0x?????8??: 000?0??? 0??11?10 ????1000 11??????

0xe20338ff: 11100010 00000011 00111000 11111111
0x200130e8: 00100000 00000001 00110000 11101000
0x??????e8: 001???0? ??????01 ??110??? 11101000
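
The third row of each group above can be generated instead of worked out by hand. This sketch assumes the model discussed earlier in the thread, namely that a flash write can only clear bits, so the corrupted word is the original ANDed with whatever was written. Where the original bit was 1, the corrupted bit is the written bit; where it was 0, the written bit is unknown. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fill out[33] with one character per bit, most significant bit first:
 * '0' or '1' where the original bit was 1 (the corrupted bit then reveals
 * what was written), and '?' where the original bit was already 0. */
void known_written_bits(uint32_t original, uint32_t corrupted, char out[33])
{
    /* Model assumption: a write can only clear bits, never set them. */
    assert((corrupted & ~original) == 0);
    for (int i = 0; i < 32; i++) {
        uint32_t bit = 0x80000000u >> i;
        if (original & bit)
            out[i] = (corrupted & bit) ? '1' : '0';
        else
            out[i] = '?';
    }
    out[32] = '\0';
}
```

For ffb838c8 and c61030c0 this produces the same pattern as the first group, just without the spaces between bytes.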
« Last Edit: 07 / March / 2021, 02:49:55 by reyalp »


*

Offline srsa_4c

  • ******
  • 4396
Re: Another confirmed case of firmware bit rot
« Reply #38 on: 07 / March / 2021, 14:24:28 »
The patch looks good to me. The added checks are probably a good idea, since none of us used cache hacked builds for an extended time.
I noticed that the crc check is untouched - it will fail in case the user decides to run it or the config gets reset. I'd probably remove the mandatory check from main.c, or, patch the instruction words in dcache too.

*

Offline reyalp

  • ******
  • 13141
Re: Another confirmed case of firmware bit rot
« Reply #39 on: 07 / March / 2021, 20:24:02 »
I noticed that the crc check is untouched - it will fail in case the user decides to run it or the config gets reset. I'd probably remove the mandatory check from main.c, or, patch the instruction words in dcache too.
Yeah, that's a good point. I was just going to warn the user, but better to make it work. I chose to patch the instructions, so the checksum can find other corruption if it happens.

Build and code posted in https://forum.chdk-treff.de/viewtopic.php?f=3&t=3664&p=32027#p32027

(also attaching patch here for posterity)

 
