
A540 histogram + zebra crash with EABI build

  • 14 Replies
  • 2106 Views
*

Offline reyalp

  • ******
  • 11392
A540 histogram + zebra crash with EABI build
« on: 16 / August / 2015, 01:12:31 »
I noticed this when we first started playing with the EABI builds, but hadn't spent too much time on it.

Both the gcc 4.9.3 included with the current chdkshell, and 4.8.4 I downloaded earlier seem to have the exact same crash. Zebra hangs and the camera shuts down after a few seconds, without any romlog.

histogram crashes with a data abort in histogram_process, as soon as histogram is enabled.

arm-elf builds with gcc 4.5.1 seem fine, as is the current autobuild (gcc 4.4.3)

The crash has happened with trunk versions from 3992 to 4223.

All my other cameras (all digic 4 / dryos) seem to be fine.

I'd be interested to know if other cameras suffer the same problem, especially similar generation to a540. I can provide builds if needed.

Although I get a romlog for the histogram crash, I haven't been able to make sense of it.

Code: [Select]
___> !require'extras/vxromlog'.load('romlog-histo-2015-08-15_1.log'):print_all()
Exception vector 0x10 (Data abort)
Occured at 2015:08:15 16:28:54
Task ID: 3145144
Task Name: tSpyTask
Registers:
r0   0x000000f0          240
r1   0x105f17a0    274667424 < vid_get_viewport_fb value
r2   0x00008001        32769
r3   0x00032c68       207976
r4   0x001c0d70      1838448 < in module bss, histogram_proc?
r5   0x00000000            0
r6   0x00002739        10041
r7   0x000b5340       742208
r8   0x00000000            0
r9   0x00000000            0
r10  0x00000000            0
r11  0x00000000            0
r12  0x00000000            0
sp   0x002ffce0      3144928
lr   0x001c0387      1835911
pc   0x001c0380      1835904 < in histogram module (histogram_process)
cpsr 0x80000013  -2147483629
Stack:
0x001c02ad      1835693
0x00000003            3
0x00015180        86400
0x6f747369   1869902697
0x746c662e   1953261102
0x00000000            0
0x00001f24         7972
0x000a7d75       687477
0x00000008            8
0x00000100          256
0x000af1d8       717272
0x00002738        10040
0x00008001        32769
0x010d0102     17629442
0x00000100          256
0x000b1238       725560
0x000af1d8       717272
0x000b11d4       725460
0x000af1d8       717272
0x00002739        10041
0x000b5340       742208
0x000a49e7       674279
0x000b41d8       737752
0x000961c3       614851 < return core_spytask, from indirect call after gui_redraw, probably libhisto->histogram_process()
0xffeefa18     -1115624
0xffec2618     -1300968
0x00000000            0
0xffeee44c     -1121204
0x00000000            0
0x00000000            0
0x00000000            0
0x00000000            0
modules.log
Code: [Select]
Tick    ,Op,Address ,Name (2015:08:15 16:27:45)
  133310,LD,001a3a80,lua.flt
  203260,LD,001c0278,histo.flt

Disassembling the elf so .text starts at 0x1c029c (load address + 36 byte header) gives
Code: [Select]
  1c0378: 4bab      ldr r3, [pc, #684] ; (1c0628 <histogram_process+0x30c>) < vid_get_viewport_active_buffer
  1c037a: 4798      blx r3
  1c037c: 69a3      ldr r3, [r4, #24]
  1c037e: 1818      adds r0, r3, r0
  1c0380: 61a0      str r0, [r4, #24] < crash???
  1c0382: 4baa      ldr r3, [pc, #680] ; (1c062c <histogram_process+0x310>)
  1c0384: 4798      blx r3
This doesn't make sense to me: data abort should be an invalid memory access, but r4 is in the module bss, and definitely a valid RAM address. Plus it was just accessed by the preceding LDR.

I'm pretty sure the romlog for data abort gives the PC where the actual exception happened, but if it was one instruction before or after, it wouldn't make any more sense.

I don't think I've disassembled the .elf at the wrong address. If I disassemble the actual flt that was on the camera, 0x1c0380 - 0x1c0278 = 0x108 gives the same instruction.
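The address arithmetic above can be double-checked with a few lines of Python. This is only a sketch using the values quoted in the romlog and modules.log; the variable names are made up for illustration.

```python
# Values copied from the romlog and modules.log quoted above.
LOAD_ADDR = 0x001C0278    # histo.flt load address (modules.log)
HEADER_SIZE = 36          # flt header size as stated in the post
CRASH_PC = 0x001C0380     # pc from the romlog

# Address at which .text has to be placed when disassembling the elf.
text_start = LOAD_ADDR + HEADER_SIZE
# Offset of the faulting instruction inside the flt file on the card.
flt_offset = CRASH_PC - LOAD_ADDR
# Offset of the faulting instruction inside .text.
text_offset = CRASH_PC - text_start

print(hex(text_start), hex(flt_offset), hex(text_offset))
```

This prints `0x1c029c 0x108 0xe4`, matching the .text start and the 0x108 flt offset used above.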

Other than zebra and histogram, the build seems to work OK, although I haven't used it extensively.
Don't forget what the H stands for.

*

Offline blackhole

  • *****
  • 571
  • A590IS 101b
    • Planetary astrophotography
Re: A540 histogram + zebra crash with EABI build
« Reply #1 on: 16 / August / 2015, 05:13:40 »
Quote
I'd be interested to know if other cameras suffer the same problem, especially similar generation to a540. I can provide builds if needed.
Both my cameras, a530 and a590, work well with EABI builds (gcc 4.8.4) from chdkshell, trunk version 4175 (it seems chdkshell is having problems downloading recent trunk again).
There are no problems with zebra or histogram.

*

Offline philmoz

  • *****
  • 3070
    • Photos
Re: A540 histogram + zebra crash with EABI build
« Reply #2 on: 16 / August / 2015, 07:18:09 »
If the disassembly address is correct then wouldn't the LR value in the stack trace mean that the BLX instruction at 1c0384 had been executed? This is after the PC value which is very odd.
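The LR observation can be checked numerically. A 16-bit Thumb `BLX <Rm>` stores the address of the following instruction in LR with bit 0 set to mark Thumb state. The sketch below applies that rule to the two `blx r3` instructions in the disassembly; the helper name is hypothetical.

```python
def thumb_blx_reg_lr(insn_addr: int) -> int:
    """LR value written by a 16-bit Thumb BLX <Rm> located at insn_addr."""
    # Return address is the next halfword; bit 0 flags Thumb state.
    return (insn_addr + 2) | 1

ROMLOG_LR = 0x001C0387  # lr from the histogram romlog
ROMLOG_PC = 0x001C0380  # pc from the histogram romlog

# The two blx r3 instructions visible in the disassembly.
candidates = {addr: thumb_blx_reg_lr(addr) for addr in (0x1C037A, 0x1C0384)}
matching = [addr for addr, lr in candidates.items() if lr == ROMLOG_LR]

for addr, lr in candidates.items():
    print(f"blx at {addr:#x} -> lr {lr:#x}")
print("matches romlog lr:", [hex(a) for a in matching])
```

Only the BLX at 0x1c0384 produces the logged LR, and that instruction sits after the logged PC.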

Reminds me of the optimisation instruction re-ordering problem we had a while back.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)

*

Offline srsa_4c

  • ******
  • 3654
Re: A540 histogram + zebra crash with EABI build
« Reply #3 on: 16 / August / 2015, 08:25:44 »
Some thoughts.
- AFAIK this is the only port that has those unusual viewport dimensions - this might trigger some bugs
- RAM may be corrupted more or less
- would it be possible to write a custom exception handler? It could be used to make a RAM dump (assuming the problem really only affects spytask). If using file operations from that context is not possible, then the custom handler could give a signal to a sleeping task to make the RAM dump.

edit:
Probably not related, but who knows. I have a hack to get rid of the E32 error (broken image stabilizer). I'm (incorrectly) calling a fw stub directly from spytask (core/main.c). Arm-elf toolchains generate correct code for this situation (thumb-arm-thumb interworking). As I recently found out, my arm-none-eabi toolchain does not generate correct interworking code for this case (it just calls an ARM function from thumb which of course doesn't work).

edit2:
Tried modifying the a410 viewport functions to return those odd widths (352, 704), but I still can't get it to crash.
« Last Edit: 16 / August / 2015, 09:45:41 by srsa_4c »


*

Offline reyalp

  • ******
  • 11392
Re: A540 histogram + zebra crash with EABI build
« Reply #4 on: 16 / August / 2015, 20:12:00 »
Thanks everyone. I did some more in-depth testing, hitting pretty much all the modules. This was mostly done with srsa's d6 patch, but similar crashes are seen using the current trunk built with the same compiler.

benchmark crashed with romlog, ~1 second after opening, without even pressing set to start a benchmark run.
Note the benchmark module is different in the d6 code. The trunk benchmark crashed after pressing set in rec mode. Correction: this crashes just opening the benchmark and waiting ~1 sec, like the d6 patch.

benchmark didn't crash in either version in playback.

edge overlay crashed with romlog in rec on pressing half shoot. Did not crash in play with edge enabled for playback.

mdfb ubasic script crashed with romlog when MD was triggered.

I ran ubtest and llibtst and shot some DNGs without problems.

I tested pretty much the same things on D10 and didn't get any crashes. I haven't tested each one a whole bunch of times, but the a540 crashes seem to be very repeatable.

All of the romlogs from the crashes above are data aborts, and they are all weird.

benchmark (d6 patch):
Code: [Select]
lr   0x00002cc9        11465 < doesn't seem like a valid lr?
pc   0x001ba9d0      1812944 < not in bench module, but close?
cpsr 0xa0000033  -1610612685
Stack:
0x00002c8c        11404
0x00098b4f       625487 < in gui_redraw, gui_mode->redraw
0x00000000            0

Edge overlay
Code: [Select]
lr   0x001bf4d7      1832151 < ???
pc   0x001bf4d0      1832144 < _module_loader 1bf4d0: bd00      pop {pc}

Md
Code: [Select]
Task Name: tSpyTask < most MD code runs in kbd_task???

lr   0x001bf4d7      1832151 < similar to other LRs
pc   0x001bf4d0      1832144 < not in md or ubasic module, but close?
cpsr 0x80000013  -2147483629
Stack:
0x001c0138      1835320
0x001bfd67      1834343
0x00000000            0
0x00098fb5       626613 < kbd_is_key_pressed

Note that edge and md have a weird LR (PC+7), similar to what Phil noted in the histogram crash.
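To compare the romlogs side by side, the LR/PC difference and the CPSR fields can be computed from the logged values. This is a rough sketch: the register values are copied from the romlogs in this thread, the bit positions are the standard ARM CPSR layout, and the helper and dictionary names are invented for the example. The edge overlay CPSR was not quoted, so it is left out.

```python
# pc / lr / cpsr values copied from the romlogs quoted in this thread.
ROMLOGS = {
    "histogram": {"pc": 0x001C0380, "lr": 0x001C0387, "cpsr": 0x80000013},
    "benchmark": {"pc": 0x001BA9D0, "lr": 0x00002CC9, "cpsr": 0xA0000033},
    "edge":      {"pc": 0x001BF4D0, "lr": 0x001BF4D7, "cpsr": None},
    "md":        {"pc": 0x001BF4D0, "lr": 0x001BF4D7, "cpsr": 0x80000013},
}

# Processor mode encodings from CPSR bits [4:0].
MODES = {0x10: "usr", 0x11: "fiq", 0x12: "irq", 0x13: "svc",
         0x17: "abt", 0x1B: "und", 0x1F: "sys"}

def decode_cpsr(cpsr: int) -> dict:
    """Split a CPSR value into condition flags, Thumb bit and mode."""
    flag_bits = (("N", 31), ("Z", 30), ("C", 29), ("V", 28))
    flags = "".join(name for name, bit in flag_bits if (cpsr >> bit) & 1)
    return {
        "flags": flags,
        "thumb": bool(cpsr & 0x20),              # T bit (bit 5)
        "mode": MODES.get(cpsr & 0x1F, "?"),
    }

deltas = {name: r["lr"] - r["pc"] for name, r in ROMLOGS.items()}

for name, r in ROMLOGS.items():
    info = decode_cpsr(r["cpsr"]) if r["cpsr"] is not None else "cpsr not quoted"
    print(f"{name:10s} lr-pc={deltas[name]:+d}  {info}")
```

The histogram, edge and md logs all show LR = PC+7, while the benchmark log does not fit that pattern. The T bit is clear in the histogram and md logs and set in the benchmark log.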

Quote
- AFAIK this is the only port that has those unusual viewport dimensions - this might trigger some bugs
I thought there were a few others, but they may not report them correctly.

It's notable that all of these interact with the viewport, although I don't think just loading the benchmark module without starting a run should.

Quote
- RAM may be corrupted more or less
Yes, RAM corruption would seem to fit.

Cache issues could also produce this kind of symptom, though that seems less likely to be compiler specific.

Stack overflow or corruption might be another possibility, though the symptoms were a bit different when I ran into that with the DNG stuff http://chdk.setepontos.com/index.php?topic=9970.msg100941#msg100941

Quote
- would it be possible to write a custom exception handler? It could be used to make a RAM dump (assuming the problem really only affects spytask).
Good idea. ixus80 actually has code for custom exception handlers, though it's dryos.

« Last Edit: 16 / August / 2015, 21:03:00 by reyalp »

*

Offline reyalp

  • ******
  • 11392
Re: A540 histogram + zebra crash with EABI build
« Reply #5 on: 18 / August / 2015, 02:06:26 »
I tried writing the freshly loaded benchm module to SD card at the end of module_preload. It still worked in playback and crashed in rec, and the only differences between the two dumps were addresses adjusted for linking, which looked reasonable in both cases.

*

Offline reyalp

  • ******
  • 11392
Re: A540 histogram + zebra crash with EABI build
« Reply #6 on: 23 / August / 2015, 22:14:58 »
This appears to be an interworking problem, though perhaps not quite the same one srsa mentioned above. If my conclusion is correct, the EABI builds should NOT be used until it is resolved, except maybe for digic6.

I narrowed the benchmark failure down to a call to vid_get_viewport_height()

On a540, this function calls _GetVRAMVPixelsSize(), but only if in REC mode.

In the EABI build, the disassembly looks like this

1) thumb to ARM. Note it does B to the arm function, so the existing thumb LR will be used.
Code: [Select]
000a7444 <__vid_get_viewport_height_from_thumb>:
   a7444: 4778      bx pc
   a7446: 46c0      nop ; (mov r8, r8)
   a7448: eaffb4eb b 947fc <vid_get_viewport_height>

2) vid_get_viewport_height just calls vid_get_viewport_height_proper, again with a straight B and no return instruction
Code: [Select]
000947fc <vid_get_viewport_height>:
   947fc: eafffff0 b 947c4 <vid_get_viewport_height_proper>
3) vid_get_viewport_height_proper, which ends in a tail call to _GetVRAMVPixelsSize
Code: [Select]
000947c4 <vid_get_viewport_height_proper>:
   947c4: e92d4008 push {r3, lr}
   947c8: eb0049ea bl a6f78 <__mode_get_from_arm>
   947cc: e2003c03 and r3, r0, #768 ; 0x300
   947d0: e3530c02 cmp r3, #512 ; 0x200
   947d4: 0a000004 beq 947ec <vid_get_viewport_height_proper+0x28>
   947d8: e20000ff and r0, r0, #255 ; 0xff
   947dc: e350002c cmp r0, #44 ; 0x2c
   947e0: 0a000003 beq 947f4 <vid_get_viewport_height_proper+0x30>
   947e4: e8bd4008 pop {r3, lr}
   947e8: ea000073 b 949bc <_GetVRAMVPixelsSize>
   947ec: e3a000f0 mov r0, #240 ; 0xf0
   947f0: e8bd8008 pop {r3, pc}
   947f4: e3a00078 mov r0, #120 ; 0x78
   947f8: e8bd8008 pop {r3, pc}
Note the B to _GetVRAMVPixelsSize, after restoring the original LR. So, Canon firmware code gets a thumb LR, but only in rec mode. The compiler appears to assume that any function it calls will return in a thumb-safe manner, which is presumably valid for the code it generates, but is not for the firmware code.

This is less likely to be a problem on DryOS cameras, because they generally use BX or another thumb safe return. The VxWorks cameras frequently use MOV PC, LR (as _GetVRAMVPixelsSize does on A540).
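The difference between the two return styles can be illustrated with a toy model. This is a simplified sketch, not an emulator: it assumes a core where BX selects the instruction set from bit 0 of the target while MOV PC, LR executed in ARM state leaves the state unchanged, and it simply masks the low address bits in the MOV case. The function names are hypothetical, and the LR value is taken from the histogram romlog purely as an example of a Thumb return address.

```python
def ret_bx(lr: int, state: str) -> tuple:
    """BX LR: bit 0 of LR selects the state to return to."""
    new_state = "thumb" if lr & 1 else "arm"
    return new_state, lr & ~1

def ret_mov_pc(lr: int, state: str) -> tuple:
    """MOV PC, LR in ARM state: branch without changing state.

    Masking the low bits is a simplification made for this model only.
    """
    return state, lr & ~3

THUMB_LR = 0x001C0387  # example Thumb return address (bit 0 set)

# Firmware functions run in ARM state, so both returns start from "arm".
after_bx = ret_bx(THUMB_LR, "arm")
after_mov = ret_mov_pc(THUMB_LR, "arm")

print("BX LR      ->", after_bx[0], hex(after_bx[1]))
print("MOV PC, LR ->", after_mov[0], hex(after_mov[1]))
```

In this model the BX return lands back in Thumb state at the right halfword, while the MOV return keeps running in ARM state inside Thumb code, so whatever is executed next is decoded with the wrong instruction set.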

I assume the reason this isn't more of a problem is that there is only risk of it occurring when the compiler optimizes a function call like this.

The corresponding code in the arm-elf build is
Code: [Select]
00094758 <vid_get_viewport_height_proper>:
   94758: e52de004 push {lr} ; (str lr, [sp, #-4]!)
   9475c: eb004f39 bl a8448 <__mode_get_from_arm>
   94760: e2003c03 and r3, r0, #768 ; 0x300
   94764: e3530c02 cmp r3, #512 ; 0x200
   94768: 03a000f0 moveq r0, #240 ; 0xf0
   9476c: 0a000003 beq 94780 <vid_get_viewport_height_proper+0x28>
   94770: e20000ff and r0, r0, #255 ; 0xff
   94774: e350002c cmp r0, #44 ; 0x2c
   94778: 03a00078 moveq r0, #120 ; 0x78
   9477c: 1b000073 blne 94950 <_GetVRAMVPixelsSize>
   94780: e49de004 pop {lr} ; (ldr lr, [sp], #4)
   94784: e12fff1e bx lr
Which does a BL _GetVRAMVPixelsSize and so will end up using its own BX LR.
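The B versus BL distinction is visible directly in the instruction words. The following sketch decodes the ARM branch encoding (condition in bits 31:28, link bit in bit 24, signed 24-bit word offset relative to PC+8) for the words shown in the two listings. The function name is made up for the example, and only plain B/BL are handled.

```python
def decode_arm_branch(addr: int, word: int) -> tuple:
    """Decode an ARM B/BL instruction word located at addr.

    Returns (mnemonic, condition field, target address).
    """
    if (word >> 25) & 0b111 != 0b101:
        raise ValueError("not a B/BL encoding")
    cond = word >> 28
    if cond == 0xF:
        raise ValueError("unconditional space (e.g. BLX imm) not handled")
    link = bool((word >> 24) & 1)          # L bit: 1 = BL, 0 = B
    imm24 = word & 0xFFFFFF
    if imm24 & 0x800000:                   # sign-extend the 24-bit offset
        imm24 -= 0x1000000
    target = (addr + 8 + (imm24 << 2)) & 0xFFFFFFFF
    return ("bl" if link else "b"), cond, target

# (address, instruction word) pairs copied from the listings above.
eabi_veneer = decode_arm_branch(0xA7448, 0xEAFFB4EB)
eabi_wrapper = decode_arm_branch(0x947FC, 0xEAFFFFF0)
eabi_tail = decode_arm_branch(0x947E8, 0xEA000073)
elf_call = decode_arm_branch(0x9477C, 0x1B000073)

for name, (op, cond, target) in (("EABI veneer", eabi_veneer),
                                 ("EABI wrapper", eabi_wrapper),
                                 ("EABI tail call", eabi_tail),
                                 ("arm-elf call", elf_call)):
    print(f"{name:15s} {op:2s} cond={cond:#x} -> {target:#x}")
```

All three EABI branches on the path to _GetVRAMVPixelsSize have the link bit clear, so LR is never rewritten after the Thumb caller set it. The arm-elf word has the link bit set (with condition NE), so the firmware function returns to ARM code.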

So the question is, is there a way to inform the compiler that firmware stubs are NOT thumb safe?

edit:
I should add that I'm not sure if the elf compiler is actually guaranteed to do the right thing, or it's just less aggressive about this kind of optimization.

edit:
PTP live view also suffers the crash, as expected.

An alternate approach might be to make the stubs generate a thumb safe return, but I haven't come up with an efficient way to do it.
« Last Edit: 23 / August / 2015, 22:35:37 by reyalp »

*

Offline philmoz

  • *****
  • 3070
    • Photos
Re: A540 histogram + zebra crash with EABI build
« Reply #7 on: 23 / August / 2015, 23:16:20 »
If you move the offending code from platform/a540/sub/100b/lib.c to a540/sub/lib.c it will compile the wrappers correctly.

The current build rules are missing 'CFLAGS+=-march=armv4' for EABI in the 'sub/firmware' build (platform/makefile_sub.inc). I can't just add it there though because that breaks the thumb build in the 'sub/firmware' directory.

Phil.


*

Offline reyalp

  • ******
  • 11392
Re: A540 histogram + zebra crash with EABI build
« Reply #8 on: 24 / August / 2015, 00:29:25 »
Quote
If you move the offending code from platform/a540/sub/100b/lib.c to a540/sub/lib.c it will compile the wrappers correctly.

The current build rules are missing 'CFLAGS+=-march=armv4' for EABI in the 'sub/firmware' build (platform/makefile_sub.inc). I can't just add it there though because that breaks the thumb build in the 'sub/firmware' directory.
Thanks for the tip, I didn't appreciate the significance of the discussion in http://chdk.setepontos.com/index.php?topic=12115.0 the first time around.

IMO, we need a more general solution before EABI builds are ready for prime time.

*

Offline philmoz

  • *****
  • 3070
    • Photos
Re: A540 histogram + zebra crash with EABI build
« Reply #9 on: 24 / August / 2015, 00:45:53 »
Quote
If you move the offending code from platform/a540/sub/100b/lib.c to a540/sub/lib.c it will compile the wrappers correctly.

The current build rules are missing 'CFLAGS+=-march=armv4' for EABI in the 'sub/firmware' build (platform/makefile_sub.inc). I can't just add it there though because that breaks the thumb build in the 'sub/firmware' directory.
Thanks for the tip, I didn't appreciate the significance of the discussion in http://chdk.setepontos.com/index.php?topic=12115.0 the first time around.

IMO, we need a more general solution before EABI builds are ready for prime time.

I was not able to find a compile option to control the interworking wrapper code generated, other than the ARM architecture version.

Unfortunately this means platform/camera code needs 'armv4' to get the correct wrappers; but platform/camera/sub/firmware needs 'armv5te' for many cameras to support the instruction set being used.

Agreed it needs a better solution; but I don't have one right now.

Phil.

 
