help: how to debug script interrupting? - page 20 - General Discussion and Assistance - CHDK Forum

help: how to debug script interrupting?

  • 224 Replies
  • 32066 Views
*

Offline reyalp

  • ******
  • 13139
Re: help: how to debug script interrupting?
« Reply #190 on: 05 / May / 2013, 19:12:19 »
Advertisements
Throwing cold water on this theory: Running the autobuild, shooting, with a rag 'blanket' keeping the D10 warm, it hasn't hit the error yet in 1700 shots, CCD temp 54 (higher than any of the previous runs)

edit:
It did however die with "ERROR: NULL error message" on shot 1907
« Last Edit: 05 / May / 2013, 19:28:37 by reyalp »
Don't forget what the H stands for.

*

Offline srsa_4c

  • ******
  • 4396
Re: help: how to debug script interrupting?
« Reply #191 on: 05 / May / 2013, 20:09:11 »
- trying CHDK compiled as ARM
We can strike this from the list. Thanks to Phil's changes, I was able to try ARM CHDK on the ixus110, and the test still managed to trigger the error.
Thanks for testing that. What script did you use to test?
outslider's, no shoot mode in af lock. Build was 2761 + reyalp-lua-debug-trunk-2735-2.patch . It happened pretty soon (46 cycles). Without looking into details, is it normal for the script to stop on this condition?

I wonder whether one of the following could have something to do with this issue:
- bug in a system (Canon) function. There have been lots of "Canon bugs" reported on the ML forum lately (don't know what caused them to surface, only saw the reports).
- the task in which context Lua runs blocks some other task, which fails to update something that "we" rely on. Already tried to mention this issue before (as "gcc optimization").

*

Offline reyalp

  • ******
  • 13139
Re: help: how to debug script interrupting?
« Reply #192 on: 06 / May / 2013, 01:06:05 »
It happened pretty soon (46 cycles). Without looking into details, is it normal for the script to stop on this condition?
If it hits the bug, it should continue, and just increment the bork_count global (b in misc debug vals) and write res1.dmp.

If the script died, then it isn't the same as what I've been seeing.
Quote
I wonder whether one of the following could have something to do with this issue:
- bug in a system (Canon) function. There have been lots of "Canon bugs" reported on the ML forum lately (don't know what caused them to surface, only saw the reports).
- the task in which context Lua runs blocks some other task, which fails to update something that "we" rely on. Already tried to mention this issue before (as "gcc optimization").
I'm open to both these options, but I don't see any specific mechanism or theory to test.

As far as I understand the bug:
In luaD_rawrunprotected
Code: [Select]
  L->errorJmp = lj.previous;  /* restore old error handler */
  buflog_put("<luaD_rawrunprotected");
  memcpy(&buflog.lj,&lj,sizeof(lj));
  return lj.status;
The copy of lj saved with memcpy here has status 0

Yet in lua_resume
Code: [Select]
  status = luaD_rawrunprotected(L, resume, L->top - nargs);
  if (status != 0) {  /* error? */
status is not 0, instead, it corresponds to the address of the instruction that put lj.status into r0, for e.g.
Code: [Select]
26fd9c: 6e20      ldr r0, [r4, #96] ; 0x60
the status seen in lua_resume is 0x26fd9c

:blink:

If this was a compiler bug, then it should show up in the disassembly. Nothing stands out, and it's very hard to see how it would work a million times and then fail.

If it's a Canon bug, then it has to be something has to be something asynchronous, like interrupts or task context switching. But if those are broken, how does it only happen in this spot?

We know it's not corruption of the code, because if we ignore the bogus return value, it can run thousands of iterations more without problems.

One thing I did notice which seems almost plausible:
The return from  luaD_rawrunprotected (in my builds and the autobuild) looks like this
Code: [Select]
  27ccb8: ab03      add r3, sp, #12
  27ccba: b01c      add sp, #112 ; 0x70
  27ccbc: 6e18      ldr r0, [r3, #96] ; 0x60
  27ccbe: bc10      pop {r4}
  27ccc0: bc02      pop {r1}
  27ccc2: 4708      bx r1
If something like an interrupt handler used the kbd_task stack (pushing stuff onto it and restoring it later) between the add sp and the ldr, then it could clobber the value actually loaded into r0. However, ARM does not generally do this: interrupt handlers get their own independent banked copy of SP. The fact that the compiler generates code like this suggests it's legal in normal ARM environments. Given that the compiler does generate it, we'd also expect the same pattern to occur elsewhere, though browsing the disassembly of the lua module, it doesn't seem terribly common. I thought it might special because of the volatile on lj.status, but taking that out didn't appear to affect the generated code.

I think this is very unlikely, but it's possible to test by saving a copy of the stack above SP, right after the return to lua_resume. I'm trying to run this now, but haven't triggered the bug with that code yet.
Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 13139
Re: help: how to debug script interrupting?
« Reply #193 on: 06 / May / 2013, 02:10:27 »
Quote
I think this is very unlikely, but it's possible to test by saving a copy of the stack above SP, right after the return to lua_resume. I'm trying to run this now, but haven't triggered the bug with that code yet.
Now I have. The bad address appears to be on the stack after the return...

edit:
idea from philmoz
"could we test this by hand crafting a simple function with the right return code and call it in a loop"

This seems worthwhile.

Another approach is to investigate the interrupt handler and task switching code in the firmware. As nice as this theory sounds at the moment, I would definitely like to see the smoking gun.
« Last Edit: 06 / May / 2013, 02:44:40 by reyalp »
Don't forget what the H stands for.


*

Offline philmoz

  • *****
  • 3291
    • Photos
Re: help: how to debug script interrupting?
« Reply #194 on: 06 / May / 2013, 02:52:47 »
Quote
I think this is very unlikely, but it's possible to test by saving a copy of the stack above SP, right after the return to lua_resume. I'm trying to run this now, but haven't triggered the bug with that code yet.
Now I have. The bad address appears to be on the stack after the return...

edit:
idea from philmoz
"could we test this by hand crafting a simple function with the right return code and call it in a loop"

This seems worthwhile.

Another approach is to investigate the interrupt handler and task switching code in the firmware. As nice as this theory sounds at the moment, I would definitely like to see the smoking gun.

Another idea to avoid the problem, if the current theory is correct, would be to move the status value to before the jmpbuf memory in lj.

Phil

Edit:

I did some testing with various versions of GCC and optimisation settings to see which generated the suspect return code. (The 4.3.6 build did not have reyalp's debug patches hence the -Os build size is smaller than 4.4.0 and 4.5.1.)

GCC VerOptimisationCode genCore CHDK size
4.3.6-O1Ok155008
4.3.6-O2Suspect153472
4.3.6-OsSuspect147039
4.4.0-O1Ok157168
4.4.0-O2Suspect154480
4.4.0-OsSuspect147968
4.5.1-O1Ok156368
4.5.1-O2Suspect153472
4.5.1-OsSuspect147455
4.6.2-O1Ok155552
4.6.2-O2Ok153536
4.6.2-OsOk145615

So changing the optimisation settings (unless GCC 4.6.2 is being used) might be the way to go.
This will cost ~8-10K of memory on the core CHDK code (plus some overhead in the modules).

Note: current code has -Os as the optimisation setting.

Phil.
« Last Edit: 06 / May / 2013, 05:05:41 by philmoz »
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

*

Offline philmoz

  • *****
  • 3291
    • Photos
Re: help: how to debug script interrupting?
« Reply #195 on: 06 / May / 2013, 08:06:13 »
Another approach is to investigate the interrupt handler and task switching code in the firmware. As nice as this theory sounds at the moment, I would definitely like to see the smoking gun.

I looked at the IRQ code in the firmware; but it made my head hurt and I got lost :)

One thought I had though was to record the tick count in lua_rawrunprotected just before leaving (before the SP adjustment) and then again in lua_resume when the error is detected. This might show if an interrupt / context switch occurred within the suspect instructions.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

*

Offline reyalp

  • ******
  • 13139
Re: help: how to debug script interrupting?
« Reply #196 on: 06 / May / 2013, 13:06:32 »
One thought I had though was to record the tick count in lua_rawrunprotected just before leaving (before the SP adjustment) and then again in lua_resume when the error is detected. This might show if an interrupt / context switch occurred within the suspect instructions.
For context switches, I suspect we could just increment a counter in the dryos "task hook" and record it before and after rawrun return.

I think there might be a hook that's called in the irq code too, but I'll have to look at that more.

If I was going to guess (probably a bad idea) I'd expect the problem is context switches, e.g
physw task -> timer irq -> dryos scheduler -> some other task(s) ... ->  yields or timer irq -> dryos scheduler -> physw task. It seems less likely that an irq which is handled entirely in the irq handler code would be a problem, since it has it's own SP, and there's a well defined way to return from that without using the tasks stack.

Good information about compiler versions and optimization options. One thing I would worry about is whether a particular actually never generates this sequence or it might still choose to in some particular situation.

If we can break it down to an individual optimization option rather than the group ones like O1, O2, Os that might be helpful, and might help us avoid the size impact. http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html may help
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3291
    • Photos
Re: help: how to debug script interrupting?
« Reply #197 on: 06 / May / 2013, 15:52:44 »
If we can break it down to an individual optimization option rather than the group ones like O1, O2, Os that might be helpful, and might help us avoid the size impact. http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html may help

Good idea; but with over 30 options turned on by O2 it might take a while to narrow down (especially if it's a combination).
Might wait till the current theory is proven to be the cause.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)


*

Offline reyalp

  • ******
  • 13139
Re: help: how to debug script interrupting?
« Reply #198 on: 06 / May / 2013, 16:06:10 »
Good idea; but with over 30 options turned on by O2 it might take a while to narrow down (especially if it's a combination).
Might wait till the current theory is proven to be the cause.
I agree. If there's some that stand out as likely from name or description it might be worth trying, but otherwise not worth the effort at this point. If it does come down to trying them all, it should be possible to semi-automate it: a single c file with a function that shows the problem, a shell script that disables each option in turn, objdump on the resulting .o

Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 13139
Re: help: how to debug script interrupting?
« Reply #199 on: 06 / May / 2013, 16:46:38 »
I think there might be a hook that's called in the irq code too, but I'll have to look at that more.
From the d10 IRQ handler code
Code: [Select]
ROM:FF8102B8                 LDR     R1, =0x193C
ROM:FF8102BC                 LDR     R1, [R1]
ROM:FF8102C0                 CMP     R1, #0
ROM:FF8102C4                 MOVNE   R0, R4,LSR#2
ROM:FF8102C8                 MOVNE   LR, PC
ROM:FF8102CC                 MOVNE   PC, R1
...
ROM:FF810314                 LDR     R1, =0x1940
ROM:FF810318                 LDR     R1, [R1]
ROM:FF81031C                 CMP     R1, #0
ROM:FF810320                 MOVNE   R0, R4,LSR#2
ROM:FF810324                 MOVNE   LR, PC
ROM:FF810328                 MOVNE   PC, R1
I think this means that 0x193c and 0x1940 is effectively an "IRQ hooks". From a quick observation, it looks like they are null in normal operation. I assume R0 is something to do with the IRQ source.

Note that the IRQ handler and some associated variables are copied to instruction TCM (starting at 0x4B0) so if you look at it using the original ROM addresses, things are a bit off.
Don't forget what the H stands for.

 

Related Topics