help: how to debug script interrupting? - page 21 - General Discussion and Assistance - CHDK Forum
supplierdeeply

help: how to debug script interrupting?

  • 224 Replies
  • 44846 Views
*

Offline srsa_4c

  • ******
  • 4450
Re: help: how to debug script interrupting?
« Reply #200 on: 06 / May / 2013, 16:57:55 »
Advertisements
I still get B -> 1 using the last reported source + gcc 4.6.4 on the ixus110. I'm still testing with outslider's modified script, which still stops when this error happens. How can I determine the cause from the dumps? The end of luaD_rawrunprotected looks like this:

 3bb70c4:   4b0a         ldr   r3, [pc, #40]   (3bb70f0 <luaD_rawrunprotected+0x94>)
 3bb70c6:   480b         ldr   r0, [pc, #44]   (3bb70f4 <luaD_rawrunprotected+0x98>)
 3bb70c8:   f000 fcb7    bl   3bb7a3a <luaD_protectedparser+0x42>
 3bb70cc:   981b         ldr   r0, [sp, #108]
 3bb70ce:   b01c         add   sp, #112
 3bb70d0:   bc10         pop   {r4}
 3bb70d2:   bc02         pop   {r1}
 3bb70d4:   4708         bx   r1

Am I missing something?
« Last Edit: 06 / May / 2013, 16:59:33 by srsa_4c »

*

Offline reyalp

  • ******
  • 13718
Re: help: how to debug script interrupting?
« Reply #201 on: 06 / May / 2013, 17:22:59 »
I still get B -> 1 using the last reported source + gcc 4.6.4 on the ixus110. I'm still testing with outslider's modified script, which still stops when this error happens.
If you hit a normal lua error, or if the address doesn't match the expected range, the script will still stop and b will still increment. It shouldn't stop if it's the same error I've been seeing.

Look on the card for *.DMP files. If it's this bug, there should only be resN.dmp (N corresponding to the value of B after the error hit). If it's a "normal" lua error, then there should be several other dumps. Be sure to check the dates/times to see if they are all from the same event. If the timestamps on the files aren't useful, the time is recorded inside the dump.

To get some info from the dumps (using chdkptp and the latest lua files from svn)
Code: [Select]
!m=require'extras/dbgdump'
!d=m.load('RES1.DMP')
# gives you occurred time, core and module load addresses
!d:print_summary()
To get the log of which functions were called, you can use
Code: [Select]
# print the pointer to the end of the buffer
!print(d:udatau(0))
# print the buffer up to the end
!print(d:user_data_str():sub(5,d:udatau(0)+4))
At the end, you should see something like
Code: [Select]
script_run 299680
>lua_resume
>luaD_rawrunprotected
>resume
resume yield
sleep
>lua_yield
<lua_yield
<resume
<luaD_rawrunprotected
lua_resume 27ccbc
ignored
If the script stopped, the "ignored" probably won't be there
the lua_resume xxxxx tells you the value of status from luaD_rawrunprotected.

If you want get the lua module disassembled at the address it was loaded at when the dump was made, you can use
Code: [Select]
!os.execute(string.format('arm-elf-objdump -d -r -x --adjust-vma=0x%x lua.elf > lua.%x.objdump',d.modules[1].text_addr,d.modules[1].text_addr))

Don't forget what the H stands for.

*

Offline srsa_4c

  • ******
  • 4450
Re: help: how to debug script interrupting?
« Reply #202 on: 06 / May / 2013, 17:28:31 »
If you hit a normal lua error, or if the address doesn't match the expected range, the script will still stop and b will still increment. It shouldn't stop if it's the same error I've been seeing.

Look on the card for *.DMP files. If it's this bug, there should only be resN.dmp (N corresponding to the value of B after the error hit). If it's a "normal" lua error, then there should be several other dumps.
Thanks, I have several dumps, not just res1.dmp, so it's false alarm then :)
And of course thanks for the above instructions.

*

Offline philmoz

  • *****
  • 3411
    • Photos
Re: help: how to debug script interrupting?
« Reply #203 on: 06 / May / 2013, 18:43:35 »
Another approach is to investigate the interrupt handler and task switching code in the firmware. As nice as this theory sounds at the moment, I would definitely like to see the smoking gun.

Another idea to see if the context switch is using the task stack.

Add assembler code in spytask just before the main loop that:
- writes a guard value to SP-4 (without changing SP)
- loops until the guard value changes
- saves or displays the altered value

If the theory is correct the loop will exit and the altered value will be an address in the spytask code - and you will have the smoking gun (or at least the bullet hole).

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)


*

Offline reyalp

  • ******
  • 13718
Re: help: how to debug script interrupting?
« Reply #204 on: 06 / May / 2013, 22:27:25 »
Thanks, I have several dumps, not just res1.dmp, so it's false alarm then :)
And of course thanks for the above instructions.
If the script is hitting a lua error after running many exposures, it would be good to know why. Do you see any errors in the console?

If you want to post the dumps, I can take a look at them.
Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 13718
Re: help: how to debug script interrupting?
« Reply #205 on: 07 / May / 2013, 02:47:06 »
A new test patch. Basic idea
1) create a task that fills 64 words of stack, (more than a full set of registers)
2) adjust SP so it's "free"
3) sit in a tight loop comparing the first past sp with the fill value.
4) if it ever becomes not equal, record the values and end.

For convenience, this is exposed as two lua functions
test_start creates the task if it isn't already running. Returns nil if already running
test_get_info returns
boolean: test is running or not
number: number of iterations
string: 256 byte binary values recorded in step 3 above

This triggers very fast, and I think conclusively shows our problem.

example run
Code: [Select]
=return test_start()
!status,run,count,buf=con:execwait('return test_get_info()') if status and not run then printf('%d\n%s',count,util.hexdump_words(buf)) else return status, run end
63381
0x00000000: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000010: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000020: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000030: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000040: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000050: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000060: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000070: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000080: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000090: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000a0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000b0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000c0: 0x00000033 0xdeadbeef 0xdeadbeef 0x00333524
0x000000d0: 0x0000f795 0x001694dc 0x19980218 0x19980218
0x000000e0: 0x19980218 0x19980218 0x19980218 0x19980218
0x000000f0: 0x19980218 0x00000000 0x00146aa9 0x00146a66
We see it happened after 63381 iterations this time.

Note the values are "backwards" since the stack grows down. The last value is the value closest to SP.

Looking up 0x00146a66 in main.bin.dump:
Code: [Select]
  146a66: 6023      str r3, [r4, #0]
The address of the instruction that stores the counter. The next value is the LR value of the function.

Running a few more times on d10 gave trigger counts of 57888, 56407, 57380, 73440, 10889

Top values included 0x00146a60 (cmp   r0, r1), 0x00146a5e (ldr   r0, [r2, #0]), 0x00146a62 (bne.n   146a6a)

It hits fast enough to do in one command:
Code: [Select]
!status,run,count,buf=con:execwait('test_start(); sleep(10) ; return test_get_info()') if status and not run then printf('%d\n%s',count,util.hexdump_words(buf)) else return status, run end

on vxworks (a540) it hangs the camera, presumably because it doesn't want a task sitting in a tight loop like that.
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3411
    • Photos
Re: help: how to debug script interrupting?
« Reply #206 on: 07 / May / 2013, 04:27:17 »
A new test patch. Basic idea
1) create a task that fills 64 words of stack, (more than a full set of registers)
2) adjust SP so it's "free"
3) sit in a tight loop comparing the first past sp with the fill value.
4) if it ever becomes not equal, record the values and end.

This triggers very fast, and I think conclusively shows our problem.

Bravo - most definitely proves the theory.

Quote
example run
Code: [Select]
=return test_start()
!status,run,count,buf=con:execwait('return test_get_info()') if status and not run then printf('%d\n%s',count,util.hexdump_words(buf)) else return status, run end
63381
0x00000000: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000010: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000020: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000030: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000040: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000050: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000060: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000070: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000080: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000090: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000a0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000b0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000c0: 0x00000033 0xdeadbeef 0xdeadbeef 0x00333524
0x000000d0: 0x0000f795 0x001694dc 0x19980218 0x19980218
0x000000e0: 0x19980218 0x19980218 0x19980218 0x19980218
0x000000f0: 0x19980218 0x00000000 0x00146aa9 0x00146a66

Is that the maximum amount of task stack space it is using?
If so a quick fix would be to move 'status' to before 'b' in 'lj' - b is 92 bytes long so would protect the status value. Not a good fix I know; but a workaround for the time being.

Quote
on vxworks (a540) it hangs the camera, presumably because it doesn't want a task sitting in a tight loop like that.

Also shows vxworks is not doing pre-emptive task switches.

A quick scan of the core and module dumps (GCC 4.5.1, -Os) shows at least two other places where the bad code is being generated (shooting_get_prop and LoadChar).

So I can see two short term solutions:
- it's a rare problem and only affects Lua so far - so move the 'status' variable to protect it.
- or, change the global optimsations to O1 (until it can be narrowed down).

Edit: I looked at the optimisations enabled by O2 and tried the ones that did instruction re-ordering - got lucky and found that the dodgy optimisation appears to be ' -fschedule-insns2'.

Changing makefile.inc to the line below removes the bad code on 4.3.6, 4.4.0 and 4.5.1 when using -Os, with no impact to build size. (4.6.2 is unaffected)
Code: [Select]
CFLAGS=-fno-inline -Os -fno-strict-aliasing -fno-schedule-insns2
If you can confirm this finding on your builds we can add this to SVN and put this nightmare to rest :D

Phil.
« Last Edit: 07 / May / 2013, 05:06:56 by philmoz »
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

Re: help: how to debug script interrupting?
« Reply #207 on: 07 / May / 2013, 09:01:05 »
Wow - congratulations and a huge thank you to reyalp & philmoz!!

I've read all the posts but I'm still not quite sure what you found.   Is this a case of the 'optimized code' doing something bad with its stack pointer so that when an interrupt hits,  the stack gets corrupted ?  ( i.e. the stack pointer is not at the bottom of the stack where it should be ).  Or does the task context switch code doing something bad with the stack pointer it inherits from the interrupt?  (I guess I'd need to read more about ARM interrupt handling to answer that question myself.)

Also,  is this a compiler bug that should be submittted somewhere to whoever maintains the gcc code?

Ported :   A1200    SD940   G10    Powershot N    G16


*

Offline srsa_4c

  • ******
  • 4450
Re: help: how to debug script interrupting?
« Reply #208 on: 07 / May / 2013, 09:08:24 »
Edit: I looked at the optimisations enabled by O2 and tried the ones that did instruction re-ordering - got lucky and found that the dodgy optimisation appears to be ' -fschedule-insns2'.
Good find, congrats to you & reyalp. Perhaps we should put a note about the issue when appending -fno-schedule-insns2 to CFLAGS.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644

If the script is hitting a lua error after running many exposures, it would be good to know why. Do you see any errors in the console?
I forgot to look after this. Yes, "not enough memory". The script eats up memory somehow (note that this cam is using exmem).

*

Offline philmoz

  • *****
  • 3411
    • Photos
Re: help: how to debug script interrupting?
« Reply #209 on: 07 / May / 2013, 09:17:50 »
Wow - congratulations and a huge thank you to reyalp & philmoz!!

I've read all the posts but I'm still not quite sure what you found.   Is this a case of the 'optimized code' doing something bad with its stack pointer so that when an interrupt hits,  the stack gets corrupted ?  ( i.e. the stack pointer is not at the bottom of the stack where it should be ).  Or does the task context switch code doing something bad with the stack pointer it inherits from the interrupt?  (I guess I'd need to read more about ARM interrupt handling to answer that question myself.)

Also,  is this a compiler bug that should be submittted somewhere to whoever maintains the gcc code?

The compiler is re-ordering (optimising) instructions such that the stack pointer (SP) is adjusted before the status value is read from the stack memory.

The second part of the problem is the firmware task switcher uses the tasks stack when switching.

So an interrupt occurring after SP has been changed; but before the status value is loaded from memory means that the task switch code was clobbering the status value (since it was now in memory below the stack pointer).

It's a bit like using a memory buffer after calling 'free' on it.

Since GCC 4.6.2 no longer emits the incorrect code I suspect the problem has been reported and fixed.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

 

Related Topics