help: how to debug script interrupting?

srsa_4c

4451

Re: help: how to debug script interrupting?

« Reply #200 on: 06 / May / 2013, 16:57:55 »

Advertisements

I still get B -> 1 using the last reported source + gcc 4.6.4 on the ixus110. I'm still testing with outslider's modified script, which still stops when this error happens. How can I determine the cause from the dumps? The end of luaD_rawrunprotected looks like this:
3bb70c4: 4b0a ldr r3, [pc, #40] (3bb70f0 <luaD_rawrunprotected+0x94>) 3bb70c6: 480b ldr r0, [pc, #44] (3bb70f4 <luaD_rawrunprotected+0x98>) 3bb70c8: f000 fcb7 bl 3bb7a3a <luaD_protectedparser+0x42> 3bb70cc: 981b ldr r0, [sp, #108] 3bb70ce: b01c add sp, #112 3bb70d0: bc10 pop {r4} 3bb70d2: bc02 pop {r1} 3bb70d4: 4708 bx r1
Am I missing something?

« Last Edit: 06 / May / 2013, 16:59:33 by srsa_4c »

Logged

reyalp

14079

Re: help: how to debug script interrupting?

« Reply #201 on: 06 / May / 2013, 17:22:59 »

Quote from: srsa_4c on 06 / May / 2013, 16:57:55

I still get B -> 1 using the last reported source + gcc 4.6.4 on the ixus110. I'm still testing with outslider's modified script, which still stops when this error happens.

If you hit a normal lua error, or if the address doesn't match the expected range, the script will still stop and b will still increment. It shouldn't stop if it's the same error I've been seeing.

Look on the card for *.DMP files. If it's this bug, there should only be resN.dmp (N corresponding to the value of B after the error hit). If it's a "normal" lua error, then there should be several other dumps. Be sure to check the dates/times to see if they are all from the same event. If the timestamps on the files aren't useful, the time is recorded inside the dump.

To get some info from the dumps (using chdkptp and the latest lua files from svn)

Code: [Select]

!m=require'extras/dbgdump'
!d=m.load('RES1.DMP')
# gives you occurred time, core and module load addresses
!d:print_summary()

To get the log of which functions were called, you can use

Code: [Select]

# print the pointer to the end of the buffer
!print(d:udatau(0))
# print the buffer up to the end
!print(d:user_data_str():sub(5,d:udatau(0)+4))

At the end, you should see something like

Code: [Select]

script_run 299680
>lua_resume
>luaD_rawrunprotected
>resume
resume yield
sleep
>lua_yield
<lua_yield
<resume
<luaD_rawrunprotected
lua_resume 27ccbc
ignored

If the script stopped, the "ignored" probably won't be there
the lua_resume xxxxx tells you the value of status from luaD_rawrunprotected.

If you want get the lua module disassembled at the address it was loaded at when the dump was made, you can use

Code: [Select]

!os.execute(string.format('arm-elf-objdump -d -r -x --adjust-vma=0x%x lua.elf > lua.%x.objdump',d.modules[1].text_addr,d.modules[1].text_addr))

Logged

Don't forget what the H stands for.

srsa_4c

4451

Re: help: how to debug script interrupting?

« Reply #202 on: 06 / May / 2013, 17:28:31 »

Quote from: reyalp on 06 / May / 2013, 17:22:59

If you hit a normal lua error, or if the address doesn't match the expected range, the script will still stop and b will still increment. It shouldn't stop if it's the same error I've been seeing.

Look on the card for *.DMP files. If it's this bug, there should only be resN.dmp (N corresponding to the value of B after the error hit). If it's a "normal" lua error, then there should be several other dumps.

Thanks, I have several dumps, not just res1.dmp, so it's false alarm then

And of course thanks for the above instructions.

Logged

philmoz

3450

Re: help: how to debug script interrupting?

« Reply #203 on: 06 / May / 2013, 18:43:35 »

Quote from: reyalp on 06 / May / 2013, 02:10:27

Another approach is to investigate the interrupt handler and task switching code in the firmware. As nice as this theory sounds at the moment, I would definitely like to see the smoking gun.

Another idea to see if the context switch is using the task stack.

Add assembler code in spytask just before the main loop that:
- writes a guard value to SP-4 (without changing SP)
- loops until the guard value changes
- saves or displays the altered value

If the theory is correct the loop will exit and the altered value will be an address in the spytask code - and you will have the smoking gun (or at least the bullet hole).

Phil.

Logged

CHDK ports:
sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
g12 (1.00c, 1.00e, 1.00f & 1.00g)
sx130is (1.01d & 1.01f)
ixus310hs (1.00a & 1.01a)
sx40hs (1.00d, 1.00g & 1.00i)
g1x (1.00e, 1.00f & 1.00g)
g5x (1.00c, 1.01a, 1.01b)
g7x2 (1.01a, 1.01b, 1.10b)

reyalp

14079

Re: help: how to debug script interrupting?

« Reply #204 on: 06 / May / 2013, 22:27:25 »

Quote from: srsa_4c on 06 / May / 2013, 17:28:31

Thanks, I have several dumps, not just res1.dmp, so it's false alarm then
And of course thanks for the above instructions.

If the script is hitting a lua error after running many exposures, it would be good to know why. Do you see any errors in the console?

If you want to post the dumps, I can take a look at them.

Logged

Don't forget what the H stands for.

reyalp

14079

Re: help: how to debug script interrupting?

« Reply #205 on: 07 / May / 2013, 02:47:06 »

A new test patch. Basic idea
1) create a task that fills 64 words of stack, (more than a full set of registers)
2) adjust SP so it's "free"
3) sit in a tight loop comparing the first past sp with the fill value.
4) if it ever becomes not equal, record the values and end.

For convenience, this is exposed as two lua functions
test_start creates the task if it isn't already running. Returns nil if already running
test_get_info returns
boolean: test is running or not
number: number of iterations
string: 256 byte binary values recorded in step 3 above

This triggers very fast, and I think conclusively shows our problem.

example run

Code: [Select]

 =return test_start()
!status,run,count,buf=con:execwait('return test_get_info()') if status and not run then printf('%d\n%s',count,util.hexdump_words(buf)) else return status, run end
63381
0x00000000: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000010: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000020: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000030: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000040: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000050: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000060: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000070: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000080: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x00000090: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000a0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000b0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef
0x000000c0: 0x00000033 0xdeadbeef 0xdeadbeef 0x00333524
0x000000d0: 0x0000f795 0x001694dc 0x19980218 0x19980218
0x000000e0: 0x19980218 0x19980218 0x19980218 0x19980218
0x000000f0: 0x19980218 0x00000000 0x00146aa9 0x00146a66

We see it happened after 63381 iterations this time.

Note the values are "backwards" since the stack grows down. The last value is the value closest to SP.

Looking up 0x00146a66 in main.bin.dump:

Code: [Select]

  146a66:	6023      	str	r3, [r4, #0]

The address of the instruction that stores the counter. The next value is the LR value of the function.

Running a few more times on d10 gave trigger counts of 57888, 56407, 57380, 73440, 10889

Top values included 0x00146a60 (cmp r0, r1), 0x00146a5e (ldr r0, [r2, #0]), 0x00146a62 (bne.n 146a6a)

It hits fast enough to do in one command:

Code: [Select]

!status,run,count,buf=con:execwait('test_start(); sleep(10) ; return test_get_info()') if status and not run then printf('%d\n%s',count,util.hexdump_words(buf)) else return status, run end

on vxworks (a540) it hangs the camera, presumably because it doesn't want a task sitting in a tight loop like that.

reyalp-lua-debug-trunk-2761-2.patch (3.85 kB - downloaded 16 times.)

Logged

Don't forget what the H stands for.

philmoz

3450

Re: help: how to debug script interrupting?

« Reply #206 on: 07 / May / 2013, 04:27:17 »

Quote from: reyalp on 07 / May / 2013, 02:47:06

A new test patch. Basic idea
1) create a task that fills 64 words of stack, (more than a full set of registers)
2) adjust SP so it's "free"
3) sit in a tight loop comparing the first past sp with the fill value.
4) if it ever becomes not equal, record the values and end.

This triggers very fast, and I think conclusively shows our problem.

Bravo - most definitely proves the theory.

Quote

example run
Code: [Select]
=return test_start() !status,run,count,buf=con:execwait('return test_get_info()') if status and not run then printf('%d\n%s',count,util.hexdump_words(buf)) else return status, run end 63381 0x00000000: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000010: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000020: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000030: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000040: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000050: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000060: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000070: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000080: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x00000090: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x000000a0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x000000b0: 0xdeadbeef 0xdeadbeef 0xdeadbeef 0xdeadbeef 0x000000c0: 0x00000033 0xdeadbeef 0xdeadbeef 0x00333524 0x000000d0: 0x0000f795 0x001694dc 0x19980218 0x19980218 0x000000e0: 0x19980218 0x19980218 0x19980218 0x19980218 0x000000f0: 0x19980218 0x00000000 0x00146aa9 0x00146a66

Is that the maximum amount of task stack space it is using?
If so a quick fix would be to move 'status' to before 'b' in 'lj' - b is 92 bytes long so would protect the status value. Not a good fix I know; but a workaround for the time being.

Quote

on vxworks (a540) it hangs the camera, presumably because it doesn't want a task sitting in a tight loop like that.

Also shows vxworks is not doing pre-emptive task switches.

A quick scan of the core and module dumps (GCC 4.5.1, -Os) shows at least two other places where the bad code is being generated (shooting_get_prop and LoadChar).

So I can see two short term solutions:
- it's a rare problem and only affects Lua so far - so move the 'status' variable to protect it.
- or, change the global optimsations to O1 (until it can be narrowed down).

Edit: I looked at the optimisations enabled by O2 and tried the ones that did instruction re-ordering - got lucky and found that the dodgy optimisation appears to be ' -fschedule-insns2'.

Changing makefile.inc to the line below removes the bad code on 4.3.6, 4.4.0 and 4.5.1 when using -Os, with no impact to build size. (4.6.2 is unaffected)

Code: [Select]

CFLAGS=-fno-inline -Os -fno-strict-aliasing -fno-schedule-insns2

If you can confirm this finding on your builds we can add this to SVN and put this nightmare to rest

Phil.

« Last Edit: 07 / May / 2013, 05:06:56 by philmoz »

Logged

CHDK ports:
sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
g12 (1.00c, 1.00e, 1.00f & 1.00g)
sx130is (1.01d & 1.01f)
ixus310hs (1.00a & 1.01a)
sx40hs (1.00d, 1.00g & 1.00i)
g1x (1.00e, 1.00f & 1.00g)
g5x (1.00c, 1.01a, 1.01b)
g7x2 (1.01a, 1.01b, 1.10b)

waterwingz

12138

Re: help: how to debug script interrupting?

« Reply #207 on: 07 / May / 2013, 09:01:05 »

Wow - congratulations and a huge thank you to reyalp & philmoz!!

I've read all the posts but I'm still not quite sure what you found. Is this a case of the 'optimized code' doing something bad with its stack pointer so that when an interrupt hits, the stack gets corrupted ? ( i.e. the stack pointer is not at the bottom of the stack where it should be ). Or does the task context switch code doing something bad with the stack pointer it inherits from the interrupt? (I guess I'd need to read more about ARM interrupt handling to answer that question myself.)

Also, is this a compiler bug that should be submittted somewhere to whoever maintains the gcc code?

Logged

Ported : A1200 SD940 G10 Powershot N G16

srsa_4c

4451

Re: help: how to debug script interrupting?

« Reply #208 on: 07 / May / 2013, 09:08:24 »

Quote from: philmoz on 07 / May / 2013, 04:27:17

Edit: I looked at the optimisations enabled by O2 and tried the ones that did instruction re-ordering - got lucky and found that the dodgy optimisation appears to be ' -fschedule-insns2'.

Good find, congrats to you & reyalp. Perhaps we should put a note about the issue when appending -fno-schedule-insns2 to CFLAGS.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38644

Quote from: reyalp on 06 / May / 2013, 22:27:25

If the script is hitting a lua error after running many exposures, it would be good to know why. Do you see any errors in the console?

I forgot to look after this. Yes, "not enough memory". The script eats up memory somehow (note that this cam is using exmem).

Logged

philmoz

3450

Re: help: how to debug script interrupting?

« Reply #209 on: 07 / May / 2013, 09:17:50 »

Quote from: waterwingz on 07 / May / 2013, 09:01:05

Wow - congratulations and a huge thank you to reyalp & philmoz!!

I've read all the posts but I'm still not quite sure what you found. Is this a case of the 'optimized code' doing something bad with its stack pointer so that when an interrupt hits, the stack gets corrupted ? ( i.e. the stack pointer is not at the bottom of the stack where it should be ). Or does the task context switch code doing something bad with the stack pointer it inherits from the interrupt? (I guess I'd need to read more about ARM interrupt handling to answer that question myself.)

Also, is this a compiler bug that should be submittted somewhere to whoever maintains the gcc code?

The compiler is re-ordering (optimising) instructions such that the stack pointer (SP) is adjusted before the status value is read from the stack memory.

The second part of the problem is the firmware task switcher uses the tasks stack when switching.

So an interrupt occurring after SP has been changed; but before the status value is loaded from memory means that the task switch code was clobbering the status value (since it was now in memory below the stack pointer).

It's a bit like using a memory buffer after calling 'free' on it.

Since GCC 4.6.2 no longer emits the incorrect code I suspect the problem has been reported and fixed.

Phil.

Logged

CHDK ports:
sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
g12 (1.00c, 1.00e, 1.00f & 1.00g)
sx130is (1.01d & 1.01f)
ixus310hs (1.00a & 1.01a)
sx40hs (1.00d, 1.00g & 1.00i)
g1x (1.00e, 1.00f & 1.00g)
g5x (1.00c, 1.01a, 1.01b)
g7x2 (1.01a, 1.01b, 1.10b)