help: how to debug script interrupting? - page 14 - General Discussion and Assistance - CHDK Forum

help: how to debug script interrupting?

  • 224 Replies
  • 66537 Views
*

Offline reyalp

  • ******
  • 14128
Re: help: how to debug script interrupting?
« Reply #130 on: 27 / April / 2013, 19:09:46 »
Advertisements
One of the changes I made in my debug code was to store a copy of lj in the debug buffer at the end of every call to  luaD_rawrunprotected

Code: [Select]
  L->errorJmp = lj.previous;  /* restore old error handler */
  buflog_put("<luaD_rawrunprotected");
  memcpy(&buflog.lj,&lj,sizeof(lj));
  return lj.status;

In my dump this shows that lj.status was indeed 0 :blink:

edit:
Also, the lres value:
26fd9c

Relocated lua.elf code:
Code: [Select]
26fd9c: 6e20      ldr r0, [r4, #96] ; 0x60
is indeed (as philmoz observed in lapsers thread) the address of the instruction that should be loading the return value.
« Last Edit: 27 / April / 2013, 19:22:37 by reyalp »
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3450
    • Photos
Re: help: how to debug script interrupting?
« Reply #131 on: 28 / April / 2013, 03:29:31 »
One of the changes I made in my debug code was to store a copy of lj in the debug buffer at the end of every call to  luaD_rawrunprotected

Code: [Select]
  L->errorJmp = lj.previous;  /* restore old error handler */
  buflog_put("<luaD_rawrunprotected");
  memcpy(&buflog.lj,&lj,sizeof(lj));
  return lj.status;

In my dump this shows that lj.status was indeed 0 :blink:

edit:
Also, the lres value:
26fd9c

Relocated lua.elf code:
Code: [Select]
26fd9c: 6e20      ldr r0, [r4, #96] ; 0x60
is indeed (as philmoz observed in lapsers thread) the address of the instruction that should be loading the return value.

I'm assuming that the Lres values matches what see in the buflog for 'lua_resume %x" in this code:
Code: [Select]
  status = luaD_rawrunprotected(L, resume, L->top - nargs);
  if (status != 0) {  /* error? */
    char s[32];
    sprintf(s,"lua_resume %x",status);
    buflog_put(s);

If so then is it possible that the return address from luaD_rawrunprotected is being changed somehow?
(Clutching at straws again here - I just can't see any possible way that what is happening could be happening.)

Perhaps save more data on either side of the 'lj' memory to capture the return address from luaD_rawrunprotected.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

*

Offline srsa_4c

  • ******
  • 4451
Re: help: how to debug script interrupting?
« Reply #132 on: 28 / April / 2013, 16:52:04 »
I have a few questions.
- Is this only happening on newer DryOS cameras?
- Does it matter whether the script is run in play or rec mode?

And a few things that were not mentioned yet (just ideas, not more).
- changing the priority of the keyboard task
- trying CHDK compiled as ARM
- changing gcc optimization http://chdk.setepontos.com/index.php?topic=265.msg96978#msg96978

*

Offline reyalp

  • ******
  • 14128
Re: help: how to debug script interrupting?
« Reply #133 on: 28 / April / 2013, 22:16:38 »
I have a few questions.
- Is this only happening on newer DryOS cameras?
This is unclear. I think it was reported on a620 in http://chdk.setepontos.com/index.php/topic,4131.msg39296.html#msg39296
Quote
- Does it matter whether the script is run in play or rec mode?
Unknown. I wasn't able to reproduce it with my a540 just sitting in play mode running a sleep(10) script for over 24 hours. I haven't tried the shooting script that works on D10 yet.

Quote
- changing the priority of the keyboard task
It seems like this has a high risk of undesirable side effects, since our kbd_task also implements the canon kbd_task, and uses the original priority value.
Quote
- trying CHDK compiled as ARM
Possible. Another idea would be to make a thumb-only version of setjmp/longjmp, since those appear to be the only thing that makes luaD_rawrunprotected unique.

Whatever it is happens in one specific location over a huge range of different builds, so it's hard to see it being a general problem with thumb.
Quote
- changing gcc optimization http://chdk.setepontos.com/index.php?topic=265.msg96978#msg96978
If it's a gcc optimization problem, it should show up in the dis-assembly of the affected area somehow. But the code seems fine...
Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 14128
Re: help: how to debug script interrupting?
« Reply #134 on: 28 / April / 2013, 22:24:10 »
I'm assuming that the Lres values matches what see in the buflog for 'lua_resume %x" in this code:
That's where I got it from, I didn't actually check the final lua_resume result. But they matched on lapsers tests.

Quote
If so then is it possible that the return address from luaD_rawrunprotected is being changed somehow?
Hmm, so it doesn't land on the if ? I suppose, although that doesn't explain the contents of the return value.

Quote
Perhaps save more data on either side of the 'lj' memory to capture the return address from luaD_rawrunprotected.
That's not a bad idea.
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3450
    • Photos
Re: help: how to debug script interrupting?
« Reply #135 on: 29 / April / 2013, 01:16:23 »
I'm assuming that the Lres values matches what see in the buflog for 'lua_resume %x" in this code:
That's where I got it from, I didn't actually check the final lua_resume result. But they matched on lapsers tests.

Quote
If so then is it possible that the return address from luaD_rawrunprotected is being changed somehow?
Hmm, so it doesn't land on the if ? I suppose, although that doesn't explain the contents of the return value.

Quote
Perhaps save more data on either side of the 'lj' memory to capture the return address from luaD_rawrunprotected.
That's not a bad idea.

Some other things it might be worth capturing (not sure how feasible these are):
- CPSR register to see if the code is executing ARM or THUMB more (although I would expect a crash if the camera was in the wrong mode)
- checksum of the Lua module code (not data) to see if something is altering any instructions (a full RAM dump might also give this)

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

*

Offline reyalp

  • ******
  • 14128
Re: help: how to debug script interrupting?
« Reply #136 on: 29 / April / 2013, 01:53:46 »
Some other things it might be worth capturing (not sure how feasible these are):
- CPSR register to see if the code is executing ARM or THUMB more (although I would expect a crash if the camera was in the wrong mode)
I think it has to be in thumb mode, since all the debug code I've added works fine. I have thought about triggering an exception when the error state occurs, so we'd get all the registers.

Quote
- checksum of the Lua module code (not data) to see if something is altering any instructions (a full RAM dump might also give this)
Another good idea. Probably just dumping out the module would be simplest way to go.

Attached is my latest test patch.
This will attempt to write a dump and then ignore the error by resetting status if it looks like the weird address value (but only the first 10 times). It also saves some of the stack when it saves lj.

I'm currently running this on both my cameras, using outsliders script in no-shoot mode. I have no idea if it will trigger if not actually shooting, but I guess it's worth finding out.

edit:
A540 ran for 96 minutes, 1300 shoot cycles, no error
D10 ran for 110 minutes, 1987 shoot cycles, no error

I'll try more tomorrow
« Last Edit: 29 / April / 2013, 03:22:05 by reyalp »
Don't forget what the H stands for.

*

Offline lapser

  • *****
  • 1093
Re: help: how to debug script interrupting?
« Reply #137 on: 29 / April / 2013, 10:12:08 »
It's possible that the bug is caused by an interrupt that doesn't restore the registers correctly. Since the errors only occur while shooting, maybe the interrupt is the one that closes the shutter at a precise time, assuming that's how the shutter is closed. It's possible that this interrupt happens during a normal, 10 msec tick count interrupt.

waterwingz has discovered errors in the capt_seq_hook_set_nr() hook for the sx50. I've only been able to trigger the error with the sx50 and sx260, but not with the G1X. So it's probably worth it to look closely at capt_seq_hook_set_nr() in cameras that trigger the bug.

http://chdk.setepontos.com/index.php?topic=8932.msg99906#msg99906
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

Re: help: how to debug script interrupting?
« Reply #138 on: 29 / April / 2013, 11:10:12 »
waterwingz has discovered errors in the capt_seq_hook_set_nr() hook for the sx50. I've only been able to trigger the error with the sx50 and sx260, but not with the G1X. So it's probably worth it to look closely at capt_seq_hook_set_nr() in cameras that trigger the bug.
I think its more likely a cut&paste porting error than related to this.  After all, the SX130 was were this first came up and I believe dark frame subtraction works there.

The SX50 bug basically means that you can't disable dark frame subtraction with the code the way it is,  and it actually crashes the camera in "Sports" mode.

I submitted a patch to disable it (no big deal as it doesn't work - just causes crashes) while I dig deeper.  Either the nrflag is wrong in the sigfinder or I'm going to need to go deeper into the subroutines to find a better place to patch it.
Ported :   A1200    SD940   G10    Powershot N    G16

*

Offline reyalp

  • ******
  • 14128
Re: help: how to debug script interrupting?
« Reply #139 on: 29 / April / 2013, 12:49:54 »
It's possible that the bug is caused by an interrupt that doesn't restore the registers correctly.
Yes, variations of this have been on my mind for a long time, but then the question is: how does it know to only do it in this one part of the Lua code? If there were a general problem with interrupts (given that the Canon firmware doesn't use thumb it isn't totally implausible that some corner case makes interrupts in thumb code misbehave) it should be wreaking havoc all over the place. I'm quite certain that more execution time is spent elsewhere (the inner guts of the Lua code, for example) and a whole lot of it would misbehave if the return values randomly changed values.

IMO there has to be something very specific about luaD_rawrunproctected. This is why I keep coming back to setjmp, but I don't see how. Moreover, whatever it is isn't very sensitive to changes such as adding debug code in various places, different compilers, Lua in a module or not etc.
Quote
Since the errors only occur while shooting,
This isn't certain. Alarik reported (earlier in this thread) that a simple sleep script would encounter the same problem. He had 60 cameras to check in parallel, so he may have seen it even if the rate of occurrence was a lot lower.
Quote
It's possible that this interrupt happens during a normal, 10 msec tick count interrupt.
kbd_task is a task that's scheduled like any other, it's not strictly tied to an interrupt every 10ms.
Don't forget what the H stands for.

 

Related Topics


SimplePortal © 2008-2014, SimplePortal