Shot Histogram Request - page 24 - CHDK Releases - CHDK Forum

Shot Histogram Request

  • 467 Replies
  • 148403 Views
*

Offline lapser

  • *****
  • 1093
Lua Resume Bug Strikes Again
« Reply #230 on: 08 / April / 2013, 00:07:05 »
Advertisements
I thought Phil's update to Lua had fixed the resume bug. However, I triggered the bug again tonight. It seems to happen as it gets dark and the shutter time increases. Here's what it printed:

ERROR: NULL error message
*** TERMINATED ***
Lapser:     -9 3209878     0     0

The "Lapser" line is debug variables I defined. Only the first one is significant here.

lapser1 = -9

This is a lucky break, because the only time lapser1 is -9 is in luascript.c here:

Code: [Select]
// Wait for a button to be pressed and released (or the timeout to expire)
static int action_stack_AS_LUA_WAIT_CLICK()
{
lapser1=-9;
    // Check key pressed or timeout
    if ((get_tick_count() >= action_top(2)) || camera_info.state.kbd_last_clicked)
    {
        // If timed out set key state to "no_key", otherwise key pressed so set last checked time
        if (!camera_info.state.kbd_last_clicked)
            camera_info.state.kbd_last_clicked=0xFFFF;
        else
            camera_info.state.kbd_last_checked_time = camera_info.state.kbd_last_clicked_time;
lapser1=9;
        action_pop_func(1);
        return 1;
    }
    return 0;
}
So it looks like action_stack_AS_LUA_WAIT_CLICK() didn't return, or lapser1 would be 9, not -9. Interesting!

If this is true, it may be that the function, get_tick_count() never returned (EDIT or messed up the stack somehow?). I suspect that it's not thread safe. Here's the code for get_tick_count() in platform/generic/wrappers.c

Code: [Select]
long get_tick_count()
{
long t;
#if !CAM_DRYOS
    _GetSystemTime(&t);
    return t;
#else
    return (int)_GetSystemTime(&t);
#endif
}

I added a delay in capt_seq_hook_set_nr() that includes this code:

Code: [Select]
      tick=get_tick_count();
      while(tick<shot_next)
      {
        msleep(10);
        tick=get_tick_count();
      }
My theory is that there's a hard interrupt during the call to get_tick_count() above, and then get_tick_count() is called in wait_click(10) and messes up the stack.

But it looks like I could change my wait loop to avoid get_tick_count() like this:

Code: [Select]
      tick=get_tick_count();
      if(tick<shot_next)
      {
        msleep(shot_next-tick);
        tick=shot_next;
      }
I was using msleep(10) so I cold abort the wait by setting shot_next to 0 when I exit the program, but I can minimize the calls to get_tick_count() if that's will fix the bug.

So this may not be a "lua yield" bug after all. It may be that get_tick_count() is called from different threads but isn't thread safe.

So if this is the cause of the bug, I may be able to trigger it more reliably by calling get_tick_count() in a tight loop many times, in both threads. I'll try it and see what happens.
====
[EDIT 2]
The SX260 triggered the same bug about an hour later:

ERROR: NULL error message
*** TERMINATED ***
Lapser:     -9 3146878     0     0

The second number is a little different and looks like an address. Here's where "lapser2" comes from:

Code: [Select]
int lua_script_run(void)
{
//snip
  Lres = lua_resume( Lt, top );
  lapser2=Lres;
Here's are the values for lapser2:

SX50:   3209878  0x30FA96
SX260: 3146878  0x30047E
« Last Edit: 08 / April / 2013, 01:04:20 by lapser »
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

*

Offline reyalp

  • ******
  • 14126
Re: Shot Histogram Request
« Reply #231 on: 08 / April / 2013, 00:59:18 »
I don't understand how get_tick_count() not being thread safe could cause the symptom.

The "null error message" means that problem happened while lua was executing, inside lua_resume. Of course,  the lua state could have got messed up before the call to lua_resume.

If get_tick_count() didn't return, then the entire kbd_task would be hung, and lua would never resume, or (if it was the call inside luaCB_wait_click) lua_resume would never return.

unless I'm mistaken, lapser1 would stay -9 if the return 0 (timeout) case were hit.

Looking at the implementation of  _GetSystemTime (on D10) I don't see how it would not be thread safe. It's a very simple function, it basically just disables interrupts, reads a variable, restores interrupts and returns.

There is an oddity in luaCB_wait_click though, it returns 1 even though it pushes nothing on the Lua stack. That is wrong, although I suspect it is harmless.

edit: I've fixed a few instances of that mistake in the trunk and release branches.
« Last Edit: 08 / April / 2013, 01:42:00 by reyalp »
Don't forget what the H stands for.

*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #232 on: 08 / April / 2013, 01:45:14 »
unless I'm mistaken, lapser1 would stay -9 if the return 0 (timeout) case were hit.
  lapser1=9;
is right before
  action_pop_func(1);

It should have stayed in the action stack loop and not called resume. So something else must have moved the stack pointer, I think?
Quote
Looking at the implementation of  _GetSystemTime (on D10) I don't see how it would not be thread safe. It's a very simple function, it basically just disables interrupts, reads a variable, restores interrupts and returns.
That's good to know that _GetSystemTime is threadsafe (disables interrupts). However, I don't think the wrapper is threadsafe, since it could be interrupted and called again from another thread.

  long t;
  return (int)_GetSystemTime(&t);
is actually 2 returns. If an interrupt occurs between the returns, that could be a problem. If the stack pointer were in the wrong place because of it, it's possible that it would effectively do the action stack pop thinking it was popping variable "t".
Quote
There is an oddity in luaCB_wait_click though, it returns 1 even though it pushes nothing on the Lua stack. That is wrong, although I suspect it is harmless.
I think Phil changed the meaning of the 1 return when he re-worked action stack.
[EDIT] Oh, you're talking about luaCB_wait_click(). Yes I noticed that too, but I figured Phil knew what he was doing  :D

In any event, this suggests a way try to trigger the bug with repeated calls to get_tick_count() from 2 different "tasks". The hardest part about this bug is that it only happens every 10,000 pictures!

One thing that bothers me is that the bug also happened with coroutine.yield(), as I recall. I don't think that calls get_tick_count(), but get_tick_count() might be called somewhere else in the action stack code. But I think having the action stack abort abnormally and call resume() might explain the strange error code.

Thank you for your help.
« Last Edit: 08 / April / 2013, 01:53:12 by lapser »
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

*

Offline reyalp

  • ******
  • 14126
Re: Shot Histogram Request
« Reply #233 on: 08 / April / 2013, 01:52:54 »
Per http://www.lua.org/manual/5.1/manual.html#lua_resume
lua resume is supposed to return a limited number of constants
Code: [Select]
/* thread status; 0 is OK */
#define LUA_YIELD 1
#define LUA_ERRRUN 2
#define LUA_ERRSYNTAX 3
#define LUA_ERRMEM 4
#define LUA_ERRERR 5
You might check whether the values you got are within the CHDK binary or heap space.

If you want to try to drill down to where the bad status is coming from, ldo.c lua_resume would be the place. I mentioned back here http://chdk.setepontos.com/index.php?topic=8997.msg97669#msg97669 that I traced it into a particular case in this function (assuming Alarik's problem was the same as yours).
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3450
    • Photos
Re: Shot Histogram Request
« Reply #234 on: 08 / April / 2013, 01:53:08 »
While I doubt it's an issue of 'thread safeness' there does seem to be something strange here.

Just to clarify, the line:
    Lapser:     -9 3209878     0     0
is being displayed in the console after the script terminates, correct?

The script should have been suspended until action_stack_AS_LUA_WAIT_CLICK finished (after the action_pop_func call).

So I want to double check that the '-9' being displayed is not a result of the simply not updating the display/log after the script has ended.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #235 on: 08 / April / 2013, 01:58:50 »
The script should have been suspended until action_stack_AS_LUA_WAIT_CLICK finished (after the action_pop_func call).

So I want to double check that the '-9' being displayed is not a result of the simply not updating the display/log after the script has ended.
Yes, that's what caught my attention. Here's where I print lapser1:

Code: [Select]
int lua_script_error(lua_State *Lt,int runtime)
{
//snip

    script_console_add_line(LANG_CONSOLE_TEXT_TERMINATED);

    char osd_buf[50];
sprintf(osd_buf, "Lapser: %6d%6d%6d%6d",lapser1,lapser2,lapser3,lapser4);
script_console_add_line((long)osd_buf);


script_print_screen_statement(0); //close print file
    shot_histogram_set(0); //disables meters and interval too

    return SCRIPT_RUN_ERROR;
}

EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

*

Offline reyalp

  • ******
  • 14126
Re: Shot Histogram Request
« Reply #236 on: 08 / April / 2013, 01:59:51 »
That's good to know that _GetSystemTime is threadsafe (disables interrupts).
It's not thread safe because it disables interrupts. It's thread safe because it only reads globals and uses the stack. Even if some other thread was changing the globals underneath it, that wouldn't cause the code to malfunction.

Quote
However, I don't think the wrapper is threadsafe, since it could be interrupted and called again from another thread.

  long t;
  return (int)_GetSystemTime(&t);
is actually 2 returns. If an interrupt occurs between the returns, that could be a problem. If the stack pointer were in the wrong place because of it, it's possible that it would effectively do the action stack pop thinking it was popping variable "t".
This is nonsense. Each task has it's own stack, and the OS handles switching them. t is a stack variable, so every thread gets it's own, although if you look at the dryos version of _GetSystemTime, it doesn't even use t ;) (edit: well, the D10 version doesn't, some others might...)

Quote
I think Phil changed the meaning of the 1 return when he re-worked action stack.
No, phil did not change the semantics of lua C functions. In a lua C function, (one that is called from lua, meaning all the luaCB functions in the CHDK API), the return value is the number of values that function returns in Lua. The actual values are pushed on the Lua stack.
« Last Edit: 08 / April / 2013, 02:06:10 by reyalp »
Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 14126
Re: Shot Histogram Request
« Reply #237 on: 08 / April / 2013, 02:16:46 »
To me, the fact that Lres (=lapser2) contains a nonsense value (which looks like an address) should be a big hint. The lua state is getting corrupted somewhere, which matches the strange debug dumps I got from Alarik. get_tick_count is a red herring, IMO.
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3450
    • Photos
Re: Shot Histogram Request
« Reply #238 on: 08 / April / 2013, 02:30:24 »
The only way I can see that lapser1 could be displayed as -9 is if the error occurred calling luaCB_wait_click.

The luaCB_wait_click function calls action_stack_AS_LUA_WAIT_CLICK which will set lapser1 to -9.
Since action_stack_AS_LUA_WAIT_CLICK would have returned 0, then luaCB_wait_click must have called return lua_yield(L,0).
The Lua error must have then been triggered from this return - otherwise action_stack_AS_LUA_WAIT_CLICK would have taken over the action stack and the script would have been suspended.

Phil.

CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #239 on: 08 / April / 2013, 02:49:36 »
The only way I can see that lapser1 could be displayed as -9 is if the error occurred calling luaCB_wait_click.

The luaCB_wait_click function calls action_stack_AS_LUA_WAIT_CLICK which will set lapser1 to -9.
Since action_stack_AS_LUA_WAIT_CLICK would have returned 0, then luaCB_wait_click must have called return lua_yield(L,0).
The Lua error must have then been triggered from this return - otherwise action_stack_AS_LUA_WAIT_CLICK would have taken over the action stack and the script would have been suspended.

Phil.
Yes, that makes sense. I'm calling wait_click(10), so it's probably just yielding right away? Anyway, the bug also occurs with sleep(10) and coroutine.yield(), so I guess we haven't learned much new.

Sorry for the false alarm, but I thought maybe we had something. I'll have to keep using wait_shot_ready() and avoid yield() as much as possible until we figure it out, I suppose.
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

 

Related Topics


SimplePortal © 2008-2014, SimplePortal