help: how to debug script interrupting? - page 15 - General Discussion and Assistance - CHDK Forum  

help: how to debug script interrupting?

  • 224 Replies
  • 43462 Views
*

Offline reyalp

  • ******
  • 13619
Re: help: how to debug script interrupting?
« Reply #140 on: 29 / April / 2013, 15:53:16 »
Advertisements
A small update. The patch attached below shows how many times the bug has been hit in the misc debug vals, item labeled "B"

Running this on D10, with shooting enabled, it hit the bug around 612 shots, and seems to have continued fine. Currently at over 1200 shots

Running on a540, shooting disabled it's gone over 2100 cycles without hitting the bug.

edit:
Update
D10 has hit the bug and successfully ignored it 6 time in ~5600 shots.
A540 has done >7200 cycles without hitting the bug.

edit:
A540 ran 10000 no-shoot cycles without triggering the bug. Total run time was just a bit over 12 hours.
From this I concluded that shooting is required or greatly increases the chances of hitting the bug, or A540 is not subject to the problem. I'll try running the a540 with shooting enabled tomorrow.

Lua memory usage bounced around from 53kb to 97kb. The 97kb peak occurred on the 13th shot, while the last instance of 53k was on the 9558th shot, so I'd say garbage collection is working as expected.

D10 is still running currently at > 7500 shots and 8 instances of the bug ignored.
« Last Edit: 30 / April / 2013, 02:00:39 by reyalp »
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3376
    • Photos
Re: help: how to debug script interrupting?
« Reply #141 on: 30 / April / 2013, 04:24:08 »
Running this on D10, with shooting enabled, it hit the bug around 612 shots, and seems to have continued fine. Currently at over 1200 shots

Running on a540, shooting disabled it's gone over 2100 cycles without hitting the bug.

edit:
Update
D10 has hit the bug and successfully ignored it 6 time in ~5600 shots.
A540 has done >7200 cycles without hitting the bug.


Was the Lres value consistent in each case?

Quote

edit:
A540 ran 10000 no-shoot cycles without triggering the bug. Total run time was just a bit over 12 hours.
From this I concluded that shooting is required or greatly increases the chances of hitting the bug, or A540 is not subject to the problem. I'll try running the a540 with shooting enabled tomorrow.

I'd run the D10 without shooting since you know you can trigger the problem on that camera with some consistency.

This is good - at least some ability to reproduce the problem might help track it down.

Since you can ignore the problem and it doesn't happen again for a while then it's not likely the Lua code has been altered in memory.

Any additional clues in the dumps? Perhaps it's time to add the full memory dump.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)

*

Offline reyalp

  • ******
  • 13619
Re: help: how to debug script interrupting?
« Reply #142 on: 30 / April / 2013, 16:56:55 »
Final result for the D10 was 8 errors in 10k shots. I haven't had time to analyze much yet.

One reason I let it run all the way through is to get some statistical sense of how long we'd need to run to prove we fixed the problem. Interestingly, the last stretch was from at least 6390 or less to 10k without an error, indicating that it can go up to at least 3500 shots or so without hitting it. Up to that point, it was averaging closer to 1 error in 750 shots. The late part of the run had lower light and cooler temperatures, which might be a factor I guess.

I should be able to work out exactly which exposures the errors occurred in from the dumps.

Quote
Was the Lres value consistent in each case?
I haven't looked at the dumps yet, but the check to ignore the value was
Code: [Select]
if(status > (int)luaD_rawrunprotected && status < (int)lua_resume) {
So it's safe to say it's very similar to the ones we've seen in the past. We could narrow this a lot more...

I started the A540 shooting today, it's gone 3900 without hitting the error.

edit:
All the Lres values as before, pointing to the instruction that loads the return value.

Exact times and exposure counts for the errors
Code: [Select]
res1
time 2013:04:29 11:25:33 (1367234733) tick  4584060
== photo: 587 ==
Mon Apr 29 11:25:31 2013 (1367234731)
 -presd shoot_half:4583020
 - start sleep:     4584020
 - test get_sht:    4584180


res2
time 2013:04:29 13:15:22 (1367241322) tick 11173960
== photo: 1448 ==
Mon Apr 29 13:15:21 2013 (1367241321)
 -presd shoot_half:11172870
 - start sleep:     11173920
 - test get_sht:    11174120

res3
time 2013:04:29 15:02:09 (1367247729) tick 17580140
== photo: 2344 ==
Mon Apr 29 15:02:08 2013 (1367247728)
 -presd shoot_half:17579990
 - test get_sht:    17580130
 - start sleep:     17580290


res4
time 2013:04:29 16:11:08 (1367251868) tick 21719930
== photo: 3135 ==
Mon Apr 29 16:11:07 2013 (1367251867)
 - shoot:           21719860
 - start counter:   21722290

res5
time 2013:04:29 17:19:27 (1367255967) tick 25818250
== photo: 4066 ==
Mon Apr 29 17:19:25 2013 (1367255965)
 - shoot:           25818190
 - start counter:   25820620

res6
time 2013:04:29 19:00:25 (1367262025) tick 31876110
== photo: 5448 ==
Mon Apr 29 19:00:24 2013 (1367262024)
 -presd shoot_half:31875470
 - start sleep:     31876030
 - test get_sht:    31876210

res7
time 2013:04:29 20:03:09 (1367265789) tick 35640380
== photo: 6127 ==
Mon Apr 29 20:03:08 2013 (1367265788)
-presd shoot_half:35639660
 - start sleep:     35640320
 - test get_sht:    35640490

res8
time 2013:04:29 20:26:19 (1367267179) tick 37030320
== photo: 6311 ==
Mon Apr 29 20:26:18 2013 (1367267178)
 - start sleep:     37030240
 - test get_sht:    37030420
First line after resN is from the dump, time and tick count
The remaining stuff is from the script log, showing where in the sequence the error occurred

Note for the ones in shoot_half, I've snipped out start sleep/ test lines after the -presd shoot_half, the errors are *not* usually occurring on the first sleep.

Initial observation is that mostly occurred in shoot_half (indicating the error should probably happen in no-shoot) but some happened in shooting as well.

Also, the number of shots to error varied from 184 to > 3689.

Aside, I'm very happy I put my dump decoding code in Lua:
Code: [Select]
!d={};for i=1,8 do d[i]=m.load('RES'..i..'.DMP');end
!for i,r in ipairs(d) do print(r:user_data_str():gsub('.*(lua_resume %x+\nignored).*','%1')) end

A540 is at > 4700 shot.
« Last Edit: 30 / April / 2013, 17:39:53 by reyalp »
Don't forget what the H stands for.

*

Offline srsa_4c

  • ******
  • 4437
Re: help: how to debug script interrupting?
« Reply #143 on: 30 / April / 2013, 17:34:38 »
Just would like to note that I left an a3200 running for 16 hours in play mode with a simple script (wait for keypress loop), no error. Also had a test run (24 hours) on an a410 with outslider's script (I used the remote capture branch to suppress JPEGs), no errors either...


*

Offline reyalp

  • ******
  • 13619
Re: help: how to debug script interrupting?
« Reply #144 on: 30 / April / 2013, 22:29:20 »
Just would like to note that I left an a3200 running for 16 hours in play mode with a simple script (wait for keypress loop), no error. Also had a test run (24 hours) on an a410 with outslider's script (I used the remote capture branch to suppress JPEGs), no errors either...
Just FWIW, the error occurs lua_resume at the point that the underlying code has yielded (or otherwise completed), so all else being equal, maximizing the number of yield/resume should trigger it more often. I'm not convinced all else is equal though, because I was never able to reproduce it on my D10 with a plain sleep(10) loop.

You could take this further by making something that just yields/resumes in a loop without waiting for a kbd_task iteration (or, say 1000 times per kbd_task iteration).

Per the information in the earlier post, it most frequently occurred in repeat sleep(50) until get_shooting() loop. It would be interesting to see how much changing the the sleep(50) to a sleep(10) might increase the frequency.
Don't forget what the H stands for.

Re: help: how to debug script interrupting?
« Reply #145 on: 30 / April / 2013, 22:36:36 »
Just FWIW, the error occurs lua_resume at the point that the underlying code has yielded (or otherwise completed), so all else being equal, maximizing the number of yield/resume should trigger it more often.
@reyalp :  if you can post a reference script that does this,  I'll bet you can get a whole lot of different people with different cameras to try it and report back. 

Sort of like seti@home for CHDK. 

I've got four cams (propset 2,3,4&5) ready to go.

WW.
Ported :   A1200    SD940   G10    Powershot N    G16

*

Offline reyalp

  • ******
  • 13619
Re: help: how to debug script interrupting?
« Reply #146 on: 30 / April / 2013, 23:10:47 »
A few more items
A540 completed 10k shots without hitting error

Going through the buflogs, I've identified why the camera yielded for each error.
1, 2, 6,7,8 were due to sleep()
3 was the yield hook (note, some day we should make the yield hook count reset on explicit yields...)
4, 5 were in shoot()

Bottom line, it doesn't appear to matter how we got to yield.
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3376
    • Photos
Re: help: how to debug script interrupting?
« Reply #147 on: 01 / May / 2013, 00:02:08 »
3 was the yield hook (note, some day we should make the yield hook count reset on explicit yields...)

It already gets reset in lua_script_run when the script resumes.

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)


*

Offline reyalp

  • ******
  • 13619
Re: help: how to debug script interrupting?
« Reply #148 on: 01 / May / 2013, 00:28:51 »
@reyalp :  if you can post a reference script that does this,  I'll bet you can get a whole lot of different people with different cameras to try it and report back. 
As I said, I suspect there's other factors. However, I've attached a version of outsliders script that lets you set the sleep interval. Setting it to 10 or less should give you the maximum number of sleeps.

Based on what I've seen from the D10, you probably don't need to actually shoot.

Note that if you have logging enabled, it will be *very* spammy, with 2 logs lines for each sleep. Since it flushes ever log line, this will also slow it down noticeably.

If you just want to see if the error happens, logging shouldn't be needed, unless something about the bug is connected to SD activity.

For other setting, I'd suggest delay 1, mode accurat
It already gets reset in lua_script_run when the script resumes.
So it is. I guess that means that the tick count check was hit. Since the default limit is 10ms, it's possible to enter lua_resume just before it increments.
Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 13619
Re: help: how to debug script interrupting?
« Reply #149 on: 01 / May / 2013, 01:13:17 »
Updated version of the script adds two settings

buffer log - only writes the log once per shoot cycle. This is only good for debugging if you are running the patch that ignores the error, otherwise you will lose the interesting part of the log.

verbose log - if this is turned off, the individual sleep calls when waiting for get_shooting are not logged. Again, less useful if you aren't ignoring the error.
Don't forget what the H stands for.

 

Related Topics