Shot Histogram Request - page 29 - CHDK Releases - CHDK Forum

Shot Histogram Request

  • 467 Replies
  • 128399 Views
*

Offline reyalp

  • ******
  • 14080
Re: Shot Histogram Request
« Reply #280 on: 13 / April / 2013, 16:47:45 »
Advertisements
I did another test this morning with the sx260 (2 lua threads) and sx50 (1 lua thread). The sx260 worked normally, but the sx50 crashed (power off, lens extended).
One thing that might be useful is to keep a detailed list of how long each build ran and whether it hit the bug.

Quote
One thing interrupts do is use extra stack space. I added a function call in capt_seq_hook_set_nr() that use stack, as well as a function call in raw_savefile(). If an interrupt happened at a spot when CHDK code was deeper into the stack for that task than the camera ever gets without CHDK, that might cause a stack overflow and trash whatever memory is next to the stack for that task. I don't know the internals, so maybe you can think of a better idea.
This is possible, but the we modify the kbd_task to have a much larger stack. In general I would say if you are running simple lua code it's very unlikely you would use the whole 8K of stack. An interrupt would basically just push the registers on the stack, which is roughly ~ 128 bytes. If this were the cause, it seems quite unlikely it would cause exactly the same symptom across many different builds, scripts and cameras. There should be enough variation that stack corruption would cause all kinds of different crashes, hangs etc. One extra stack variable or function call in the entire stack should change the behavior significantly.

This problem does reek of memory corruption *somewhere* but if so, it has to be very tightly constrained to some specific code / offset. The address identified philmoz earlier may be a clue.
Quote
I also compiled with your modules makefile and added lua.elf.dumpobj to the zip. That made it too big to attach, so it's posted here:
Can you clarify whether this objdump is from the same elf that was actually running on the camera? If it wasn't built from exactly the same source code, then the value is limited.

Note that in the future you can just post the .elf (which is much smaller) and I can disassemble it myself.
Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 14080
Re: Shot Histogram Request
« Reply #281 on: 13 / April / 2013, 16:55:30 »
The romlog is the fsionotify thing, which means we tried to have too many file handles open at once.

Also, I should have requested core/main.elf instead of main.bin. If I disassemble main.bin, I don't get symbols which will makes things a lot harder to follow. My bad :(

edit:
where did the "TLUARES1.DMP" name come from? My code should only create LUARES*, without a T on the front. If this got manually renamed in copying, that's fine. If it wasn't, we have another mystery....

edit:
Looking at the romlog, it appears that the second dump with the execution trace was the one that triggered the assert (0x0030D57D <- lua + 43f5 - inside dbg_dump_write). Bummer.

One odd thing in the ROMLOG, there are several bunch of instances of 0x00310C6E which is lua+ 7AE6. This is very close to the lres result value Phil noted earlier. I've attached my annotated romlog.
« Last Edit: 13 / April / 2013, 17:44:50 by reyalp »
Don't forget what the H stands for.

*

Offline reyalp

  • ******
  • 14080
Re: Shot Histogram Request
« Reply #282 on: 13 / April / 2013, 17:55:03 »
I have to do some other stuff today, but it would be helpful if I could build the code that you are actually going to run, so we don't have so much back and forth trying to get the right dumps and elfs.

If you can provide your current code, I can work off of that. Or if you can run stock CHDK for testing, I can provide builds of that.
Don't forget what the H stands for.

*

Offline philmoz

  • *****
  • 3450
    • Photos
Re: Shot Histogram Request
« Reply #283 on: 13 / April / 2013, 20:02:50 »
To be accurate the source code needs to match what you used when you created the dump files.
Hmm, I'm not confident I can do that. I've also added reyalp's new patch since then. However, I do have some new data.

I did another test this morning with the sx260 (2 lua threads) and sx50 (1 lua thread). The sx260 worked normally, but the sx50 crashed (power off, lens extended).

One thing interrupts do is use extra stack space. I added a function call in capt_seq_hook_set_nr() that use stack, as well as a function call in raw_savefile(). If an interrupt happened at a spot when CHDK code was deeper into the stack for that task than the camera ever gets without CHDK, that might cause a stack overflow and trash whatever memory is next to the stack for that task. I don't know the internals, so maybe you can think of a better idea.

I created a zip with the romlog from the sx50 and the TLUARES1.DMP file. The time stamp on this file is the same as the last jpg picture before the crash. I had the backlight off with my backlight patch, which put a lot of turnbacklightoff calls into the romlog, so that's not the problem.

I also compiled with your modules makefile and added lua.elf.dumpobj to the zip. That made it too big to attach, so it's posted here:

http://www.adrive.com/public/b7bfgJ/ReyalpLogs_130413.zip



Do you have your script log file from this crash so we can see the Lres value?

Phil.
CHDK ports:
  sx30is (1.00c, 1.00h, 1.00l, 1.00n & 1.00p)
  g12 (1.00c, 1.00e, 1.00f & 1.00g)
  sx130is (1.01d & 1.01f)
  ixus310hs (1.00a & 1.01a)
  sx40hs (1.00d, 1.00g & 1.00i)
  g1x (1.00e, 1.00f & 1.00g)
  g5x (1.00c, 1.01a, 1.01b)
  g7x2 (1.01a, 1.01b, 1.10b)


*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #284 on: 13 / April / 2013, 20:24:50 »
Also, I should have requested core/main.elf instead of main.bin.

where did the "TLUARES1.DMP" name come from?
Thanks for all your help. Yes, the files come from the same SX50 build that was running when the crash happened. Here's the main.elf file from that build:
http://www.adrive.com/public/tvG7fW/main.elf

I added the T in "TLUARES1.DMP" with the file browser so it wouldn't be overwritten.  Then I restarted the camera and script to continue the test. So at least one mystery is solved (sorry, I forgot about doing that.)

When the script is running, the camera is saving pictures very rapidly and the script is writing the log file at the same time. I print out the time that the print(...) statement takes, and it's usually 20 msec. Sometimes, though, it's much longer, 1000 msec or more even. So it's possible writing the dump files could overload the file system since it's already at maximum capacity.

I wonder if this has anything to do with the error.I don't recall ever triggering the error with logging off, but it would be hard to tell what stopped the script without the log. I wonder how Lua handles a print(...) statement when it takes 1000 msec or more to return, and if that would cause a problem.

I've attached the script log that was being written when the camera crashed. Interestingly, the last picture file number in the log was the last picture written to disk before the crash. Usually, the log misses a few at the end with a camera crash because the log file is never closed. It looks like writing the .DMP file must have flushed the print log.

@Phil, the script log is attached, but it crashed before it printed the lres value.


EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #285 on: 14 / April / 2013, 14:31:43 »
I added an option to the script to use wait_click(10) or wait_shot_ready(). wait_shot_ready() avoids the multiple, rapid lua yield/resume() which triggers the bug, so I'll use it unless I'm doing a bug test.

Don't get bogged down with this if you have other things to do. I think we may need to find a way to trigger the bug more often, and on your cameras, if we're going to figure this one out.
=====
I updated the script to improve the "glitch detector" that ignores a large meter change until it happens twice in a row. I've had glitches for some reason, in my long night time lapses that cause a flash for no apparent reason.  The meter reading is off on one shot even though the picture looks the same as the last one.
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #286 on: 16 / April / 2013, 02:10:05 »
I finally got the SX50 to trigger the bug again. This time, I set the interval to 1.5 seconds instead of 1 second, so the file system wasn't as busy saving pictures, and it didn't crash the camera writing reyalp's logs.

So I captured reyalp's new lua log file for the first time. I've also included all the relevant files you requested before. Hopefully, this will provide some useful info this time.

http://www.adrive.com/public/BjmcjC/ReyalpLogs_130415.zip
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #287 on: 16 / April / 2013, 02:38:12 »
I think I have the brightness smoothing working really well now. Here's the latest video, taken with the G1X. Also, I haven't had the G1X trigger the Lua Bug we've been talking about, only the SX50 and SX260.

http://www.youtube.com/watch?v=ovTpGkKBFhU#ws
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos


*

Offline reyalp

  • ******
  • 14080
Re: Shot Histogram Request
« Reply #288 on: 17 / April / 2013, 00:37:53 »
So I captured reyalp's new lua log file for the first time. I've also included all the relevant files you requested before. Hopefully, this will provide some useful info this time.
This captures the trace log. It's doesn't solve the mystery, but it certainly confirms the weirdness
Code: [Select]
>lua_resume
>luaD_rawrunprotected
>resume
resume yield
>lua_yield
<lua_yield
<resume
<luaD_rawrunprotected
lua_resume 310252
0x310252 is the Lres value (confirmed in the script log). "resume yield" means resume is resuming from a yield (rather than starting the script).

Taking the lua load address as 0x309178, this makes the lres 0x70da in the lua module, which is not the same as before, but not far off. According to lua.elf.objdump, this would correspond to
Code: [Select]
000070d8 <luaD_reallocstack>:
    70d8: b5f0      push {r4, r5, r6, r7, lr}
    70da: 6a03      ldr r3, [r0, #32]
This it doesn't seem like 70da would end up on the stack as a return address (and it should be thumb if it was) or as a pointer in the code. Note reallocstack refers the Lua stack, not the C stack.

Very strange.

I've have some ideas for further debugging, but am short on time at the moment.
Don't forget what the H stands for.

*

Offline lapser

  • *****
  • 1093
Re: Shot Histogram Request
« Reply #289 on: 17 / April / 2013, 01:00:28 »
Taking the lua load address as 0x309178, this makes the lres 0x70da in the lua module, which is not the same as before, but not far off.
I'm using a new trunk 2698 in this build. I think Phil made some changes in the Lua module in this trunk.
https://trac.assembla.com/chdk/changeset/2698

Did this change really save 10K when running Lua? That's great. Now I don't feel as bad for wanting to use 8K for the 12-bit histogram instead of 2K.

Before the Lua update to 5.5, the Lres value was apparently a pointer to a string on the Lua stack, because the error message would be "display" or "set", which came from the script calling is_key("display") or is_key("set") before the lua yield that triggered the bug.

====
There's no rush on this. I've modified the script to be able to use wait_click(10) to try to trigger the bug, or my new wait_shot_ready() to avoid the bug. I'm running the G1X overnight with wait_click(10) tonight. I haven't seen the bug with the G1X since the Lua update.
EOS-M3_120f / SX50_100b / SX260_101a / G1X_100g / D20_100b
https://www.youtube.com/user/DrLapser/videos

 

Related Topics