It happened pretty soon (46 cycles). Without looking into details, is it normal for the script to stop on this condition?
If it hits the bug, it should continue, just incrementing the bork_count global (shown as b in the misc debug vals) and writing res1.dmp.
If the script died, then it isn't the same as what I've been seeing.
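For reference, a minimal sketch of that recovery path (bork_count and res1.dmp are the names from above; the detection test and write_dump are assumptions on my part, not the actual code):

status = lua_resume(L, nargs);
if (status > LUA_ERRERR) {
    /* 0 = finished, LUA_YIELD = yielded, LUA_ERR* = a real error;
       anything else is the bogus value from the bug */
    bork_count++;            /* shows up as "b" in misc debug vals */
    write_dump("res1.dmp");  /* assumed helper: dump the debug state */
    /* then carry on instead of treating the script as dead */
}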
I wonder whether one of the following could have something to do with this issue:
- a bug in a system (Canon) function. There have been lots of "Canon bugs" reported on the ML forum lately (I don't know what caused them to surface; I've only seen the reports).
- the task in whose context Lua runs blocks some other task, which then fails to update something that "we" rely on. I already tried to raise this issue before (as "gcc optimization"); a sketch of that class of problem follows below.
I'm open to both these options, but I don't see any specific mechanism or theory to test.
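The "gcc optimization" concern, in its classic form, looks like this (a generic illustration, not code from the Lua module):

/* a value that another task (or an ISR) updates */
extern int shared_flag;

void wait_for_flag(void)
{
    /* without volatile, gcc is allowed to load shared_flag once and
       spin on the cached value forever; declaring it volatile forces
       a fresh load on every iteration */
    while (!shared_flag)
        ;
}

If the task that was supposed to set the flag is itself blocked, the effect looks the same from this side: we wait on something that never changes.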
As far as I understand the bug:
In luaD_rawrunprotected:

L->errorJmp = lj.previous;  /* restore old error handler */
buflog_put("<luaD_rawrunprotected");  /* debug trace */
memcpy(&buflog.lj, &lj, sizeof(lj));  /* debug: save a copy of lj */
return lj.status;

The copy of lj saved by the memcpy here has status 0.
Yet in lua_resume:

status = luaD_rawrunprotected(L, resume, L->top - nargs);
if (status != 0) {  /* error? */

status is not 0; instead, it corresponds to the address of the instruction that loads lj.status into r0. For example, with

26fd9c: 6e20 ldr r0, [r4, #96] ; 0x60

the status seen in lua_resume is 0x26fd9c.
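Putting the two observations side by side as a hypothetical check (this comparison isn't in the actual code; the concrete address is from the example build above):

status = luaD_rawrunprotected(L, resume, L->top - nargs);
if (status != buflog.lj.status) {
    /* when the bug hits: buflog.lj.status == 0, while status holds
       0x26fd9c, the address of the very ldr that was supposed to put
       lj.status into r0 */
}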
If this were a compiler bug, it should show up in the disassembly. Nothing stands out, and it's very hard to see how the same code could work a million times and then fail.
If it's a Canon bug, then it has to be something asynchronous, like interrupts or task context switching. But if those are broken, why does it only happen in this one spot?
We know it's not corruption of the code itself, because if we ignore the bogus return value, the script can run thousands more iterations without problems.
One thing I did notice which seems almost plausible:
The return sequence of luaD_rawrunprotected (in my builds and the autobuild) looks like this:
27ccb8: ab03 add r3, sp, #12
27ccba: b01c add sp, #112 ; 0x70
27ccbc: 6e18 ldr r0, [r3, #96] ; 0x60
27ccbe: bc10 pop {r4}
27ccc0: bc02 pop {r1}
27ccc2: 4708 bx r1
If something like an interrupt handler used the kbd_task stack (pushing stuff onto it and restoring it later) between the add sp and the ldr, it could clobber the value that gets loaded into r0. The window is real: after the add, SP sits at old_sp+112, while the ldr reads from r3+96 = old_sp+12+96 = old_sp+108, i.e. a slot just 4 bytes below the live SP, squarely inside the region that any push onto this stack would overwrite first. The bogus value seen earlier, the address of the ldr itself, is at least consistent with something saving a return address there.

However, ARM does not generally work this way: interrupt handlers get their own independent banked copy of SP, and the fact that the compiler generates code like this suggests it's legal in normal ARM environments. Given that the compiler does generate it, we'd also expect the same pattern to occur elsewhere; browsing the disassembly of the Lua module, it doesn't seem terribly common. I thought it might be special because of the volatile on lj.status, but taking that out didn't appear to affect the generated code.
I think this is very unlikely, but it's possible to test by saving a copy of the stack above SP right after the return to lua_resume (see the sketch below). I'm trying to run this now, but haven't triggered the bug with that code yet.
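A minimal sketch of that test, assuming it goes into lua_resume right after the call; SNAP_BYTES, stack_snap, current_sp and snap_stack are all made-up names, and the code that dumps the snapshot when the bug hits is left out:

#include <stdint.h>

#define SNAP_BYTES 128  /* enough to cover the callee's 0x70-byte frame */

static uint8_t stack_snap[SNAP_BYTES];

/* read the current stack pointer */
static inline uint8_t *current_sp(void)
{
    uint8_t *sp;
    __asm__ volatile ("mov %0, sp" : "=r" (sp));
    return sp;
}

/* save the just-freed region, i.e. the bytes numerically below SP
   (the "stack above SP" in the sense used above); copied by hand,
   and kept inline, so that taking the snapshot doesn't itself push
   a frame into the region being saved */
static inline void snap_stack(void)
{
    uint8_t *src = current_sp() - SNAP_BYTES;
    int i;
    for (i = 0; i < SNAP_BYTES; i++)
        stack_snap[i] = src[i];
}

/* usage, in lua_resume:
 *   status = luaD_rawrunprotected(L, resume, L->top - nargs);
 *   snap_stack();         // before anything else can touch that region
 *   if (status != 0) ...  // on a bogus status, dump stack_snap
 */

If the bogus status then shows up, stack_snap should show whether the freed slot was overwritten, and with what.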