Being mostly over last week's cruddy virus, I'm putting in some work time this weekend to try to get caught up.
I'd had just about enough productivity last week to deal with the high-priority issues that had cropped up as a client was putting on a dog-and-pony show, trying to get the end customer to sign off on one of the more ambitious toasters*.
Anyway, one of the things that had turned up was a panic-and-reboot cycle that appeared in their lab and mine at the same time, and didn't appear to be related to any change I or my apprentice had made. Turned out that I'd goofed up a bit definition, such that a model-specific bit reflecting the status of the overdone-toast detector** was stepping on the generic bit indicating a major self-test failure, and we both had the signal turned up on our simulators such that the toast appeared to be burning at startup. While a burning-toast indication at startup arguably does constitute a self-test failure, this was not the behavior we wanted, and correcting the bit definition made the symptoms go away.
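For the curious, here's roughly the kind of guard that would have caught the bit collision at compile time. This is only a sketch: the mask values, bit positions, and names (STATUS_GENERIC_MASK, STATUS_TOAST_OVERDONE, and so on) are invented for illustration and are not the real definitions. The idea is to reserve one range of the status word for generic bits and another for model-specific bits, and let the compiler complain if a definition strays.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical layout: low byte is generic, high byte is per-model. */
#define STATUS_GENERIC_MASK    0x00FFu
#define STATUS_MODEL_MASK      0xFF00u

/* Hypothetical bit assignments. */
#define STATUS_SELFTEST_FAIL   (1u << 0)  /* generic: major self-test failure */
#define STATUS_TOAST_OVERDONE  (1u << 8)  /* model-specific: detector tripped */

/* The two ranges must not share any bits. */
static_assert((STATUS_GENERIC_MASK & STATUS_MODEL_MASK) == 0,
              "generic and model-specific ranges overlap");

/* Each bit must live entirely inside its own range. */
static_assert((STATUS_SELFTEST_FAIL & ~STATUS_GENERIC_MASK) == 0,
              "generic bit defined outside the generic range");
static_assert((STATUS_TOAST_OVERDONE & ~STATUS_MODEL_MASK) == 0,
              "model-specific bit defined outside the model range");

/* Returns 1 if the generic self-test-failure bit is set, else 0. */
static inline int selftest_failed(uint16_t status)
{
    return (status & STATUS_SELFTEST_FAIL) != 0;
}
```

With something like this in place, a model-specific bit defined as (1u << 0) would stop the build instead of rebooting the toaster.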
Well, I was still curious about the cause of the panic, and was much afeared that I'd be diving into the innards of the OS trying to find it. Turns out, though, that it was a silly little thing: this model also has an LCD panel to display status, which includes the current operating state (by a highly condensed name). Well... when I'd added a new state for "inoperable due to self-test failure", in support of another model, I'd updated the generic table of full-length state names, but not this model's table of condensed names. So, it was trying to use the contents of an unrelated word in flash as a string pointer, resulting in the observed dabort trap.
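A sketch of how that sort of table mismatch can be made harder to commit, assuming the states are a plain C enum. The state names, the condensed strings, and the state_short_name() helper are all made up for the example. Designated initializers tie each string to its state, a static_assert catches a table that's too short, and the lookup refuses to hand back anything it can't vouch for.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* Hypothetical operating states; STATE_COUNT must stay last. */
typedef enum {
    STATE_IDLE,
    STATE_TOASTING,
    STATE_INOPERABLE,   /* the kind of late addition that bit me */
    STATE_COUNT
} op_state_t;

/* Condensed names for the LCD, indexed by state. */
static const char *const short_names[] = {
    [STATE_IDLE]       = "IDLE",
    [STATE_TOASTING]   = "TOAST",
    [STATE_INOPERABLE] = "INOP",
};

/* Fails to compile if the last state has no entry in the table. */
static_assert(sizeof short_names / sizeof short_names[0] == STATE_COUNT,
              "short_names is out of step with op_state_t");

/* Bounds-checked lookup: "?" for out-of-range or unfilled entries. */
const char *state_short_name(int state)
{
    if (state < 0 || state >= STATE_COUNT || short_names[state] == NULL) {
        return "?";
    }
    return short_names[state];
}
```

The static_assert only notices a missing entry at the end of the table; a hole in the middle shows up as a NULL pointer, which is why the lookup checks for that as well. Either way the LCD shows a "?" rather than whatever happens to be sitting in flash.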
And so the code gradually improves. Some of the recent fixes affect other current models; others add new capabilities that will likely be of use in future models. One fix involved diving into the FPGA code again, and had me regretting designing in a Spartan 3E instead of a Spartan 6***.
Also: when I returned to the project today, the test that had been running since yesterday morning had rebooted with an Out Of Cheese Error. I updated the status monitor to include the cheese**** level, and it's been holding steady. Or, rather, flickering among 23K, 23.4K, and 23.8K bites of cheese available (cheese allocation being highly dynamic); the level not progressively diminishing, there would seem not to be a cheese leak. Could have been a secondary effect of disruption caused by EMI coupled into the wiring (a known problem with the lab setup; not so much with the actual appliance).
Update: Got the Out Of Cheese Error again (with the panic, system dump, and Redo From Start). Last iteration of the status display showed 23448 bites of cheese available, so it's something that happens suddenly, not a gradual leak. The active task at the time was http_server (it's a net-connected toaster)... which shouldn't have been allocating any large pieces of cheese. Could be the Big Cheese got all full of holes so a modest-sized slice couldn't be allocated, or maybe... hmph. I gotta figure out how to do a stack trace in ARM EABI, to find out who made the request (could have been an interrupt handler, or the task switcher, rather than the server task). Meanwhile, I guess I could throw in a cheese-list dump when that error occurs.
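Something along these lines is what I have in mind for the cheese-list dump. It's a sketch under assumptions: it supposes the free cheese is kept as a singly linked list of chunks, each recording its own size, and the type and function names (free_chunk_t, cheese_survey, cheese_dump) are invented here. The point is to compare total free cheese against the largest single piece, which separates "ran out" from "full of holes".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed shape of a free-list node. */
typedef struct free_chunk {
    size_t size;               /* bytes available in this chunk */
    struct free_chunk *next;   /* next free chunk, or NULL */
} free_chunk_t;

typedef struct {
    size_t count;     /* number of free chunks */
    size_t total;     /* sum of all free chunk sizes */
    size_t largest;   /* size of the biggest single chunk */
} cheese_stats_t;

/* Walk the free list once and summarize it. */
cheese_stats_t cheese_survey(const free_chunk_t *head)
{
    cheese_stats_t s = { 0, 0, 0 };
    for (const free_chunk_t *c = head; c != NULL; c = c->next) {
        s.count++;
        s.total += c->size;
        if (c->size > s.largest) {
            s.largest = c->size;
        }
    }
    assert(s.largest <= s.total);   /* sanity check on the walk */
    return s;
}

/* Print the summary alongside the size of the request that failed. */
void cheese_dump(const free_chunk_t *head, size_t wanted)
{
    cheese_stats_t s = cheese_survey(head);
    printf("wanted %zu: %zu chunks free, %zu total, largest %zu\n",
           wanted, s.count, s.total, s.largest);
}
```

If the total is comfortably above the request but the largest chunk is below it, the problem is fragmentation and not a leak.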
Update 2: The available cheese isn't getting all Swiss; it's getting all crumbly. Apparently adjacent curds are failing to fuse together as they should. Which means there's one more freakin' special case my cheese-release function needs to handle. Er. Or... D'OH! I need to get rid of an else, or perhaps make the merge iterative. Putting a bit of cheese back into the ball checks for merge-before and merge-after, but if the piece being put back fits neatly between two pieces already there, only one of the two merges will happen. Which makes the whole afternoon and evening non-billable time; such is life.
Update 3: That didn't take long. Just needed one more check/merge following the append-after case. Seems to be preventing fragmentation so far. Now to leave a test running overnight, before declaring it good.
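For illustration, here's a stripped-down sketch of a release function that handles the fits-neatly-between case. It is not the actual code: it assumes an address-ordered, singly linked free list in which each chunk's size includes its own header, and the names (chunk_t, free_list, cheese_release) are invented. The thing to notice is that the two merge checks are independent ifs, not an if/else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed free-chunk header; size counts the header too. */
typedef struct chunk {
    size_t size;
    struct chunk *next;   /* next free chunk at a higher address */
} chunk_t;

static chunk_t *free_list = NULL;   /* kept sorted by address */

/* True if 'hi' begins exactly where 'lo' ends. */
static int adjoins(const chunk_t *lo, const chunk_t *hi)
{
    return (const uint8_t *)lo + lo->size == (const uint8_t *)hi;
}

/* Put a chunk back, fusing it with either or both neighbours. */
void cheese_release(chunk_t *c)
{
    assert(c != NULL && c->size >= sizeof *c);

    /* Find the free chunks just below (prev) and just above (next).
       Assumes all chunks come from one contiguous arena. */
    chunk_t *prev = NULL;
    chunk_t *next = free_list;
    while (next != NULL && next < c) {
        prev = next;
        next = next->next;
    }

    /* Link the chunk in between them. */
    c->next = next;
    if (prev != NULL) {
        prev->next = c;
    } else {
        free_list = c;
    }

    /* Merge-after: absorb the following chunk if it touches. */
    if (next != NULL && adjoins(c, next)) {
        c->size += next->size;
        c->next = next->next;
    }

    /* Merge-before: deliberately not an 'else', so a chunk that fits
       neatly between two free neighbours fuses with both. */
    if (prev != NULL && adjoins(prev, c)) {
        prev->size += c->size;
        prev->next = c->next;
    }
}
```

Releasing the middle one of three adjacent chunks, with the outer two already free, should leave a single chunk covering all three. With an else between the two checks it leaves two, and the crumbs pile up from there.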
* Not really toasters, but a sort of appliance that has an embedded controller, which is my department.
** Not really an overdone-toast detector; see above.
*** Just because the Spartan 6 wasn't out yet when I designed the controller board. What kind of excuse is that?
**** Not really cheese, either, but surely you knew that already.