Been tracking a series of Mysterious Problems for a few days now.
All somehow-or-other concern a 25xx-series SPI flash memory chip, and the use thereof.
The data structures in use here are somewhat complicated, and the access functions incompletely debugged, so I've been fixing bugs in the access functions that had been causing the toaster controller to reboot at inconvenient times.
The latest thing, which has occupied my working hours this weekend, is corruption at the beginning of Log Region 2, which is used to store a modest number of variable-length records adding detail to the 32-byte fixed-length records stored in Log Region 1. Somehow, the entry at Region 2, offset 0, is getting overwritten.
And, no matter what traps I set on the write operations, I'm Just Not Seeing It.
Gotta be some kind of race condition... maybe?
So I add a mutex on the flash memory, and that doesn't solve the problem. I toss in delays, and they don't solve the problem.
So I set a watchpoint of sorts. I add a task that just buzzes in a loop, sleeping 50 ms then reading the entry in question and checking to see if the timestamp has changed.
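For the curious, a poor man's watchpoint of that sort can be sketched roughly as below. This is a minimal sketch, not the code from the toaster controller: `watch_t`, `watch_poll`, and the `read_u32_from_ram` stand-in reader are all hypothetical names, and the real reader would go through the flash access functions instead of RAM.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reader callback: returns the timestamp of the watched entry. */
typedef uint32_t (*read_stamp_fn)(void *ctx);

typedef struct {
    read_stamp_fn read;   /* how to fetch the current timestamp */
    void         *ctx;    /* opaque argument passed to the reader */
    uint32_t      last;   /* value seen on the previous poll */
    bool          primed; /* false until the first poll has run */
} watch_t;

/* Stand-in reader for experiments on a desktop: reads a uint32_t from RAM.
 * On the target this would read the entry out of flash instead. */
uint32_t read_u32_from_ram(void *ctx)
{
    return *(uint32_t *)ctx;
}

/* Call once per wake-up. Returns true if the watched value differs from
 * the value seen on the previous call; the first call only records it. */
bool watch_poll(watch_t *w)
{
    uint32_t now = w->read(w->ctx);

    if (!w->primed) {
        w->last = now;
        w->primed = true;
        return false;
    }
    if (now == w->last) {
        return false;
    }
    printf("watch: value changed from %lu to %lu\n",
           (unsigned long)w->last, (unsigned long)now);
    w->last = now;
    return true;
}
```

The task body is then just a loop that sleeps 50 ms using whatever delay call the RTOS provides and calls `watch_poll()`. Polling can only say that a change happened within the last interval, which is why it helps to log every write alongside it.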
And we see:
ltl_write: area 1, addr 00000540, abs addr 00200540, count 32
ltl_write: area 2, addr 000001F8, abs addr 03C001F8, count 24
**************** Entry @2,00000000 changed 000887 to 000084 ****************
Hm. Try a couple more times. And ya know what? It's always just after a write to 01F8. Which...
Oh.
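This sort of flash chip doesn't deal elegantly with write commands that try to cross page boundaries. Run off the end of a page, and the address wraps around: the leftover bytes land at the start of the same page, not the start of the next one.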
I hadn't had problems before, because I'd been writing entries of 32 bytes (Log Region 1) or 512 bytes (firmware updates).
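To make the failure concrete, here is a small model of how a page-program command typically behaves on this family of chips. It is a sketch under assumptions: the 512-byte page size is inferred from the log above (a write at 0x1F8 clobbering offset 0), not taken from a datasheet, and `model_page_program` is a made-up name for a simulation, not a driver call.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 512u  /* ASSUMPTION: inferred from the log, check the datasheet */

/* Simulate one page-program command against an in-RAM image of the flash.
 * The address bits above the page offset stay fixed for the whole command,
 * so running past the end of the page wraps to the start of the same page.
 * Only meant for count <= PAGE_SIZE. */
void model_page_program(uint8_t *mem, uint32_t addr,
                        const uint8_t *src, size_t count)
{
    uint32_t page_base = addr & ~(uint32_t)(PAGE_SIZE - 1u);
    uint32_t offset    = addr & (PAGE_SIZE - 1u);

    for (size_t i = 0; i < count; i++) {
        /* Programming NOR flash can only clear bits, hence AND, not assign. */
        mem[page_base + offset] &= src[i];
        offset = (offset + 1u) & (PAGE_SIZE - 1u);  /* wrap within the page */
    }
}
```

With these assumptions, a 24-byte write at offset 0x1F8 puts 8 bytes at 0x1F8 through 0x1FF and the remaining 16 bytes at 0x000 through 0x00F, right on top of the first entry in the region. Writes of 32 bytes at 32-byte-aligned offsets, or 512 bytes at 512-byte-aligned offsets, never reach the end of a page early, which would explain why they behaved.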
And so... somewhere down in the bowels of the SPI flash protocol stack... I need to insert the same sort of code I have in the I2C EEPROM protocol stack, to break up writes that would cross page boundaries. Or, rather: I need to fix it, because the code is there, dang it, but apparently not catching the case of "less than a page, but misaligned".
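The usual shape of that splitting code is something like the sketch below. It is not the code from my stack: `flash_write_split`, the `page_program_fn` callback, and the `record_chunk` test stub are hypothetical names, and the 512-byte page size is again an assumption inferred from the log. The callback is assumed to do everything one page-program command needs (write enable, the command itself, waiting for the chip to go idle).

```c
#include <stddef.h>
#include <stdint.h>

#define FLASH_PAGE_SIZE 512u  /* ASSUMPTION: must be a power of two, check the datasheet */

/* Hypothetical low-level hook: programs count bytes that all lie within
 * one page. Returns 0 on success, nonzero on failure. */
typedef int (*page_program_fn)(void *ctx, uint32_t addr,
                               const uint8_t *src, size_t count);

/* Break a write into chunks so that no chunk crosses a page boundary.
 * The first chunk is cut at the end of the page containing addr, which
 * covers the "less than a page, but misaligned" case. */
int flash_write_split(page_program_fn program, void *ctx,
                      uint32_t addr, const uint8_t *src, size_t count)
{
    while (count > 0) {
        size_t room  = FLASH_PAGE_SIZE - (addr & (FLASH_PAGE_SIZE - 1u));
        size_t chunk = (count < room) ? count : room;

        int rc = program(ctx, addr, src, chunk);
        if (rc != 0) {
            return rc;
        }
        addr  += (uint32_t)chunk;
        src   += chunk;
        count -= chunk;
    }
    return 0;
}

/* Test stub, not a driver: records the chunks it is asked to program. */
typedef struct {
    uint32_t addr[8];
    size_t   len[8];
    size_t   n;
} chunk_log_t;

int record_chunk(void *ctx, uint32_t addr, const uint8_t *src, size_t count)
{
    chunk_log_t *rec_log = (chunk_log_t *)ctx;
    (void)src;
    if (rec_log->n >= 8) {
        return -1;
    }
    rec_log->addr[rec_log->n] = addr;
    rec_log->len[rec_log->n]  = count;
    rec_log->n++;
    return 0;
}
```

Fed the offending write from the log (24 bytes at absolute address 03C001F8), this issues 8 bytes at 03C001F8 and then 16 bytes at 03C00200. The key point is that the chunk size depends on the distance from the start address to the end of its page, not on whether the total length exceeds a page.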