Had an interesting situation last week, on the same project mentioned in yesterday afternoon's post.
Response time to a fairly simple HTTP request was all over the place. Well, sort of all over. In my testing, it was somewhere between 2 and 20 milliseconds, mostly at the low end of the range. At the client site, it was usually just a little over 2 ms, but every once in a while it would be slightly over 200, 300, or 400 ms.
Eh?
And so I headed over there to see what was different. Firing up Wireshark, we found that every once in a while there was indeed a 400 ms gap in the gadget's response (in the setup at hand, with a Mac initiating the request), caused by a complete failure to respond to the request and the Mac retransmitting the packet.
Fun fact: When PHP's file_get_contents function is used to make an HTTP request, it sends each line of the request as a separate packet.
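To make that concrete: a request along these lines (details made up, not from the actual project) goes out as three separate TCP segments, one per line, rather than as a single packet.

    GET /status HTTP/1.0      <- segment 1
    Host: gadget.example      <- segment 2
    (blank line)              <- segment 3

That detail turns out to matter later.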
Anyway, the first thought was that the packet was somehow getting lost, so I spent quite a while, with the customer watching, trying to find Ethernet error indications.
Pro tip: If you set a breakpoint in the Eclipse debugger and it obstinately fires on the first line of the next function in the source file, the function you were setting breakpoints in probably isn't called from anywhere and has been optimized out, so all of its lines resolve to the next address that actually has code associated with it. Eclipse won't tell you that this has happened.
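A minimal sketch of how that happens, with made-up names (nothing here is from the actual project):

    /* Built with optimization on (e.g. gcc -O2): the unused static
     * function is discarded entirely, and the debug info maps its
     * lines to the next address that actually has code behind it. */
    static int unused_helper(int x)     /* never called anywhere  */
    {
        return x * 2;                   /* breakpoint set here... */
    }

    int used_function(void)
    {
        return 7;                       /* ...stops here instead  */
    }

(gcc's -Wunused-function warning will at least flag the dead function at compile time.)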
Well, there was no sign of any errors, nor buffer overruns, nor any of those things. Then I noticed that there was a DUP ACK associated with the event: both the missed packet and its replacement were in fact being ACK'ed, but not until the replacement came in.
So not a lost packet, but a delay in dealing with it.
Eventually, this led to the real story: the original packet was correctly received and then just sat there; when another packet arrived, both were promptly processed.
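Schematically (this is the shape of the thing, not the actual capture):

    t = 0.000   Mac -> gadget   packet N         (received, sits in the buffer)
    t = 0.400   Mac -> gadget   packet N again   (TCP retransmission)
    t = 0.400   gadget -> Mac   ACK              (both copies ACK'ed at once)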
At this point, I took the "process any received packets" code from the interrupt routine in the Ethernet driver and copied it into the 1 ms polling routine.
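In sketch form, with hypothetical names standing in for the real driver's functions:

    int  eth_rx_frame_pending(void);    /* provided by the driver       */
    void eth_deliver_frame(void);       /* hands one frame to the stack */

    /* The same frame-processing loop the interrupt routine runs,
     * now also called from the 1 ms poll, so a missed receive
     * interrupt can stall reception by at most about a millisecond. */
    void eth_process_rx(void)
    {
        while (eth_rx_frame_pending())
            eth_deliver_frame();
    }

    void poll_1ms(void)                 /* the existing 1 ms polling routine */
    {
        eth_process_rx();               /* catch anything the ISR missed     */
    }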
Ta-daaaaa! Now everything's processed promptly.
And that explains why I wasn't seeing the problem: there's enough broadcast activity on my network that, if a receive interrupt was missed, there was a pretty high probability of another packet showing up soon and getting things moving again.
I haven't positively confirmed the cause, but the interrupt routine (Somebody Else's Code) appears to be munching the data and then clearing the interrupt flags. Which can be problematic: if another packet arrives during the processing, clearing the flags afterwards throws away its pending interrupt (see also: each line of the request sent as a separate packet).
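Sketching the suspected bug with made-up register names (and assuming write-one-to-clear flags, which is common but by no means universal):

    #define RX_DONE 0x01u
    extern volatile unsigned ETH_IFLAGS;   /* stand-in for the real hardware */
    void eth_process_rx(void);             /* from the sketch above          */

    void eth_isr_as_found(void)            /* suspected current behavior */
    {
        eth_process_rx();                  /* munch the data first...    */
        ETH_IFLAGS = RX_DONE;              /* ...then clear the flag: a  */
    }                                      /* frame that arrived during  */
                                           /* the munching is wiped out  */

    void eth_isr_safer(void)
    {
        ETH_IFLAGS = RX_DONE;              /* clear first: a frame arriving */
        eth_process_rx();                  /* during processing sets the    */
    }                                      /* flag again and re-interrupts  */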
Also, I'm thinking the interrupt routine shouldn't be doing the data-munching anyway. Maybe it should just clear the flags and kick a semaphore, and the munching of received data should happen in task space rather than in the interrupt?
Yeah, just a crazy thought. But... since the TCP/IP stack in question is not task-safe... is it interrupt-safe? Or could the interrupt-driven data-munching cause problems, beyond the obvious one of spending excessive time in the interrupt handler?
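For what it's worth, here's roughly what that could look like, with a generic, hypothetical semaphore API standing in for whatever the real RTOS provides:

    typedef struct semaphore semaphore_t;  /* hypothetical RTOS API  */
    extern semaphore_t rx_sem;
    void sem_post_from_isr(semaphore_t *s);
    void sem_wait(semaphore_t *s);
    extern volatile unsigned ETH_IFLAGS;   /* as in the sketch above */
    #define RX_DONE 0x01u
    void eth_process_rx(void);

    void eth_isr(void)                     /* ISR does the bare minimum */
    {
        ETH_IFLAGS = RX_DONE;              /* acknowledge the hardware  */
        sem_post_from_isr(&rx_sem);        /* wake the receive task     */
    }

    void eth_rx_task(void)                 /* all munching happens here */
    {
        for (;;) {
            sem_wait(&rx_sem);             /* sleep until the ISR signals */
            eth_process_rx();              /* munch frames in task space  */
        }
    }

Besides keeping time in the handler short, that would mean the not-task-safe stack only ever runs in one context.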
The more I work with this third- and fourth-party code, the more I think I need to find the time to port AGROS to the Cortex-M architecture and polish it up for a proper open-source release.