- Subject: RE: Weird irreproducible error
- From: "Vijay Aswadhati" <vijay@...>
- Date: Sun, 14 Oct 2007 12:24:08 -0700
On Tuesday, October 09, 2007 9:06 AM, Steve Heller wrote:
> On Tue, 09 Oct 2007 11:39:43 -0400, Ralph Hempel
> <[email protected]> wrote:
>
>> Steve Heller wrote:
>>
>>> [snip]
>>> I did write it incrementally, and it worked fine until I had
>>> these weird errors crop up. In fact, it still works fine on
>>> the other machines on which it is installed.
>>
>> Ummm, that's a data point that was missing in previous posts.
>
> Well, the other people haven't run it nearly as much as I have,
> so I'm not sure whether they will run into these problems.
>
>> Now we're looking at what might be a hardware error, or it
>> might be a software error that only manifests itself on certain
>> hardware...
>
This is certainly an interesting data point that is worth looking
into. At my previous company, which made telephony hardware, we
had the weirdest problem. The telephony hardware, the device drivers
and the user-space library allowed developers to create applications
such as IVR (Interactive Voice Response), voice mail and so on. The
API that controlled the hardware gave the developer access to
telephony events such as 'PHONE_ON_HOOK', 'PHONE_OFF_HOOK', etc.; not
exactly the same names, but you should get the idea.

One fine morning all the tests started failing... randomly. The test
applications would receive what seemed like a random stream of
telephony events such as 'PHONE_ON_HOOK', 'PHONE_OFF_HOOK',
'DIGIT_DETECTED', ... and on and on. Most of the time the tests would
run to completion without any errors. Occasionally this stream of
spurious events (spurious because the test did not ask the hardware to
go off hook, go on hook or press a digit) would ruin 12 hours' worth
of batch tests.

After several days of head-scratching by people in the hardware,
device driver and middleware groups, a pattern started to emerge:
- the tests would complete fine if no one touched the machine
  (most of the time)
- the tests would complete when run by this guy (let's call him
  Mr. Keyboard, for he never uses a mouse)

By this time everyone was looking at their last check-ins more
thoroughly. We ruled out the hardware and the middleware, since there
were no commits at these layers.

To cut a long story short, the device driver guy had gotten his hex
numbers wrong and had installed his interrupt handler on the mouse's
interrupt. So every time the mouse moved, the interrupt handler would
read registers of the telephony hardware, which contained garbage, and
interpret the contents as some event in the telephony world.
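
That failure mode can be sketched as a toy Python simulation. This is
only an illustration, not the actual driver: the IRQ numbers, the
Machine class and the register contents are all made up, and the only
point is that a handler installed on the wrong interrupt line decodes
stale register contents as if they were real events.

```python
# Hypothetical IRQ numbers, chosen only for illustration.
MOUSE_IRQ = 0x0C      # assumed mouse interrupt line
TELEPHONY_IRQ = 0x0B  # assumed telephony board interrupt line

EVENTS = ["PHONE_ON_HOOK", "PHONE_OFF_HOOK", "DIGIT_DETECTED"]


class Machine:
    """Toy model of an interrupt table and one telephony status register."""

    def __init__(self):
        self.handlers = {}    # IRQ number -> handler function
        self.status_reg = 0   # only meaningful right after a telephony IRQ
        self.delivered = []   # events passed up to the application

    def install(self, irq, handler):
        self.handlers[irq] = handler

    def raise_irq(self, irq):
        handler = self.handlers.get(irq)
        if handler is not None:
            handler(self)


def telephony_handler(machine):
    # Decode whatever is in the status register as a telephony event.
    machine.delivered.append(EVENTS[machine.status_reg % len(EVENTS)])


def run(install_irq, mouse_moves):
    machine = Machine()
    machine.install(install_irq, telephony_handler)
    for i in range(mouse_moves):
        machine.status_reg = i * 7 + 3   # stale contents, not a real event
        machine.raise_irq(MOUSE_IRQ)     # the mouse moved; no call activity
    return machine.delivered


buggy = run(MOUSE_IRQ, mouse_moves=3)        # handler on the wrong line
correct = run(TELEPHONY_IRQ, mouse_moves=3)  # handler on the intended line
print(buggy)    # one spurious event per mouse movement
print(correct)  # no events, since nothing happened on the telephony line
```

With the handler on the mouse line, every mouse movement produces an
event the test never asked for; with the handler on the intended line,
the same mouse movements produce nothing.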

Alright, I did format the story to fit the audience! But getting
back to the data point, I would certainly ask what changed on this
machine: software, hardware and device drivers. Then roll back each
change until you find the culprit. From following this thread, I
would place my bet that the problem is elsewhere and not in Lua.
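
The roll-back-one-thing-at-a-time approach can be written down as a
small Python sketch. The change names and the passes_with function are
hypothetical stand-ins; on a real machine passes_with would mean
undoing a change and rerunning the test, and for an intermittent
failure it would need to run long enough to be trusted.

```python
def find_culprit(changes, passes_with):
    """Roll back one change at a time; return the first rollback that fixes the test.

    changes: names of everything that changed on the machine.
    passes_with: function taking the set of changes still applied and
        returning True if the test run is clean.
    """
    applied = set(changes)
    for change in changes:
        trial = applied - {change}   # roll back just this one change
        if passes_with(trial):
            return change
    return None  # no single rollback fixed it


# Hypothetical change list, for illustration only.
changes = ["lua_script_update", "new_mouse_driver", "ram_upgrade"]


def passes_with(applied):
    # Stand-in test: pretend the failure only occurs while the
    # (hypothetical) mouse driver change is installed.
    return "new_mouse_driver" not in applied


print(find_culprit(changes, passes_with))
```

If no single rollback makes the test pass, the function returns None,
which would suggest that the problem involves more than one change or
none of the listed ones.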
Cheers,
Vijay Aswadhati