r***@yahoo.com
2009-11-23 18:10:19 UTC
I have a problem with a production system: some of its processes have
recently started crashing because of a software error. On the
production machines I have set up Dr. Watson to capture a crash dump
file for post-mortem analysis. However, even though Dr. Watson is
correctly registered as the JIT debugger (AeDebug key), no crash dump
is produced, and the faulty process simply disappears without leaving
any trace of what happened.
The platform is Windows XP Professional x64 Edition (64-bit). The
software is native C code, built as a 64-bit release with Microsoft
Visual C++ 8 (Visual Studio 2005).
I have managed to reproduce this behavior with a small 30-line C
program. The crucial point is that one of the functions contains a
buffer overrun that also overwrites the function's return address.
The function therefore returns to a bogus location, and the invalid
RIP register immediately raises an Access Violation exception when the
next instruction is fetched.
This Access Violation exception is always caught when the faulty
process runs under a debugger. However, when the process runs on its
own (not as a debuggee), the Access Violation cannot be caught by any
means: not by a function-level SEH __try/__except block, not by
UnhandledExceptionFilter, and not by the external JIT debugger. This
effectively rules out any post-mortem crash analysis.
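To make the two in-process mechanisms named above concrete, here is a
minimal sketch of a function-level __try/__except frame and a top-level
filter installed with SetUnhandledExceptionFilter. It is not part of
the test program: the names describe_exception_code, top_level_filter,
install_top_level_filter and run_guarded are hypothetical, and the
Windows-specific part is guarded by _WIN32 so that the logging helper
alone compiles elsewhere.

```c
#include <stdio.h>

/* Hypothetical logging helper: maps a few well-known exception codes
 * to readable names. The numeric values are the documented NTSTATUS
 * codes. */
const char *describe_exception_code(unsigned long code)
{
    switch (code) {
    case 0xC0000005UL: return "EXCEPTION_ACCESS_VIOLATION";
    case 0xC00000FDUL: return "EXCEPTION_STACK_OVERFLOW";
    case 0xC0000409UL: return "STATUS_STACK_BUFFER_OVERRUN";
    default:           return "unknown exception";
    }
}

#ifdef _WIN32
#include <windows.h>

/* Top-level filter: called only if no SEH frame handles the exception. */
LONG WINAPI top_level_filter(EXCEPTION_POINTERS *info)
{
    fprintf(stderr, "unhandled: %s at %p\n",
            describe_exception_code(info->ExceptionRecord->ExceptionCode),
            info->ExceptionRecord->ExceptionAddress);
    /* Continue the search so the default handling (and a registered
     * JIT debugger) still gets a chance to run. */
    return EXCEPTION_CONTINUE_SEARCH;
}

void install_top_level_filter(void)
{
    SetUnhandledExceptionFilter(top_level_filter);
}

/* Runs fn() inside a function-level SEH frame.
 * Returns 1 if an exception was caught, 0 otherwise. */
int run_guarded(int (*fn)(void))
{
    int raised = 0;
    __try {
        fn();
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        fprintf(stderr, "caught: %s\n",
                describe_exception_code(GetExceptionCode()));
        raised = 1;
    }
    return raised;
}
#endif
```

In a test one would call install_top_level_filter() at the start of
main and then run_guarded(func_a) in place of calling the faulting
function directly.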
As an experiment, I switched back to win32 and found that everything
works properly there: the Access Violation exception is caught by the
function-level SEH frame, by UnhandledExceptionFilter, and by the
external JIT debugger. Everything works on win32; nothing works on
x64.
I did some research on the Internet and found that SEH on the Windows
x64 platform was given a major overhaul and is implemented quite
differently from SEH on win32.
I still can't find an answer, however: why can't an Access Violation
exception stemming from a buffer overrun and stack corruption be
caught by any means on x64? Having critical processes silently vanish
from a production system without leaving any trace is a big concern
to me.
The small 30-line C program that reproduces this behaviour follows.
Compile and link it with Microsoft Visual C++ 8 as a console
application with /EHa, release mode, x64 (the rest of the settings
are pretty much standard). With a JIT debugger set up in the registry,
run it and watch the process vanish without giving the JIT tool any
chance to get hold of it:
#include <stdio.h>
#include <stdlib.h>
#include <Windows.h>

int func_a()
{
    int res = 3;
    int arr[4000];
    int i;

    // deliberately write 12 extra bytes past the end of arr in order
    // to overwrite the function return address
    // keep this code complex enough that the optimizer does not
    // eliminate it
    for (i = 0; i < 4003; ++i)
        arr[i] = rand();
    for (i = 0; i < 4003; ++i)
        res += arr[i];
    return res;
}

int main()
{
    func_a();
    printf("Sleep\n");
    Sleep(2000);
    printf("Done\n");
    return 0;
}
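The size of the overrun in func_a can be checked with a few lines of
portable C. This is only an illustration of the arithmetic; the helper
name overrun_bytes is hypothetical, and where the extra bytes land
relative to the return address depends on the frame layout the
compiler chooses.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes written past the end of an int
 * array of `capacity` elements when a loop stores `writes` elements
 * into it, starting at index 0. */
size_t overrun_bytes(size_t capacity, size_t writes)
{
    if (writes <= capacity)
        return 0;                       /* everything fits, no overrun */
    return (writes - capacity) * sizeof(int);
}
```

With a 4-byte int, as in the Visual C++ x64 build, overrun_bytes(4000,
4003) gives the 12 bytes mentioned in the comment in func_a.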
I understand that the stack is corrupt, so a stack trace might not be
available, but the failed process still contains lots of information
that is useful for post-mortem analysis (a partially corrupt stack
trace, the full memory image, registers, etc.). Why can't a crash
dump be produced in this scenario?
Robert