Reliability
March 25th, 2008 by kerrysoft and tagged , application, design pattern, Java, SQL
Also see: Spring Web Flow features and feedback request
I’ve
been putting off writing this blog, not just because I’m on vacation in
Maui and have far more tempting things to do. It’s because one of my blogs has already
been used on Slashdot as evidence that Windows cannot scale and won’t support
distributed transactions (http://slashdot.org/comments.pl?sid=66598&cid=6122733 ), despite the fact that Windows scales well and does
support distributed transactions.
So I’m nervous that this new blog will be quoted somewhere as evidence
that managed applications cannot be reliable.
The fact
is that there are a lot of robust managed applications out there. During V1 of.NET, we all ran a
peer-to-peer application called Terrarium.
This could run as a background application and would become a screen
saver during periods of inactivity.
Towards the end of the release, I was running a week at a time without
incident. The only reason for
recycling the process after a week was to switch to a newer CLR so it could be
tested.
Also see: Web Access for Visual Studio Team System
Also see: Channel 9 Interview
Also see: Determining the Referencing Assembly
Also see: Big in Japan
Also see: Startup, Shutdown and related matters
Each
team ran their own stress suite in their own labs on dedicated machines. And many teams would “borrow” machines
from team members at night, to get even more machine-hours of stress
coverage. These stress runs are
generally more stressful than normal applications. For example, the ASP.NET team would put
machines under high load with a mix of applications in a single worker
process. They would then recycle
various AppDomains in that worker process every two or three minutes. The worker process was required to keep
running indefinitely with no memory growth, with no change in response time for
servicing requests, and with no process recycling. This simulates what happens if you
update individual applications of a web site over an extended period of
time.
Another
example of stress was our GC Stress runs.
We would run our normal test suites in a mode where a GC was triggered on
every thread at every point that a GC would normally be tolerated. This includes a GC at every machine
instruction of JITted code. Each GC
was forced to compact so that stale references could be detected.
We also
have a “threaded stress” which tries to break loose as many race conditions in
the CLR as possible by throwing many threads at each operation.
Our
stress efforts are getting better over time. The dev lead for the 64-bit effort
recently told me that we’ve already had more 64-bit process starts in our lab
than all of the 32-bit process starts during the 5 years of building V1 and
V1.1. We’ve got a very large (and
very noisy) lab that is crammed with 64-bit boxes that are constantly running
every test that we can throw at them.
Having
said all that, CLR reliability in V1 and V1.1 still falls far short of where we
would like it to be. When these
releases of the CLR encounter a serious problem, we “fail fast” and terminate
the process. Examples of serious
problems include:
Improve performance of customer support with Jerry Messenger. Live Chat Support to your website visitors.
- ExecutionEngineException
- An Access Violation inside mscorwks.dll or mscoree.dll
(except in a few specific bits of code, like the write barrier code, where AVs
are converted into normal NullReferenceExceptions). - A corrupt GC heap
- Stack overflow
- Out of memory
The
first three of the above examples are legitimate reasons for the process to
FailFast. They represent serious
bugs in the CLR, or serious bugs in (highly trusted / unsafe) portions of the
frameworks and the application.
It’s probably a security risk to continue execution under these
circumstances, because it’s easy to imagine cases where type safety or other
invariants have been violated.
Also see: Applied Metamodelling: A Foundation for Language Driven Development
Also see: When Will Foreign Ownership of US Sports Teams Start ?
Also see: Web Access for Visual Studio Team System
Also see: Generics and .NET
But the
last two cases (stack overflow and memory exhaustion) are a different
matter. In a perfect world – i.e.
some future version of our platform – these resource errors should be handled
more gracefully. This means that
the CLR should be hardened to tolerate them and the managed application should
be able to trap these conditions and handle them.
There’s
a lot of work involved in getting there.
I want to explain some of the complexity that’s involved. But first I would like to point out why
the current situation isn’t as bad as it seems.
For
example, ASP.NET is able to pursue 100% reliability without worrying too much
about resource errors. That’s
because ASP.NET can use AppDomain recycling and process recycling to avoid
gradual “decay” of the server process.
Even if the server is leaking resources at a moderate rate, recycling
will reclaim those resources. And,
if a server application is highly stack intensive, it can be run with a larger
stack or it can be rewritten to use iteration rather than recursion.
Also see: Simplifying Web Service development with JSR-181
Also see: What Are You Destined to Be ?
Also see: Claimspace, a Long Tail Recognition System
Also see: Simplifying Web Service development with JSR-181
And for
client processes, it’s historically been the case that excessive paging occurs
before actual memory exhaustion.
Since performance completely tanks when we thrash virtual memory, the
user often kills the process before the CLR’s FailFast logic even kicks in. In a sense, the human is proactively
recycling the process the same way ASP.NET does.
This
stopped being the case for server boxes some time ago. It’s often the case that server boxes
have enough physical memory to back the entire 2 or 3 GB of user address space
in the server process. Even when
there isn’t quite this much memory, memory exhaustion often means address space
exhaustion and it occurs before any significant paging has occurred.
This is
increasingly the case for client machines, too. It’s clear that many customers are
bumping up against the hard limits of 32-bit processing.
Also see: Channel 9 Interview
Also see: Claimspace, a Long Tail Recognition System
Also see: Startup, Shutdown and related matters
Also see: Startup, Shutdown and related matters
Also see: Generics and .NET
Also see: Generics and .NET
Also see: Why Yahoo should say Yes to MicroSoft
Also see: Determining the Referencing Assembly
Anyway,
back to stack overflow & memory exhaustion errors. These are both resource errors, similar
to the inability to open another socket, create another brush, or connect to
another database. However, on the
CLR team we categorize them as “asynchronous exceptions” rather than resource
errors. The other common
asynchronous exception is ThreadAbortException. It’s clear why ThreadAbortException is
asynchronous: if you abort another thread, this could cause it to throw a
ThreadAbortException at any machine instruction in JITted code and various other
places inside the CLR. In that
sense, the exception occurs asynchronously to the normal execution of the
thread.
(In fact, the CLR will currently induce a ThreadAbortException while you
are executing exception backout code like catch clauses and finally blocks. Sure, we’ll reliably execute your
backout clauses during the processing of a ThreadAbortException – but we’ll
interrupt an existing exception backout in order to induce a
ThreadAbortException. This has been
a source of much confusion. Please
don’t nest extra backout clauses in order to protect your code from this
behavior. Instead, you should
assume a future version of the CLR will stop inducing aborts so
aggressively.)
Also see: Win friends and influence your team
Also see: What Are You Destined to Be ?
Also see: Determining Whether a File Is an Assembly
Also see: From C# to Java: Part 5
Also see: Life Calculus
Also see: Should “Membership Stores” Be Permitted in Redmond’s Manufacturing Park Zone?
Also see: Simplifying Web Service development with JSR-181
Also see: What Are You Destined to Be ?
Also see: Blogging and Newspapers, a Lesson in How Not to Brand and Market
Also see: Bloggers in the Mavs Locker Room ?
Also see: Life Calculus
Also see: Web Access for Visual Studio Team System
Also see: Channel 9 Interview
Also see: The Internet is Officially Dead & Boring – Its the economy stupid !
Also see: Claimspace, a Long Tail Recognition System
Also see: Applied Metamodelling: A Foundation for Language Driven Development
Also see: Win friends and influence your team
Also see: Web Access for Visual Studio Team System
Also see: Determining Whether a File Is an Assembly
Also see: Generics and .NET
Also see: Spring Web Flow features and feedback request
Now why
would we consider stack overflow & memory exhaustion to be
asynchronous? Surely they only
occur when the application calls deeper into its execution or when it attempts
to allocate memory? Well, that’s
true. But unfortunately the extreme
virtualization of execution that occurs with managed code works against us
here. It’s not really possible for
the application to predict all the places that the stack will be grown or a
memory allocation will be attempted.
Even if it were possible, those predictions would be
version-brittle. A different
version of the CLR (or an independent implementation like Mono, SPOT, Rotor or
the Compact Frameworks) will certainly behave differently.
Here are
some examples of memory allocations that might not be obvious to a managed
developer:
- Implicit boxing occurs in some languages, causing value
types to be instantiated on the heap. - Class constructor (.cctor) methods are executed by the CLR
prior to the first use of a class. - In the future, JITting might occur at a finer granularity
than a method. For example, a
rarely executed basic block containing a ‘throw’ clause might not be JITted
until first use. - Although we chose to remove this from our V1 product, the
CLR used to discard code and re-JIT it even during return paths. - Class loading can be delayed until the first use of a
class. - For domain-neutral code, the storage for static fields is
duplicated in each AppDomain.
Some versions of the CLR have allocated this storage lazily, on first
access. - Operations on MarshalByRefObjects must sometimes be
remoted. This requires us to
allocate during marshaling and unmarshaling. Along the same lines, casting a
ComObject might cause a QueryInterface call to unmanaged code. - Accessing the Type of an instance, or accessing the current
Thread, or accessing various other environmental state might cause us to
lazily materialize that state. - Security checks are implicit to many operations. These generally involve
instantiation. - Strings are immutable. Many “simple” string operations must
allocate as a consequence. - Future versions of the CLR might delay allocation of
portions of an instance for better cache effects. This already happens for some state,
like an instance’s Monitor lock and – in some versions and circumstances – its
hash code. - VTables are a space-inefficient mechanism for dispatching
virtual calls and interface calls.
Other popular techniques involve caching dispatch stubs which are
lazily created.
The
above is a very partial list, just to give a sense of how unpredictable this
sort of thing is. Also, any dynamic
memory allocation attempt might be the one that drives the system over the edge,
because the developer doesn’t know the total memory available on the target
system, and because other threads and other processes are asynchronously
competing for that same unknown pool.
But a
developer doesn’t have to worry about other threads and processes when he’s
considering stack space. The total
extent of the stack is reserved for himf when the thread is created. And he can control how much of that
reservation is actually committed at that time.
It
should be obvious that it’s inadequate to only reserve some address space for
the stack. If the developer doesn’t
also commit the space up front, then any subsequent attempt to commit a page of
stack could fail because memory is unavailable. In fact, Windows has the unfortunate
behavior that committing a page of stack can fail even if plenty of memory is
available. This happens if the swap
file needs to be extended on disk.
Depending on the speed of your disk, extending the swap file can take a
while. The attempt to commit your
page of stack can actually time out during this period, giving you a spurious
fault. If you want a robust
application, you should always commit your stack reservations
eagerly.
Also see: Channel 9 Interview
Also see: Generics and .NET
Also see: Generics and .NET
Also see: Why Yahoo should say Yes to MicroSoft
Also see: Why Yahoo should say Yes to MicroSoft
So
StackOverflowException is more tractable than OutOfMemoryException, in the sense
that we can avoid asynchronous competition for the resource. But stack overflows have their own
special problems. These
include:
- The difficulty of predicting how much stack is
enough.
This is similar to the difficulty of predicting where and how large the
memory allocations are in a managed environment. For example, how much stack is used if a
GC is triggered while you are executing managed code? Well, if the code you are executing is
“while (true) continue;” the current version of the CLR might need several pages
of your stack. That’s because we
take control of your thread – which is executing an infinite loop – by rewriting
its context and vectoring it to some code that throws an exception. I have no idea whether the Compact
Frameworks would require more or less stack for this situation.
Also see: Channel 9 Interview
- The difficulty of presenting an exception via SEH
(Structured Exception Handling) when the thread is low on stack
space.
If
you are familiar with stack overflow handling in Windows, you know that there is
a guard bit. This bit is set on all
reserved but uncommitted stack pages.
The application must touch stack pages a page at a time, so that these
uncommitted pages can be committed in order. (Don’t worry – the JIT ensures that we
never skip a page). There are 3
interesting pages at the end of the stack reservation. The very last one is always
unmapped. If you ever get that far,
the process will be torn down by the operating system. The one before that is the application’s
buffer. The application is allowed
to execute all its stack-overflow backout using this reserve. Of course, one page of reserve is
inadequate for many modern scenarios.
In particular, managed code has great difficulty in restricting itself to
a single page.
The page before the backout reserve page is the one on which the
application generates the stack overflow exception. So you might think that the application
gets two pages in which to perform backout, and indeed this can sometimes be the
case. But the memory access that
triggers the StackOverflowException might occur in the very last bits of a
page. So you can really only rely
on 1 page for your backout.
A
further requirement is that the guard bit must be restored on the final pages,
as the thread unwinds out of its handling.
However, without resorting to hijacking some return addresses on the
stack, it can be difficult to guarantee that the guard bits can be
restored. Failure to restore the
guard bit means that stack overflow exceptions won’t be reliably generated on
subsequent recursions.
- The inability of the CLR (and parts of the operating
system!) to tolerate stack overflows at arbitrary places.
I’m told that a stack overflow exception at just the wrong place in
EnterCriticalSection or LeaveCriticalSection (I forget which) will leave the
critical section in a corrupt state.
Whether or not this is true, I would be amazed if the user-mode portion
of the operating system is completely hardened to stack overflow
conditions. And I know for a fact
that the CLR has not been.
In
order for us to report all GC references, perform security checks, process
exceptions and other operations, we need the stack to be crawlable at all
times. Unfortunately, some of the
constructs we use for crawling the stack are allocated on the stack. If we should take a stack overflow
exception while erecting one of these constructs, we cannot continue managed
execution. Today, this situation
drives us to our FailFast behavior.
In the future, we need to tighten up some invariants between the JIT and
the execution engine, to avoid this catastrophe. Part of the solution involves adding
stack probes throughout the execution engine. This will be tedious to build and
maintain.
- Unwinding issues
Managed exception handling is largely on the Windows SEH plan. This means that filter clauses are
called during the first pass, before any unwinding of the stack has
occurred. We can cheat a little
here: if there isn’t enough stack
to call the managed filter safely, we can pretend that it took a nested stack
overflow exception when we called it.
Then we can interpret this failure as “No, I don’t want to handle this
exception.”
When the first pass completes, we know where the exception will be caught
(or if it will remain unhandled).
The finally and fault blocks and the terminating catch clause are
executed during the second pass. By
the end of the second pass, we want to unwind the stack. But should we unwind the stack
aggressively, giving ourselves more and more stack for subsequent clauses to
execute in? If we are unwinding a
StackOverflowException, we would like to be as aggressive as possible. But when we are interoperating with C++
exceptions, we must delay the unwind.
That’s because the C++ exception is allocated on the stack. If we unwind and reuse that portion of
the stack, we will corrupt the exception state.
(In fact, we’re reaching the edges of my understanding here. I think that a C++ rethrow effectively
continues the first pass of the initial exception, looking for a new handler
further up the stack. And I think that this means the stack cannot
be unwound until we reach the end of a C++ catch clause. But I’m constantly surprised by the
subtleties of exception handling).
So it’s
hard to predict where OutOfMemoryException, StackOverflowException and
ThreadAbortException might occur.
But one
day the CLR will harden itself so it can tolerate these exceptions without
requiring a FailFast escape hatch.
And the CLR will do some (unspecified) fancy work with stack reserves and
unwinding so that there’s enough stack available for managed code to process
StackOverflowExceptions more like regular exceptions.
At that
point, managed code could process
asynchronous exceptions just like normal exceptions.
Where
does that leave the application?
Unfortunately, it leaves it in a rather bad spot. Consider what would happen if all
application code had to be hardened against these asynchronous exceptions. We already know that they can occur
pretty much anywhere. There’s no
way that the application can pin-point exactly where additional stack or memory
might be required – across all versions and implementations of the
CLI.
As part
of hardening, any updates the application makes to shared state must be
transacted. Before any protecting
locks are released via exception backout processing, the application must
guarantee that all shared state has been returned to a consistent state. This means that the application must
guarantee it can make either forward or backward process with respect to that
state – without requiring new stack or memory resources.
For
example, any.cctor method must preserve the invariant that either the class is
fully initialized when the.cctor terminates, or that an exception escapes. Since the CLR doesn’t support
restartable.cctors, any exception that escapes will indicate that the class is
“off limits” in this AppDomain.
This means that any attempt to use the class will receive a
TypeInitializationException. The
inner exception indicates what went wrong with initializing this class in this
AppDomain. Since this might mean
that the String class is unavailable in the Default AppDomain (which would be
disastrous), we’re highly motivated to add support for restartable.cctors in a
future release of the CLR.
Let’s
forget about managed code for a moment, because we know that the way we
virtualize execution makes it very difficult to predict where stack or memory
resources might be required.
Instead, imagine that you are writing this “guaranteed forward or
backward progress” code in unmanaged code.
I’ve done it, and I find it is very difficult. To do it right, you need strict coding
rules. You need static analysis
tools to check for conformance to those rules. You need a harness that performs fault
injection. You need hours of
directed code reviews with your brightest peers. And you need many machine-years of
stress runs.
It’s a
lot of work, which is only warranted for those few pieces of code that need to
be 100% robust. Frankly, most
applications don’t justify this level of effort. And this really isn’t the sweet spot for
managed development, which targets extremely high productivity.
So today
it’s not possible to write managed code and still be 100% reliable, and most
developers shouldn’t even try. But
our team has a broad goal of eventually supporting all unmanaged coding
techniques in managed code (with some small epsilon of performance loss). Furthermore, the CLR and frameworks
teams already have a need for writing some small chunks of reliable managed
code. For example, we are trying to
build some managed abstractions that guarantee resource cleanup. We would like to build those
abstractions in managed code, so we can get automatic support for GC reporting,
managed exceptions, and all that other good stuff. We’re prepared to invest all the care
and effort necessary to make this code reliable – we just need it to be possible.
So I
think you’ll see us delivering on this capability in the next release or
two. If I had to guess, I think
you’ll see a way of declaring that some portion of code must be reliable. Within that portion of code, the CLR
won’t induce ThreadAbortExceptions asynchronously. As for stack overflow and memory
exhaustion, the CLR will need to ensure that sufficient stack and memory
resources are available to execute that portion of code. In other words, we’ll ensure that all
the code has been JITted, all the classes loaded and initialized, all the
storage for static fields is pre-allocated, etc. Obviously the existence of indirections
/ polymorphism like virtual calls and interface calls makes it difficult for the
CLR to deduce exactly what resources you might need. We will need the developer to help out
by indicating exactly what indirected resources must be prepared. This will make the technique onerous to
use and somewhat expensive in terms of working set. In some ways, this is a good thing. Only very limited sections of managed
code should be hardened in this manner.
And most of the hardened code will be in the frameworks, where it
belongs.
Here are
my recommendations:
- Application code is responsible for dealing with
synchronous application exceptions.
It’s everyone’s responsibility to deal with a FileNotFoundException
when opening a file on disk. - Application code is not responsible for dealing with
asynchronous exceptions. There’s
no way we can make all our code bullet-proof when exceptions can be triggered
at every machine instruction, in such a highly virtualized execution
environment. - Perhaps 0.01% of code will be especially hardened using
not-yet-available techniques.
This code will be used to guarantee that all resources are cleaned up
during an AppDomain unload, or to ensure that pending asynchronous operations
never experience races when unloads occur. We’re talking tricky “systems”
operations.
Well, if
the application isn’t responsible for dealing with asynchronous exceptions, who
is?
That’s
the job of the process host. In the
case of ASP.NET, it’s the piece of code that decides to recycle the process when
memory utilization hits some threshold.
In the case of SQL Server, it’s the piece of code that decides whether to
abort a transaction, unload an AppDomain, or even suspend all managed
activity. In the case of a random
console application, it’s the [default] policy that might retain the FailFast
behavior that you’ve seen in V1 and V1.1 of the CLR. And if Office ever builds a managed
version, it’s the piece of code I would expect to see saving edits and unloading
documents when memory is low or exhausted.
(Don’t read anything into that last example. I have no idea when/if there will be a
managed Excel). In other words, the
process host knows how important the process is and what pieces can be discarded
to achieve a consistent application state.
In the case of ASP.NET, the only reason to keep the process running is to
avoid a pause while we spin up a new process. In the case of SQL Server, the process
is vitally important. Those guys
are chasing 5 9’s of availability.
In the case of a random console application, there’s a ton of state in
the application that will be lost if the process goes down. But there probably isn’t a unit of
execution that we can discard, to get back to a consistent application
state. If the process is corrupt,
it must be discarded.
In V1
& V1.1, it’s quite difficult for a process host to specify an appropriate
reaction or set of reactions when an asynchronous exception occurs. This will get much easier in our next
release.
As
usual, I don’t want to get into any specifics on exactly what we’re going to
ship in the future. But I do hope
that I’ve painted a picture where the next release’s features will make sense as
part of a larger strategy for dealing with reliability.
http://blogs.msdn.com/cbrumme/archive/2003/06/23/51482.aspx
Posted in Technology | Comments Off
