Memory Model

May 28th, 2008 by kerrysoft



One of the suggestions for a blog entry was the managed memory model. This is timely, because we’ve just been
revising our overall approach to this confusing topic. For the most part, I write about product
decisions that have already been made and shipped. In this note, I’m talking about future
directions. Be skeptical.


 


So what is a memory model? It’s the
abstraction that makes the reality of today’s exotic hardware comprehensible to
software developers.


 


The reality of hardware is that CPUs are renaming registers, performing speculative
and out-of-order execution, and fixing up the world during retirement. Memory state is cached at various levels
in the system (L0 through L3 on modern X86 boxes, presumably with more levels on
the way). Some levels of cache are
shared between particular CPUs but not others. For example, L0 is typically per-CPU, but
a hyper-threaded CPU may share L0
between the logical CPUs of a single physical CPU. Or an 8-way box may split the system into two
hemispheres, with cache controllers performing an elaborate coherency protocol
between these separate hemispheres.
If you consider caching effects, at some level all MP (multi-processor)
computers are NUMA (non-uniform memory access). But there’s enough magic going on that
even a Unisys 32-way can generally be treated as UMA by
developers.



 


It’s reasonable for the CLR to know as much as possible about the cache architecture
of your hardware so that it can exploit any imbalances. For example, the developers on our
performance team have experimented with a scalable rendezvous for phases of the
GC. The idea was that each CPU
establishes a rendezvous with the CPU that is “closest” to it in distance in the
cache hierarchy, and then one of this pair cascades up a tree to its closest
neighbor until we get to a single root CPU.
At that point, the rendezvous is complete. I think the jury is still out on this
particular technique, but they have found some other techniques that really pay
off on the larger systems.


 


Of course, it’s absolutely unreasonable for any managed developer (or 99.99% of
unmanaged developers) to ever concern themselves with these imbalances. Instead, software developers want to
treat all computers as equivalent.
For managed developers, the CLR is the computer, and it had better work
consistently regardless of the underlying machine.


 


Although managed developers shouldn’t need to know the difference between a 4-way
AMD server and an Intel P4 hyper-threaded dual proc, they still need to face the
realities of today’s hardware.
Today, I think the penalty of a CPU cache miss that goes all the way to
main memory is about 1/10th the penalty of a memory miss that goes
all the way to disk. And the trend
is clear.


 


If you wanted good performance on a virtual memory system, you’ve always been
responsible for relieving the paging system by getting good page density and
locality in your data structures and access patterns.


 


In a similar vein, if you want good performance on today’s hardware, where
accessing main memory is a small disaster, you must pack your data into cache
lines and limit indirections. If
you are building shared data structures, consider separating any data that’s
subject to false sharing.
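As a rough illustration of separating data that is subject to false sharing, the sketch below gives each worker its own counter and pads each counter out to 128 bytes so that no two counters can land on the same cache line. The names PaddedCounter and FalseSharingSketch are hypothetical, and the 64-byte cache line size is an assumption about the hardware, not something the CLR promises.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Hypothetical type: one counter per worker. The struct is 128 bytes and the
// value lives at offset 64, so values in adjacent array elements are 128 bytes
// apart and cannot share a 64-byte cache line (assumed line size).
[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedCounter
{
    [FieldOffset(64)]
    public long Value;
}

static class FalseSharingSketch
{
    // Each worker increments only its own padded slot, so no lock is needed
    // and the workers do not fight over a shared cache line.
    public static long[] CountInParallel(int workers, int iterations)
    {
        var counters = new PaddedCounter[workers];
        Parallel.For(0, workers, w =>
        {
            for (int i = 0; i < iterations; i++)
                counters[w].Value++;
        });

        // Parallel.For has completed, so reading the slots here is safe.
        var totals = new long[workers];
        for (int w = 0; w < workers; w++)
            totals[w] = counters[w].Value;
        return totals;
    }

    static void Main()
    {
        long[] totals = CountInParallel(4, 1000000);
        Console.WriteLine(string.Join(", ", totals));
    }
}
```

Whether the padding pays off depends on the machine, so it is worth measuring with and without it on the largest box available.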


 


To some extent, the CLR can help you here.
On MP machines, we use lock-free allocators which (statistically)
guarantee locality for each thread’s allocations. Any compaction will (statistically)
preserve that locality. Looking into the very far future (perhaps after our sun explodes), you could imagine a
CLR that can reorganize your data structures to achieve even better
performance.


 


This means that if you are writing single-threaded managed code to
process a server request, and if you can avoid writing to any shared state, you
are probably going to be pretty scalable without even trying.


 


Getting back to memory models, what is the abstraction that will make sense of current
hardware? It’s a simplifying model
where all the cache levels disappear.
We pretend that all the CPUs are attached to a single shared memory. Now we just need to know whether all the
CPUs see the same state in that memory, or if it’s possible for some of them to
see reordering in the loads and stores that occur on other CPUs.


 


At one extreme, we have a world where all the CPUs see a single consistent memory. All the loads and stores expressed in
programs are performed in a serialized manner and nobody perceives a particular
thread’s loads or stores being reordered.
That’s a wonderfully sane model which is easy for software developers to
comprehend and program to.
Unfortunately, it is far too slow and non-scalable. Nobody builds this.


 


At the other extreme, we have a world where CPUs operate almost entirely out of private
cache. If another CPU ever sees
anything my CPU is doing, it’s a total accident of timing. Because loads and stores can propagate
to other CPUs in any random order, performance and scaling are great. But it is impossible for humans to
program to this model.


 


In between those extremes are a lot of different possibilities. Those possibilities are explained in
terms of acquire and release semantics:


 



  • A normal load or store can be freely reordered with respect
    to other normal load or store operations.

  • A load with acquire semantics creates a downwards
    fence. This means that normal
    loads and stores can be moved down past the load.acquire, but nothing can be
    moved to above the load.acquire.

  • A store with release semantics creates an upwards
    fence. This means that normal
    loads and stores can be moved above the store.release, but nothing can be
    moved to below the store.release.

  • A full fence is effectively an upwards and downwards
    fence. Nothing can move in either
    direction across a full fence.
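To make the acquire and release rules concrete, here is a minimal sketch of one thread publishing a value to another. It assumes a .NET version that provides System.Threading.Volatile, whose Read and Write methods have acquire and release semantics respectively. The names PublishSketch, payload and ready are hypothetical.

```csharp
using System;
using System.Threading;

static class PublishSketch
{
    static int payload;   // ordinary data, written with a normal store
    static bool ready;    // flag, touched only through Volatile.Read / Volatile.Write

    public static int RunOnce()
    {
        payload = 0;
        ready = false;
        int observed = -1;

        var consumer = new Thread(() =>
        {
            // Load with acquire semantics: the read of payload below
            // cannot be moved above this load.
            while (!Volatile.Read(ref ready))
                Thread.SpinWait(1);
            observed = payload;
        });
        consumer.Start();

        payload = 42;                      // normal store
        // Store with release semantics: the store to payload above
        // cannot be moved below this store.
        Volatile.Write(ref ready, true);

        consumer.Join();
        return observed;
    }

    static void Main()
    {
        Console.WriteLine(RunOnce());
    }
}
```

If both accesses to the flag were normal loads and stores, a sufficiently weak model would allow the consumer to see ready as true while still reading a stale payload.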

 


A super-strong extreme model puts a full fence after every load or store. A super-weak extreme model uses normal
loads and stores everywhere, with no fencing.


 


The most familiar model is X86. It’s a
relatively strong model. Stores are
never reordered with respect to other stores. But, in the absence of data dependence,
loads can be reordered with respect to other loads and stores. Many X86 developers don’t realize that
this reordering is possible, though it can lead to some nasty failures under
stress on big MP machines.


 


In terms of the above, the memory model for X86 can be described as:


 



  1. All stores are actually store.release.

  2. All loads are normal loads.

  3. Any use of the LOCK prefix (e.g. ‘LOCK CMPXCHG’ or ‘LOCK
    INC’) creates a full fence.

 


Historically, Windows NT has run on Alpha and MIPS computers.


 


Looking forward, Microsoft has announced that Windows will support Intel’s IA64 and
AMD’s AMD64 processors. Eventually,
we need to port the CLR to wherever Windows runs. You can draw an obvious conclusion from
these facts.


 


AMD64
has the same memory model as X86.


 


IA64 specifies a weaker memory model than X86.
Specifically, all loads and stores are normal loads and stores. The application must use special ld.acq
and st.rel instructions to achieve acquire and release semantics. There’s also a full fence instruction,
though I can’t remember the opcode (mf?).


 


Be especially skeptical when you read the next paragraph:


 


There’s some reason to believe that current IA64 hardware actually implements a stronger
model than is specified. Based on
informed hearsay and lots of experimental evidence, it looks like normal store
instructions on current IA64 hardware are retired in order with release
semantics.


 


If this is indeed the case, why would Intel specify something weaker than what they have
built? Presumably they would do
this to leave the door open for a weaker (i.e. faster and more scalable)
implementation in the future.


 


In fact, the CLR has done exactly the same thing.
Section 12.6 of Partition I of the ECMA CLI specification explains our
memory model. This explains the
alignment rules, byte ordering, the atomicity of loads and stores, volatile
semantics, locking behavior, etc.
According to that specification, an application must use volatile loads
and volatile stores to achieve acquire and release semantics. Normal loads and stores can be freely
reordered, as seen by other CPUs.


 


What is the practical implication of this?
Consider the standard double-checked locking protocol:


 


if (a == null)
{
  lock(obj)
  {
    if (a == null) a = new A();
  }
}


 


This is a common technique for avoiding a lock on the read of ‘a’ in the typical
case. It works just fine on
X86. But it would be broken by a
legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA
spec, acquiring a lock has acquire semantics and releasing a lock has release
semantics.


 


However, we have to assume that a series of stores have taken place during construction
of ‘a’. Those stores can be
arbitrarily reordered, including the possibility of delaying them until after
the publishing store which assigns the new object to ‘a’. At that point, there is a small window
before the store.release implied by leaving the lock. Inside that window, other CPUs can
navigate through the reference ‘a’ and see a partially constructed
instance.


 


We could fix this code in various ways. For
example, we could insert a memory barrier of some sort after construction and
before assignment to ‘a’. Or, if
construction of ‘a’ has no side effects, we could move the assignment outside
the lock, and use an Interlocked.CompareExchange to ensure that assignment only
happens once. The GC would collect
any extra ‘A’ instances created by this race.
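Both repairs can be sketched roughly as follows. This is one possible rendering, not the only one: the first variant gets its barrier by marking the field volatile (so the publishing store is a release and the unlocked read is an acquire), and the second moves construction outside any lock and uses Interlocked.CompareExchange. The class A with a Value field, the LazyA holder, and the second field ‘b’ are hypothetical names introduced for the example, and Volatile.Read assumes a .NET version that provides System.Threading.Volatile.

```csharp
using System;
using System.Threading;

sealed class A
{
    public readonly int Value = 7;   // hypothetical state stored during construction
}

static class LazyA
{
    static readonly object obj = new object();

    // volatile: loads of 'a' are acquire, stores to 'a' are release, so the
    // constructor's stores cannot be delayed past the publishing store.
    static volatile A a;

    // Variant 1: double-checked locking with a volatile publishing store.
    public static A GetWithLock()
    {
        if (a == null)
        {
            lock (obj)
            {
                if (a == null) a = new A();
            }
        }
        return a;
    }

    static A b;

    // Variant 2: no lock. Several threads may construct an A, but only one
    // assignment wins; the losers' instances become garbage.
    public static A GetWithCompareExchange()
    {
        A current = Volatile.Read(ref b);
        if (current == null)
        {
            var candidate = new A();
            // CompareExchange is a full fence. It stores candidate only if b is
            // still null and returns the value b held before the call.
            current = Interlocked.CompareExchange(ref b, candidate, null) ?? candidate;
        }
        return current;
    }

    static void Main()
    {
        Console.WriteLine(ReferenceEquals(GetWithLock(), GetWithLock()));
        Console.WriteLine(ReferenceEquals(GetWithCompareExchange(), GetWithCompareExchange()));
    }
}
```

The second variant is only appropriate when constructing a redundant A is harmless, which is the “no side effects” condition above.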


 


I hope that this example has convinced you that you don’t want to try writing reliable
code against the documented CLI model.


 


I wrote a fair amount of “clever” lock-free thread-safe code in version 1 of the
CLR. This included techniques like
lock-free synchronization between the class loader, the prestub (which traps
first calls on methods so it can generate code for them), and AppDomain
unloading so that I could back-patch MethodTable slots efficiently. But I have no desire to write any kind
of code on a system that’s as weak as the ECMA CLI spec.


 


Even if I tried to write code that is robust under that memory model, I have no hardware
that I could test it on. X86, AMD64
and (presumably) IA64 are stronger than what we specified.


 


In my opinion, we screwed up when we specified the ECMA memory model. That model is unreasonable
because:


 



  • All stores to shared memory really require a volatile
    prefix.

  • This is not a productive way to code.

  • Developers will often make mistakes as they follow this
    onerous discipline.

  • These mistakes cannot be discovered through testing,
    because the hardware is too strong.

 


So what would make a sensible memory model for the CLR?


 


Well, first we would want to have a consistent model across all CLI
implementations. This would include
the CLR, Rotor, the Compact Frameworks, SPOT, and (ideally) non-Microsoft
implementations like Mono. So
putting a common memory model into an ECMA spec was definitely a good
idea.


 


It goes without saying that this model should be consistent across all possible
CPUs. We’re in big trouble if
everyone is testing on X86 but then deploying on Alpha (which had a notoriously
weak model).


 


We would also want to have a consistent model between the native code generator (JIT or
NGEN) and the CPU. It doesn’t make
sense to constrain the JIT or NGEN to order stores, but then allow the CPU to
reorder those stores. Or vice
versa.


 


Ideally, the IL generator would also follow the same model. In other words, your C# compiler should
be allowed to reorder whatever the native code generator and CPU are allowed to
reorder. There’s some debate
whether the converse is true.
Arguably, it is okay for an IL generator to apply more aggressive
optimizations than the native code generator and CPU are permitted, because IL
generation occurs on the developer’s box and is subject to testing.


 


Ultimately, that last point is a language decision rather than a CLR
decision. Some IL generators, like
ILASM, will rigorously emit IL in the sequence specified by the source
code. Other IL generators, like
Managed C++, might pursue aggressive reordering based on their own language
rules and compiler optimization switches.
If I had to guess, IL generators like the Microsoft compilers for C# and
VB.NET would decide to respect the CLR’s memory model.


 


We’ve spent a lot of time thinking about what the right memory model for the CLR
should be. If I had to guess, we’re
going to switch from the ECMA model to the following model. I think that we will try to persuade
other CLI implementations to adopt this same model, and that we will try to
change the ECMA specification to reflect this.


 



  1. Memory ordering only applies to locations which can be
    globally visible or locations that are marked volatile. Any locals that are not address
    exposed can be optimized without using memory ordering as a constraint, since
    these locations cannot be touched by multiple threads in parallel.

  2. Non-volatile loads can be reordered freely.

  3. Every store (regardless of volatile marking) is considered
    a release.

  4. Volatile loads are considered acquire.

  5. Device-oriented software may need special programmer
    care. Volatile stores are still
    required for any access of device memory. This is typically not a concern for
    the managed developer.

 


If you’re thinking this looks an awful lot like X86, AMD64 and (presumably) IA64,
you are right. We also think it
hits the sweet spots for compilers.
Reordering loads is much more important for enabling optimizations than
reordering stores.


 


So what happens in 10 years when these architectures are gone and we’re all using
futuristic Starbucks computers with an ultra-weak model? Well, hopefully I’ll be living the good
life in retirement on Maui. But the CLR’s native code generators
will generate whatever instructions are necessary to keep stores ordered when
executing your existing programs.
Obviously this will sacrifice some performance.


 


The trade-off between developer productivity and computer performance is really an
economic one. If there’s sufficient
incentive to write code to a weak memory model so it can execute efficiently on
future computers, then developers will do so. At that point, we will allow them to
mark their assemblies (or individual methods) to indicate that they are “weak
model clean”. This will permit the
native code generator to emit normal stores rather than store.release
instructions. You’ll be able to
achieve high performance on weak machines, but this will always be “opt
in”. And we won’t build this
capability until there’s a real demand for it.


 


I personally believe that for mainstream computing, weak memory models will never
catch on with human developers.
Human productivity and software reliability are more important than the
increment of performance and scaling these models provide.


 


Finally, I think the person asking about memory models was really interested in where he
should use volatile and fences in his code. Here’s my advice:


 



  • Use managed locks like Monitor.Enter (C# lock / VB.NET
    SyncLock) for synchronization, except where performance really requires you to
    be “clever”.

  • When you’re being “clever”, assume the relatively strong
    model I described above. Only
    loads are subject to re-ordering.

  • If you have more than a few places that you are using
    volatile, you’re probably being too clever. Consider backing off and using managed
    locks instead.

  • Realize that synchronization is expensive. The full fence implied by
    Interlocked.Increment can be many 100’s of cycles on modern hardware. That penalty may continue to grow, in
    relative terms.

  • Consider locality and caching effects like hot spots due to
    false sharing.

  • Stress test for days with the biggest MP box you can get
    your hands on.

  • Take everything I said with a grain of
    salt.
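As a small illustration of the first and fourth points, the sketch below counts with a managed lock and then with Interlocked.Increment. Both give the same total, and both pay for synchronization on every increment. The name CounterSketch and its parameters are hypothetical, and nothing here measures the actual cost, which depends on your hardware.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class CounterSketch
{
    // The default choice: a managed lock (Monitor.Enter / Monitor.Exit).
    public static int CountWithLock(int workers, int perWorker)
    {
        object gate = new object();
        int count = 0;
        Parallel.For(0, workers, _ =>
        {
            for (int i = 0; i < perWorker; i++)
                lock (gate) { count++; }
        });
        return count;
    }

    // The "clever" choice: every Interlocked.Increment implies a full fence.
    public static int CountWithInterlocked(int workers, int perWorker)
    {
        int count = 0;
        Parallel.For(0, workers, _ =>
        {
            for (int i = 0; i < perWorker; i++)
                Interlocked.Increment(ref count);
        });
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountWithLock(4, 100000));
        Console.WriteLine(CountWithInterlocked(4, 100000));
    }
}
```

When the per-item synchronization dominates, it is usually better to accumulate into a per-thread local and combine once at the end than to tune the primitive.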
