Chapter 2. Management of DT guest-to-tcode address translations


As the DT engine decodes and translates sequences of guest instructions,
it places the translated sequences in a tcode (translated code) buffer.
It must also store the pairing of the original guest address and the
generated tcode address, both for future reference by the DT engine and
for effecting branches by special handler routines.  Because of dynamic
(computed) branch targets and out-of-area static branch targets, some
special, efficient branch handling routines must be able to look up
guest addresses and find the corresponding tcode addresses (if
translated code exists) at a high duty rate.

The constructs used to hold and access such address translation
pairings must therefore be extremely efficient.

Because plex86 provides a complete level of PC virtualization,
including user/system privilege level changes, page table changes,
multiple processes, reloads of the code segment (CS), etc., there
are some additional considerations imposed on it with respect to
managing guest instruction addresses and the validity of the
corresponding tcode.  All of these events and capabilities have to be
accounted for in the translation storage scheme, while still providing
the best performance possible.  Many of these are issues which a
user-level-only DT scheme would not have to deal with.

Linear to tcode address mapping


The first diagram depicts the translation of a guest linear instruction
address to a tcode address using a hash table.  Imagine that for
this example, we are executing some tcode which represents a guest
computed branch instruction.  After the tcode computes the target address,
a special branch handler routine is called to perform the lookup and
effect the transfer.

A hash on the 32-bit linear address yields a row number in the
hash table.  Which bits of the address are used, and how many, is still
open for discussion.  Instrumentation data gathered from real
software running in plex86 should also provide some useful insights here.
In any case, let's assume that we use N bits of the address,
and thus the table height H is 2^N.
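
As a rough sketch of that hash, in C: the value of N here and the
choice of which bits to use are placeholders rather than settled
values, and Bit32u is just an assumed 32-bit unsigned typedef.

  typedef unsigned int Bit32u;            /* assumed 32-bit unsigned type */

  #define HASH_BITS   10                  /* N: placeholder value         */
  #define HASH_HEIGHT (1 << HASH_BITS)    /* H = 2^N rows in the table    */

  /* One possible hash: take N bits of the guest linear address.
   * Which bits to use (here we skip the low two) is still open.
   */
  static inline unsigned
  hash_linear(Bit32u laddr)
  {
    return (laddr >> 2) & (HASH_HEIGHT - 1);
  }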

Each row contains a set of W (the table width) address pairs.
We search iteratively through the set, looking for the original
address.  If it is found, then the second quantity in the pair provides
the corresponding tcode address, which the branch handler can use
to effect the branch.  We would likely want to choose a value for W
such that the size of the set aligns on a native CPU cache-line
boundary.  For example, each pair holds two 4-byte addresses (8 bytes).
Since the Pentium has a 32-byte cache line, obvious values for W would
be 4 (1 cache line) or 8 (2 cache lines).  The idea is to have the CPU
burst in as few cache lines as possible during the search for the
translated address.
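
To make the layout concrete, here is a minimal sketch assuming W=4,
so that one row (set) occupies exactly one 32-byte cache line; all of
the names are illustrative, and 0 is assumed never to be a valid
tcode address.

  #define SET_WIDTH 4     /* W: 4 pairs x 8 bytes = one 32-byte cache line */

  typedef struct {
    Bit32u guest_laddr;   /* original guest linear address */
    Bit32u tcode_addr;    /* corresponding tcode address   */
  } addr_pair_t;          /* one 8-byte pair */

  typedef struct {
    addr_pair_t pair[SET_WIDTH];   /* one row (set) of the hash table */
  } hash_row_t;

  static hash_row_t l2t_hash[HASH_HEIGHT];

  /* Search the set for a guest address; return its tcode address,
   * or 0 on a miss.
   */
  static Bit32u
  l2t_lookup(Bit32u laddr)
  {
    hash_row_t *row = &l2t_hash[hash_linear(laddr)];
    unsigned i;

    for (i = 0; i < SET_WIDTH; i++) {
      if (row->pair[i].guest_laddr == laddr)
        return row->pair[i].tcode_addr;
    }
    return 0;   /* miss: fall back to the slower paths described later */
  }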

Another simple performance tweak at this point would be to migrate the
matching pair to the left (in the diagram), giving it a higher priority
for the next hash search.  Again, some real instrumentation data
would be ideal here for tuning the algorithm.
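
Expressed as a fragment of the hypothetical l2t_lookup() hit case
above, one form of that tweak might be the following; whether to nudge
the pair one slot or move it fully to the front is exactly the kind of
question instrumentation data would answer.

    if (row->pair[i].guest_laddr == laddr) {
      /* Hit: nudge the matching pair one slot toward the front of the
       * set, so frequently used targets are found earlier next time.
       */
      if (i > 0) {
        addr_pair_t tmp  = row->pair[i - 1];
        row->pair[i - 1] = row->pair[i];
        row->pair[i]     = tmp;
        return row->pair[i - 1].tcode_addr;
      }
      return row->pair[i].tcode_addr;
    }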

The other factor influencing the hash table width W is the
desire to prevent overflows in any given row (set).  A larger
value of W reduces overflows, but also increases the memory
requirements and touches more cache lines.

There are some less favorable attributes of such a hash table,
depending on the hash function used.  If the addresses within a row are
unrelated to each other, then searches for related addresses will be
scattered throughout the table, not utilizing cache lines optimally,
especially if the table is large.  If related addresses all hash to the
same row, then overflows may occur frequently.

At any rate, the diagram depicts the ideal situation, one which
we would like to occur most of the time: a hash lookup quickly
determines the corresponding tcode address so the branch can commence.
There are, of course, several situations where the lookup may not
succeed.  An obvious one is when a previous set overflow occurred,
losing an address pair.  If overflows become problematic, we could
look into overflow strategies in which other non-full sets are
utilized.

As previously mentioned, plex86 has some extra considerations due
to context switches and other events occurring in the VM which
bear significantly on the validity of our mappings.  For instance,
consider what happens when the time slice for one user program ends
and the guest OS schedules another user program to run.  Each process
requires its own page tables; perhaps some pages are shared (shared
memory), but each process has a number of linear page mappings of
its own.

Each process may have different code executing at a particular
linear address, yet our hash table is keyed only by linear address.
Thus we must flush this hash table each time the page tables are
modified.  It can be rebuilt dynamically as each process executes
in its time slice.
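
Since this table is purely a cache, the flush itself can be trivial;
a sketch, reusing the hypothetical l2t_hash from the earlier fragment:

  #include <string.h>

  /* Invalidate every cached linear-to-tcode pairing, e.g. when the
   * guest modifies its page tables or a new process is scheduled.
   */
  static void
  l2t_flush(void)
  {
    memset(l2t_hash, 0, sizeof(l2t_hash));
  }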

For the sake of generating efficient code and handling branches
efficiently, it is quite helpful to target the generated code to work
under certain constraints, so that many checks can be done once at
translation time rather than continuously at execution time.  For
example, it is useful to target tcode to work only for the current
code segment descriptor values, such as the base and limit, for the
current privilege level, and for other such values.  A change in the
VM environment which is contrary to any of these constraints dictates
a flush of this guest-to-tcode address table, since the tcode was
generated to operate under assumptions which are no longer correct.
For example, a transition from user code (ring3) to kernel code
(ring0) would require a table flush.
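
Purely as an illustration, such constraints could be recorded in a
small structure when the tcode is generated and compared against the
current VM state before the tcode is reused; every field name here is
hypothetical.

  /* Assumptions the tcode was generated under.  If the current VM
   * state no longer matches, the guest-to-tcode address table is
   * flushed (the tcode itself is kept; see below).
   */
  typedef struct {
    Bit32u   cs_base;    /* code segment descriptor base             */
    Bit32u   cs_limit;   /* code segment descriptor limit            */
    unsigned cpl;        /* current privilege level (ring3 vs ring0) */
    unsigned flags;      /* other constraints, e.g. default operand size */
  } tcode_constraints_t;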

That is _not_ to say that we flush the actual tcode!  To do so would
mean continually retranslating the same active code every time a
process got rescheduled to run.  We only flush our high-performance
address translation buffer, which we can rebuild dynamically the next
time the process executes.  A certain amount of overhead will be
incurred rebuilding entries in the table as addresses are newly
encountered, but after each address is re-added, translations will
operate at high performance.

We therefore need an architecture that satisfies the following
requirements, while remaining sensitive to memory use and processor time:

  - Allow for a decoupling of the linear-to-physical memory mappings
    from a code page with associated tcode (because other processes
    get time slices and have different linear-to-physical mappings).
  - Allow for a re-coupling of the same mappings, so tcode can be reused
    when the process gets rescheduled.
  - Maintain the set of constraints for each page of translated code,
    so we can identify the tcode as being valid for the current code.
  - Maintain an efficient lookup mechanism, preferably without loss,
    so that page-offset-to-tcode-address mappings are stored and survive
    across linear address decoupling/re-coupling.  Loss means we would
    have to retranslate some addresses, potentially wasting buffer space.
  - Handle writes to code pages.
  - Efficiently handle complete code page dumps, along with freeing of
    the tcode buffer area and lookup table space.  DMA emulation, for
    example, necessitates this.

This next figure depicts a scenario which handles most or all of
these requirements, and which works in conjunction with the hash table
in the first figure.

Linear to meta index mapping


There are two main focal points here.  In the middle of the figure
is a small hash table which performs much the same function as the
instruction TLB does in the CPU - caching linear page translations
rather than dereferencing the multi-level page tables constantly.
However, instead of caching a linear-to-physical pairing, our table
holds a linear-to-index pairing.  The index is an index into an array
of structures, each holding meta information relating to a page of
memory which has associated tcode.

As you can see, this strategy is highly page-centric, as each element
of the meta info array is specific to one page of memory.  I'll come
back to that.
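
A rough sketch of those two constructs follows; the names, field
widths, pool size and the 16-entry L0 array are illustrative, and
tcode_constraints_t is the hypothetical structure from the earlier
sketch.

  /* One entry of the TLB-like hash table: linear page -> meta index. */
  typedef struct {
    Bit32u linear_page;   /* guest linear address of the page (page aligned) */
    Bit32u meta_index;    /* index into page_meta[] below                    */
  } tlb_entry_t;

  /* Meta information for one page of guest memory with associated tcode. */
  typedef struct {
    Bit32u              linear_page;  /* page this element describes        */
    tcode_constraints_t constraints;  /* constraints the tcode assumes      */
    Bit32u              l0[16];       /* level 0 of the sparse offset table */
  } page_meta_t;

  #define META_POOL_SIZE 1024                   /* illustrative pool size */
  static page_meta_t page_meta[META_POOL_SIZE];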

Assuming that a search for the linear address did not produce a match
in the tcode address hash table presented in the first figure, we
would then look up the meta info index relating to this page.  If any
translation has been done for this page, there may well be an entry
stored in this table; the table is not concerned with the page offset
(the lower 12 bits of the address).  If a matching index is not
found, then the branch handler or DT logic would have to defer to
another method to locate the meta index (if one exists yet).  Following
are a couple of methods for doing this:

  1) Since there is also a small amount of meta information relating
     to each physical page of guest memory (entirely another structure),
     we can store the meta index in there for translated code pages.
  2) Brute-force search of the linear page address field in each
     meta info structure.

If the code page has not been seen before, then a new meta info
element can be allocated from a pool.  In any case, once a meta index
is located, it can be added to the TLB hash table.
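
A sketch of that resolution order is below; tlb_lookup(),
phys_page_meta_index(), meta_array_scan(), meta_alloc() and
tlb_insert() are all hypothetical helpers, shown only to make the
flow concrete.

  /* Hypothetical helpers, declared only to show the flow. */
  extern int    tlb_lookup(Bit32u page, Bit32u *idx);
  extern int    phys_page_meta_index(Bit32u page, Bit32u *idx);
  extern int    meta_array_scan(Bit32u page, Bit32u *idx);
  extern Bit32u meta_alloc(Bit32u page);
  extern void   tlb_insert(Bit32u page, Bit32u idx);

  /* Find (or create) the meta info index for the page containing laddr. */
  static Bit32u
  resolve_meta_index(Bit32u laddr)
  {
    Bit32u page = laddr & 0xfffff000;
    Bit32u idx;

    if (tlb_lookup(page, &idx))            /* fast path: TLB hash table   */
      return idx;

    if (phys_page_meta_index(page, &idx)   /* 1) per-physical-page meta   */
        || meta_array_scan(page, &idx))    /* 2) brute-force array search */
    {
      tlb_insert(page, idx);
      return idx;
    }

    idx = meta_alloc(page);                /* page not seen before        */
    tlb_insert(page, idx);
    return idx;
  }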

Let's then assume that a matching index is found.  Each meta info
structure uses a sparse lookup table to map each of the 4096 possible
offsets to a corresponding tcode address, wherever one exists.
Certainly a lot of code pages will not require a densely populated
lookup table, because only a small amount of the code in the page will
be executed.  This is why I chose the 3-level sparse lookup table.
Each level of the table corresponds to 4 bits of the page offset,
and simply contains an array of pointers (or just pool indices) to
the next level.  The last level (L2) contains the actual tcode address.

It would make sense to store the first-level array (L0) in the meta
info page, since that level is always needed.  The other levels
(essentially just 16-element arrays) can be allocated on demand from a
pool, and deallocated fairly easily too.  Thus, the storage requirements
of a given code page grow according to the amount of code we have
translated for it.
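
A minimal sketch of the lookup itself, assuming that frames are
16-element arrays drawn from a pool and that pool index 0 means
"not allocated":

  typedef struct { Bit32u entry[16]; } frame_t;   /* one level: 16 elements */
  extern frame_t frame_pool[];                    /* allocated on demand    */

  /* Map a 12-bit page offset to its tcode address via the 3-level
   * sparse table rooted in the meta info element; returns 0 if any
   * level along the way has not been allocated/populated yet.
   */
  static Bit32u
  sparse_lookup(const page_meta_t *meta, unsigned offset)
  {
    Bit32u l1, l2;

    l1 = meta->l0[(offset >> 8) & 0xf];               /* L0: bits 11..8 */
    if (!l1) return 0;
    l2 = frame_pool[l1].entry[(offset >> 4) & 0xf];   /* L1: bits 7..4  */
    if (!l2) return 0;
    return frame_pool[l2].entry[offset & 0xf];        /* L2: bits 3..0  */
  }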

As I mentioned, the last level of indirection in this table lookup
produces a tcode address (assuming the instruction has been translated).
We then add the pairing of the source linear address and this tcode
address to the linear-to-tcode hash table.  From then on, lookups of
this target will occur at high performance.

If there is no corresponding tcode, then we need to defer to the
DT engine, which can begin translating the new target code into
the tcode buffer and populate the sparse offset tables accordingly.
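
Tying the pieces together, the slow path behind a hash miss might look
roughly like this; page_meta[], resolve_meta_index() and sparse_lookup()
continue the earlier sketches, while dt_translate() and l2t_insert()
are hypothetical.

  extern Bit32u dt_translate(Bit32u laddr);           /* invoke the DT engine */
  extern void   l2t_insert(Bit32u laddr, Bit32u tc);  /* add pair to hash     */

  /* Called by the branch handler when the fast linear-to-tcode hash
   * lookup misses.
   */
  static Bit32u
  branch_slow_path(Bit32u laddr)
  {
    Bit32u idx   = resolve_meta_index(laddr);
    Bit32u tcode = sparse_lookup(&page_meta[idx], laddr & 0xfff);

    if (!tcode)
      tcode = dt_translate(laddr);   /* translate the new target code   */

    l2t_insert(laddr, tcode);        /* cache the pairing for next time */
    return tcode;
  }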


The page constraints
====================

Before adding entries to either hash table, or translating new code,
it is important that we observe certain rules.  For example, before
adding to the TLB hash table, the page permissions should be checked
to make sure that the current privilege level has read (and thus
execute) permission on the given page.  If so, then adding an
entry to the tables means that we don't have to check again, though
we will have to be careful to flush such entries across context
switches.

Likewise, when we are looking at a given branch target, we can use
the page constraints to determine, based on the CS descriptor value,
whether the target is accessible.  If so, then once we add it to the
translation hash table, no further checks are necessary.  Again, we
have to be careful to flush that buffer on context switches.

So the page constraints allow us to re-couple and revalidate translated
code pages with respect to their associated linear addresses without
redoing any code translation.  Once the linear-page-to-meta-index
mapping is re-coupled (added to the TLB), fine-grained linear addresses
can be re-coupled via the sparse lookup tables in the meta page, and
the address pairings stored in the linear-to-tcode hash table.  After
a certain amount of rebuilding overhead following a context switch,
the lookup tables will operate at full performance again.

I chose this technique to be page-centric, since the re-coupling
can then simply mirror the native paging system.  We may be able to
extend the meta info structure to handle (possibly aligned) groups
of N pages.


The tcode buffer storage scheme
===============================

We need an efficient method of storage for new tcode fragments.
This storage method must work well with the page-oriented techniques
expressed so far.  Additionally, because tcode may generate
exceptions in situations where the original guest code would have
done so, we need the ability to map from any tcode address to
the associated guest instruction address.  This mapping will let
us handle the exception at the proper guest address.

The need for this extra reverse address mapping adds yet another
requirement to our list, so we need a solution which is efficient
in both storage and speed.

The following figure depicts a solution which is well suited for
the given needs.

TCode storage and mapping to iaddr


The tcode buffer is actually just a contiguous region of memory, which
can be thought of as a series of 256-byte chunks.  These "chunks" are
allocated on demand for the storage of tcode associated with actual
4096-byte pages of guest code.  Once enough code from a given page of
guest code has been translated to fill a chunk, another chunk is
allocated.  This lets tcode storage be highly associated with actual
pages, and yet not waste an entire 4096-byte page of storage for the
many code pages whose code will be only sparsely translated.

We then need to store some meta information regarding each tcode
chunk, to facilitate a reverse lookup from a tcode address to the real
instruction address.  The graphic details some of the pieces of meta
information that we need to maintain.  For instance, the 'next
available index' keeps track of the free space in the chunk.  The
'page meta index' backlink points to the element in the meta info
array (as depicted in the previous graphic, entitled "Translating from
linear page addr to page meta info index").  And we also store a
pointer to (or possibly the first-level frame of) a sparse table for
the reverse lookup.

For a first effort, and for consistency with the forward lookup tables
(guest instruction address to tcode address), I chose each level of
the table to cover 4 bits of address.  With this, only two levels
are needed to cover the 256-byte (8-bit) tcode chunk address space.
Other combinations are possible, but the lookup table parameters
need to match the tcode chunk size.
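
As a sketch, the per-chunk meta information and the two-level reverse
lookup might look like this; the field names, and the reuse of the
16-element frames and the 0-means-empty convention, are assumptions
carried over from the earlier sketches.

  #define TCODE_CHUNK_SIZE 256   /* bytes of tcode per chunk */

  /* Meta information kept for each tcode chunk. */
  typedef struct {
    unsigned next_avail;       /* next free byte within the chunk        */
    Bit32u   page_meta_index;  /* backlink into the page meta info array */
    Bit32u   rev_l0[16];       /* level 0 of the reverse lookup table    */
  } chunk_meta_t;

  /* Map an 8-bit offset within a chunk back to the guest instruction
   * address whose translation lives there; 0 means not recorded.
   */
  static Bit32u
  chunk_reverse_lookup(const chunk_meta_t *cm, unsigned chunk_offset)
  {
    Bit32u l1 = cm->rev_l0[(chunk_offset >> 4) & 0xf];  /* bits 7..4 */
    if (!l1) return 0;
    return frame_pool[l1].entry[chunk_offset & 0xf];    /* bits 3..0 */
  }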


Storage of sparse table frames
==============================

Each level of the sparse tables is represented by a data structure
which I'll call a 'frame'.  For each of the initial levels, each element
of the frame is a pointer to another frame.  The last level contains
elements which represent the data or pointer we are looking for.
More talk about storage of these data structures to come...