Chapter 1. A quasi dynamic translation architecture for plex86: supplanting SBE


A quick overview of the Scan Before Execute (SBE) strategy
==========================================================

The SBE strategy was chosen as our initial implementation because
it was simpler to implement, and thus less prone to obscure
bugs.  In a nutshell, SBE works on a page-by-page basis, dynamically
creating virtualized code pages which mirror the real code pages.  The
only modification currently necessary is to replace the first byte of
each instruction we want to virtualize with an INT3 instruction.  The rest
of the instructions are identical.  We use a trick which exploits the
split I&D TLB caches to temporarily place the modified code page in the
address space where the real code page lives, and effectively makes the
page 'execute-only', an attribute not naturally provided by IA32
paging.  This simplifies things considerably.
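As a rough illustration, the page-mirroring step might look like the
following C sketch.  All names here are illustrative, not plex86's
actual ones, and the offsets of the instructions needing virtualization
are assumed to come from a prior scan pass:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define OPCODE_INT3 0xCC

/* Mirror a guest code page into a shadow ("virtualized") page,
   replacing the first byte of each instruction that must be
   virtualized with INT3, so executing it traps into the monitor. */
void sbe_virtualize_page(const uint8_t *guest_page, uint8_t *shadow_page,
                         const unsigned *virt_offsets, unsigned n_offsets)
{
    /* The rest of the instructions are byte-identical to the guest's. */
    memcpy(shadow_page, guest_page, PAGE_SIZE);
    for (unsigned i = 0; i < n_offsets; i++)
        shadow_page[virt_offsets[i]] = OPCODE_INT3;
}
```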

As I mentioned, the main benefit of this strategy is its simplicity.
The 'translated' code is merely an echo of the real code, with
breakpoint instructions replacing the instructions which need to be
virtualized.  Code offsets are maintained, so there is no need to
store meta-information about where translated code fragments live.

The downside of this strategy is that the technique for virtualizing
certain instructions (replacing with INT3) is performance expensive.
Each execution of that instruction raises an exception, which requires
a transition from ring3 (where the guest executes) to ring0 (where the
monitor executes), processing in the monitor, and a return transition back
to the guest.  Since this INT3 is not specific to any particular
virtualized instruction, processing has to be more generic and thus
even more expensive.  If there are a large number of out-of-page
branches (for example, C function calls outside of the page) or
calculated branches (for example, optimized switch statements or
dereferenced function calls), then the performance penalties can be
substantial.


Quasi dynamic translation: allowing for more flexible generated code
====================================================================

Rather than generating modified (breakpointed) code which is very
rigid in the sense that instructions are always placed at exactly
the same offsets as the original code, I propose here a "quasi dynamic
translation" scheme, which can allow for more flexibility and performance
in the generated code.  Here, the generated code sequence for a given
guest instruction can be arbitrary.  This allows us the flexibility
to insert calls to handler code, rather than generate traps to
the VM monitor, as well as a place to add instrumentation in the
future.  However, the general design philosophy is still to generate
code which is a simple 1:1 translation whenever possible.  In other
words, we just pass the instruction through unmodified.
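The translation loop might be sketched as follows.  This is a
simplified model, not plex86 code: the decoder is a toy (one byte per
instruction, with opcode 0x0F standing in for "needs virtualization"),
and the handler call is emitted with a zero displacement to be patched
later:

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    int len;         /* instruction length in bytes */
    int virtualize;  /* does it need a handler call? */
} insn_t;

/* Toy decoder for this sketch; a real one decodes IA32 instructions. */
insn_t decode_insn(const uint8_t *p)
{
    insn_t in = { 1, p[0] == 0x0F };
    return in;
}

/* Translate 'n' guest bytes into 'out'.  Pass-through instructions are
   copied verbatim (the 1:1 case); instructions needing virtualization
   become a CALL to a common handler (E8 + rel32, zero for brevity). */
int translate(const uint8_t *guest, int n, uint8_t *out)
{
    int gi = 0, oi = 0;
    while (gi < n) {
        insn_t in = decode_insn(&guest[gi]);
        if (!in.virtualize) {
            memcpy(&out[oi], &guest[gi], in.len);  /* 1:1 pass-through */
            oi += in.len;
        } else {
            out[oi] = 0xE8;                        /* CALL rel32 */
            memset(&out[oi + 1], 0, 4);            /* patched later */
            oi += 5;
        }
        gi += in.len;
    }
    return oi;  /* bytes of generated code */
}
```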

Traditional dynamic translation techniques tend to be quite heavy,
especially when emulating IA32, due to modeling of the various
protection checks, 8/16/32-bit register usage, arithmetic
flags handling, scheduling of native registers across static
and dynamic branches, etc.  Thus, for a generic IA32-to-any architecture
dynamic translation, there is often a large amount of code overhead
that needs to be generated to handle certain IA32 code properly.

But, since plex86 is specific to IA32-capable hosts, we can heavily
exploit the natural mechanisms whenever possible, and eliminate
such overhead and complexities in the translation.

A large amount of IA32 code consists of user-oriented instructions
which access the general registers, arithmetic flags, and user segment
registers.  It is quite possible to use these instructions as-is,
largely eliminating the management of such constructs, and without
worrying about the scheduling of resources across branches.

The reason I call it "quasi" dynamic translation, is because we
really don't do much translation at all.

Using the guest data segments directly
======================================

All guest data segments {ES,SS,DS,FS,GS} are virtualized by the
monitor such that they can be used directly by translated code, so
most translated code can be a 1:1 pass-through.  Only
the DPL is currently adjusted.  This lets the native segmentation
limit/access checks do the work for us.
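As a concrete sketch, forcing DPL=3 on a copied guest data descriptor
might look like this.  The struct layout follows the IA32 descriptor
format; the function name is illustrative:

```c
#include <stdint.h>

/* IA32 segment descriptor; the DPL occupies bits 5..6 of the
   access-rights byte (P | DPL(2) | S | type(4)). */
typedef struct {
    uint16_t limit_low;
    uint16_t base_low;
    uint8_t  base_mid;
    uint8_t  access;       /* P | DPL | S | type */
    uint8_t  limit_flags;  /* G/D flags | limit bits 16..19 */
    uint8_t  base_high;
} seg_desc_t;

/* Copy the guest descriptor unchanged except for forcing DPL to 3,
   so ring3 translated code can load it directly while the native
   limit/access checks still apply. */
seg_desc_t virtualize_data_descriptor(seg_desc_t d)
{
    d.access |= (3u << 5);  /* force DPL = 3 */
    return d;
}
```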

Similarly, as long as we are careful that the values of general
registers and the arithmetic flags contain the values expected
by the guest code, we can pass-through instructions which make
use of them.  As such, whenever we generate code which is not a mere
pass-through of guest instructions, we need to save/restore such
guest register state, so that the assumption of native use will be
valid for following code.
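For example, a generated handler call could be bracketed by flag and
register save/restore instructions, so that the pass-through code which
follows again sees the guest's values.  This sketch emits raw 32-bit
opcode bytes (PUSHFD, PUSHAD, CALL, POPAD, POPFD); the function name
and the to-be-patched displacement are illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Emit a handler call wrapped with save/restore of the guest's
   arithmetic flags and general registers.  Returns bytes emitted. */
int emit_guarded_call(uint8_t *out, int32_t rel32)
{
    int i = 0;
    out[i++] = 0x9C;                     /* pushfd: save flags */
    out[i++] = 0x60;                     /* pushad: save registers */
    out[i++] = 0xE8;                     /* call handler */
    memcpy(&out[i], &rel32, 4); i += 4;  /* rel32 displacement */
    out[i++] = 0x61;                     /* popad: restore registers */
    out[i++] = 0x9D;                     /* popfd: restore flags */
    return i;
}
```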

Virtualizing the guest CS segment accesses
==========================================

The guest code segment, CS, needs to be virtualized differently.
Since we would be executing code from a translation buffer, the addresses
generated by accessing the CS segment are unrelated to those in
the real guest CS segment.  In general practice, data accesses via a
'CS:' prefix are not very common.  A simple approach to handling them
is to virtualize these instructions so they are emulated by the monitor.
Perhaps later, we could generate code to handle these instructions
directly, or generate calls to special handler routines.  The main use
of the CS segment is, of course, by branch instructions.

Some branch instructions involve a simple static offset, which is applied
to the CS.base value to determine the linear address of the target code.
Other branch instructions use a calculated (dynamic) offset, which
cannot be determined at translation time.

Dynamic branch targets
======================

We need a mechanism for dynamically determining if there is already
translated code for a given guest target address.  If so, then we can
jump to it.  If not, then code at the target address must be translated
first, followed by a jump to the newly translated code.  The
translation could be performed either in the ring0 monitor space or
by ring3 handler code; the monitor space is a good first-effort place
to do this.

To accomplish this, we can generate code which calls a common branch
handler routine.  This routine would be responsible for saving any
guest state it needs to modify, performing the branch target lookup,
invoking a translation routine if necessary, restoring any modified
guest state, and transferring to the translated code.  It would be a more
efficient use of translated code buffer space and cache-lines to
centralize this sort of functionality into a few optimized handler
routines, rather than inline them in many places in generated code.
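The lookup itself could be a simple hash table mapping guest linear
addresses to translated-code addresses.  The following is a minimal
open-addressed sketch, with names of my own invention:

```c
#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 1024  /* power of two */

typedef struct {
    uint32_t guest_addr;  /* 0 = empty slot (address 0 not mappable) */
    void    *tcode;       /* address of translated code */
} tmap_entry_t;

tmap_entry_t tmap[TABLE_SIZE];

unsigned tmap_hash(uint32_t a) { return (a >> 2) & (TABLE_SIZE - 1); }

/* Record that translated code for 'guest_addr' lives at 'tcode'. */
void tmap_insert(uint32_t guest_addr, void *tcode)
{
    unsigned i = tmap_hash(guest_addr);
    while (tmap[i].guest_addr != 0)       /* linear probing */
        i = (i + 1) & (TABLE_SIZE - 1);
    tmap[i].guest_addr = guest_addr;
    tmap[i].tcode = tcode;
}

/* The branch handler's fast path: NULL means the target must be
   translated first, then inserted and jumped to. */
void *tmap_lookup(uint32_t guest_addr)
{
    unsigned i = tmap_hash(guest_addr);
    while (tmap[i].guest_addr != 0) {
        if (tmap[i].guest_addr == guest_addr)
            return tmap[i].tcode;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;
}
```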

Static branch targets
=====================

A simple-minded approach to handling these, would be to use the
same handling routines as for dynamic branch targets.  This might
be a good initial approach, as it is quite simple, though it does
not produce optimal code.

At the other end of the spectrum, we could first translate code at the
target location, and then generate a direct jump to that translated code.
This eliminates the intermediate branch handling routine.  At first glance,
this seems like an obvious, end-all solution.  However, consider what
happens when the guest modifies its own code.  Such modifications
require an invalidation of the associated translated code.

Imagine that there may be many translated code sequences which make
calls to a routine which has since been modified.  If we had hardcoded
calls to the invalidated routine in the calling code, we would need a
way to propagate the invalidation back to the calling instructions,
and to the instructions which called those, etc.  Thus, if we hardcode
branches, we need to maintain a representation of control flow as we
dynamically translate code, or simply flush all translated code.

There are ideas which lie somewhere in the middle of these two
approaches.  For example, if we look at regions of code as single
entities, then static branches within the region could be generated
simply as static branches to generated code.  Branches outside the
defined region could be coded to use the dynamic branch handler.
Invalidation of code within a given region could then invalidate only
that region of translated code.  This would eliminate the need for
maintenance of control flow information, at the expense of introducing
extra overhead for longer branches.  The notion of a region is arbitrary,
but perhaps some multiple of the 4k page size is convenient.
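The in-region test for a static branch then reduces to comparing
region numbers.  A sketch, assuming 16KB regions (four 4k pages; the
region size is an arbitrary choice):

```c
#include <stdint.h>

#define REGION_SHIFT 14  /* 16KB regions: 4 x 4k pages */

/* If both addresses fall in the same region, the branch may be emitted
   as a direct jump into translated code; otherwise it must go through
   the dynamic branch handler. */
int same_region(uint32_t from, uint32_t to)
{
    return (from >> REGION_SHIFT) == (to >> REGION_SHIFT);
}
```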

The idea of viewing code in atomic regions also has the advantage of
allowing the code regions to be re-optimized as entities, without
interfering with external calling code, for much the same reasons
as code invalidation.

Handling self-modifying code
============================

As in the SBE method, guest code pages for which we have
corresponding translated code are marked with read-only permissions.
The resulting page fault gives us the opportunity to invalidate the
translated code related to that page, or to the specific range of
affected addresses.
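A simplified model of that write-fault path, with the page state held
in plain arrays rather than real page tables (names are illustrative):

```c
#include <stdint.h>

#define NPAGES 1024

int page_write_protected[NPAGES];  /* 1 = protected guest code page */
int page_tcode_valid[NPAGES];      /* 1 = translated code exists for page */

/* On a write fault to a protected guest code page: drop the now-stale
   translated code, lift the protection, and let the guest retry. */
void handle_write_fault(uint32_t guest_addr)
{
    unsigned pfn = (guest_addr >> 12) % NPAGES;
    if (page_write_protected[pfn]) {
        page_tcode_valid[pfn] = 0;      /* invalidate translations */
        page_write_protected[pfn] = 0;  /* allow the write to proceed */
    }
}
```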

Handling self-examining code
============================

This is not a problem, as data accesses can work naturally, and see
the actual untranslated guest code.  Data accesses via the CS: prefix
are virtualized.

Correlating translated instructions with their guest addresses
==============================================================

It is important to maintain information which allows for the correlation
between the addresses of translated code and the associated guest code.
The translation logic needs to convert from guest address to translated
code address, and the exception handling logic needs to convert the
other direction, to find what real guest address effectively caused
the exception.  At least some of this information needs to be mapped
into the address space, such that the special branch handling routines
can function.
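For the exception-handling direction, one possibility is an array of
(translated offset, guest address) records appended in emission order,
searched by binary search for the last record at or before the
faulting offset.  A sketch, under that assumption:

```c
#include <stdint.h>

/* One record per translated guest instruction, appended in increasing
   tcode_off order as code is emitted. */
typedef struct {
    uint32_t tcode_off;   /* offset in the translation buffer */
    uint32_t guest_addr;  /* corresponding guest linear address */
} corr_t;

/* Map a faulting translated-code offset back to the guest address of
   the instruction that produced it (0 if none recorded). */
uint32_t tcode_to_guest(const corr_t *map, int n, uint32_t fault_off)
{
    int lo = 0, hi = n - 1, best = -1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (map[mid].tcode_off <= fault_off) { best = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return best >= 0 ? map[best].guest_addr : 0;
}
```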

An illustrated view of how it would work
========================================

[Figure: Examining control flow -- graphics (a) through (d)]


In the diagram, there is an assembly code fragment which represents some
arbitrary code we are translating at a given time.  Imagine that we
are translating beginning at 'label10'.  Graphic (a) depicts in abstract,
the control flow of the instruction stream as the translator scans it.
Non-branch instructions, such as the 'addl', have only a forward flow,
as the next instruction is implicitly executed.  Conditional branch
instructions have two possible targets, the specified branch target
if the condition is met, and the next instruction if it is not.
For example, the 'jcs' and 'jz' instructions are conditional branches.
In implementation, these instructions might be translated by generating
branches to branch handling routines.  Unconditional branches, such
as the 'ret' instruction, have only one possible flow.

For each potential path of control flow, where the target is known,
the translator can reiterate the process.  When the flow hits an
unconditional branch, the translation process can end.

Now imagine that at some point in the future, we begin translating
at a new address, 'label9'.  As we have already translated the
sequence at 'label10' which follows, we would not necessarily want to
retranslate it.  Since this newly generated code would be located
at an arbitrary place in the translation buffer, it is not easy to
align it so that execution will naturally flow into the translated
code sequence for 'label10'.  Thus, we can simply code a branch to
it instead, creating a sort of artificial control flow, depicted
in (b).


Re-optimizing translated code
=============================

Because of the timing of when the translation logic encounters certain
code sequences, the situation in (b) will occur occasionally.  It
is not a huge loss, but jumps do flush the instruction pipelines,
wasting cycles.  If such a jump is involved in a tight loop, the
impact can be substantial.  We could view figure (b) as instead (c),
noting that the extra colored shim represents an extra branch placed
there because the code sequences are not really contiguous.

If this translated code is not executed much, there is not much point
in optimizing it.  However, if it is evident from whatever usage data
we can gather that this translated code is used frequently, there is
an optimization which can be done.

It is possible to regenerate the code region, reordering translated
sequences such that contiguous ones are placed in the buffer as such,
eliminating extra branch instructions previously imposed by the
translation engine.  This is depicted in (d).

This reordering could be occasionally done by the VM monitor, though if
the translation information is available to the host user space program,
it may be possible to offload it to a separate user space thread.
This would allow such a thread to execute on an idle processor.
This process could look for code regions worth optimizing, re-order
them, and then post the results to the monitor.  The monitor would
be free to accept the changes, or ignore them based on updates to the
region or other reasons.

Dealing with 16-bit code
========================

[Fill this in]