# Thinking About Implementing Uxn in FPGA

Just for funsies, I've been thinking about how I would go about implementing
the Uxn virtual machine as a physical machine instead.

NOTE: I'm not committing to undertaking such a project.  I just don't
have the time for it right now.  This is more of a thought exercise.

Uxn describes an ML0 CISC stack computer intended primarily for emulation by a
more powerful host processor.  The goal is to be able to implement an
interpreter for its instruction set in maybe a hundred or so lines of C code.
It is extremely successful at this task; however, I thought creating a hardware
implementation might be a fun side project for future-me to undertake some day.

With only 36 core operations, what makes it a CISC architecture?  Ironically,
its stacks!  RISC instructions are characterized not by execution complexity,
but rather by how easily they can be decoded into rapidly executing steps, with
the ideal being one clock per instruction (often with the help of pipelining).
Because Uxn's stacks are large (256 bytes each), it is not very practical to
have them inside the processor core (at least for smaller FPGAs).  Thus, to
push values to a stack (or pop a value from the stack), you must perform a
memory transaction and update a pointer, assuming a bounds check succeeds.
These steps conflate an exception reporting mechanism (to report stack
over-/under-flow), memory I/O (including the support for wait states if
required), and load/store mechanisms.

The following matrix illustrates how the behavior of the processor changes
depending on datum size and whether you're pushing or popping.

                8-BIT                   16-BIT

    PUSH        IF SP=$FF THEN          IF SP=$FE OR SP=$FF THEN
                  HALT(OVERFLOW)          HALT(OVERFLOW)
                ELSE                    ELSE
                  MEM8(SP) := value8      MEM16(SP) := value16
                  SP := SP + 1            SP := SP + 2
                END                     END

    POP         IF SP=$00 THEN          IF SP=$00 OR SP=$01 THEN
                  HALT(UNDERFLOW)         HALT(UNDERFLOW)
                ELSE                    ELSE
                  SP := SP - 1            SP := SP - 2
                  value8 := MEM8(SP)      value16 := MEM16(SP)
                END                     END

Instructions have a flag bit to indicate whether they operate on 8-bit or
16-bit quantities.  They also have a flag bit to indicate whether items are
genuinely popped off the stack as well; this flag is the so-called "keep" flag.
If set, the keep flag tells an instruction to read the contents of the stack
like normal, but to leave the stack pointer right where it's at, so that when
results are pushed, they appear to be appended to the stack frame
instead of replacing it.  As you might imagine, this fundamentally alters the
behavior of the push and pop operations along a different dimension.

                8-BIT (keep)            16-BIT (keep)

    PUSH        IF SP=$FF THEN          IF SP=$FE OR SP=$FF THEN
                  HALT(OVERFLOW)          HALT(OVERFLOW)
                ELSE                    ELSE
                  MEM8(SP) := value8      MEM16(SP) := value16
                  SP := SP + 1            SP := SP + 2
                END                     END

    POP         IF KP=$00 THEN          IF KP=$00 OR KP=$01 THEN
                  HALT(UNDERFLOW)         HALT(UNDERFLOW)
                ELSE                    ELSE
                  KP := KP - 1            KP := KP - 2
                  value8 := MEM8(KP)      value16 := MEM16(KP)
                END                     END

While pushing values hasn't changed behavior at all, popping is a different
matter.  Notice how instead of popping from the stack by updating its stack
pointer, we instead fetch memory using a keep pointer (KP) register
instead.  NOTE: KP is not an architectural register; it is not visible
to the Uxn programmer.

In order to support this without having at least two copies of microcode lying
about for each instruction, the idea is that for each instruction, the KP
register is set equal to the corresponding SP.  Then, all of the
instruction's necessary data are popped off the stack.  If the keep flag is
not set, then and only then do we set SP equal to KP.  Process the data as
required, and finally push all the results when done.  In summary, this
is the generic template for a Uxn instruction:

    fetch next instruction into instruction register.
    KP := SP
    pop all necessary parameters
    IF keep flag cleared THEN
      SP := KP
    END
    process and produce results.
    push all results.

This follows the usual processor pipeline stages of "instruction fetch",
"instruction decode", "execute", and "writeback", but with a few unique
additions to account for the keep flag.

Since the stacks are accessed so much more than arbitrary memory, it makes
sense to consider using a cache for each of the working and return
stacks.  That would give the processor time to fetch instructions and refill
the instruction queue/cache (depending on how we implement it).  However, this
isn't a hard requirement; if each stack touched memory just like instruction
fetches and arbitrary loads and stores did, the result would be a processor
with the sequential execution characteristics of earlier processors like the
Intel 8080 or TMS9900.  Assuming each hit to memory costs one clock cycle (the
fastest possible Wishbone transaction), then something like an ADD2 instruction
would take 8 clock cycles, while a ROT2 would take an estimated 14 clock cycles
to complete.  Without any additional optimization, I reckon this would put such
a design about on par with a Z80 in overall performance.

But, with caches and an instruction queue mechanism (or its equivalent), we
can:

1. Combine KP := SP and instruction fetch from the queue in a single cycle
   operation.

2. Pop all necessary parameters in a single cycle if they are resident in the
   cache.

3. Optionally reset the SP register based on the keep flag while performing the
   desired operation.

4. Push all results in a single cycle, again if they'll fit in the current
   stack cache line.

This has the potential to reduce a single instruction's runtime to just 4
cycles, assuming the caches are hot and everything fits.  These four cycles can
be overlapped in whole or in part with instruction fetch to fill the queue;
therefore, to an outside observer, it's conceivable that they'll measure only
around 3 cycles of latency per instruction.  This is a huge performance
benefit.  (This is Intel's solution to everything: all we need to do is throw
transistors at the problem, I guess.)