# Thinking About Implementing Uxn in FPGA Just for funsies, I've been thinking about how I would go about implementing the Uxn virtual machine as a physical machine instead. NOTE: I'm not committing to undertaking such a project. I just don't have the time for it right now. This is more of a thought exercise. Uxn describes an ML0 CISC stack computer intended primarily for emulation by a more powerful host processor. The goal is to be able to implement an interpreter for its instruction set in maybe a hundred or so lines of C code. It is extremely successful at this task; however, I thought creating a hardware implementation might be a fun side project for future-me to undertake some day. With only 36 core operations, what makes it a CISC architecture? Ironically, its stacks! RISC instructions are characterized not by execution complexity, but rather by how easily they can be decoded into rapidly executing steps, with the ideal being one clock per instruction (often with the help of pipelining). Because Uxn's stacks are large (256 bytes each), it is not very practical to have them inside the processor core (at least for smaller FPGAs). Thus, to push values to a stack (or pop a value from the stack), you must perform a memory transaction and update a pointer, assuming a bounds check succeeds. These steps conflate an exception reporting mechanism (to report stack over-/under-flow), memory I/O (including the support for wait states if required), and load/store mechanisms. The following matrix illustrates how the behavior of the processor changes depending on datum size and whether you're pushing or popping. 8-BIT 16-BIT PUSH IF SP=$FF THEN IF SP=$FE OR SP=$FF THEN HALT(OVERFLOW) HALT(OVERFLOW) ELSE ELSE MEM8(SP) := value8 MEM16(SP) := value16 SP := SP + 1 SP := SP + 2 END END POP IF SP=$00 THEN IF SP=$00 OR SP=$01 THEN HALT(UNDERFLOW) HALT(UNDERFLOW) ELSE ELSE SP := SP - 1 SP := SP - 2 value8 := MEM8(SP) value16 := MEM16(SP) END END Instructions have a flag bit to indicate whether they operate on 8-bit or 16-bit quantities. They also have a flag bit to indicate whether items are genuinely popped off the stack as well; this flag is the so-called "keep" flag. If set, the keep flag tells an instruction to read the contents of the stack like normal, but to leave the stack pointer right where it's at, so that when results are pushed, they appear to be appended to the stack frame instead of replacing it. As you might imagine, this fundamentally alters the behavior of the push and pop operations along a different dimension. 8-BIT (keep) 16-BIT (keep) PUSH IF SP=$FF THEN IF SP=$FE OR SP=$FF THEN HALT(OVERFLOW) HALT(OVERFLOW) ELSE ELSE MEM8(SP) := value8 MEM16(SP) := value16 SP := SP + 1 SP := SP + 2 END END POP IF KP=$00 THEN IF KP=$00 OR KP=$01 THEN HALT(UNDERFLOW) HALT(UNDERFLOW) ELSE ELSE KP := KP - 1 KP := KP - 2 value8 := MEM8(KP) value16 := MEM16(KP) END END While pushing values hasn't changed behavior at all, popping is a different matter. Notice how instead of popping from the stack by updating its stack pointer, we instead fetch memory using a keep pointer (KP) register instead. NOTE: KP is not an architectural register; it is not visible to the Uxn programmer. In order to support this without having at least two copies of microcode lying about for each instruction, the idea is that for each instruction, the KP register is set equal to the corresponding SP. Then, all of the instruction's necessary data are popped off the stack. If the keep flag is not set, then and only then do we set SP equal to KP. Process the data as required, and finally push all the results when done. In summary, this is the generic template for a Uxn instruction: fetch next instruction into instruction register. KP := SP pop all necessary parameters IF keep flag cleared THEN SP := KP END process and produce results. push all results. This follows the usual processor pipeline stages of "instruction fetch", "instruction decode", "execute", and "writeback", but with a few unique additions to account for the keep flag. Since the stacks are accessed so much more than arbitrary memory, it makes sense to consider using a cache for each of the working and return stacks. That would give the processor time to fetch instructions and refill the instruction queue/cache (depending on how we implement it). However, this isn't a hard requirement; if each stack touched memory just like instruction fetches and arbitrary loads and stores did, the result would be a processor with the sequential execution characteristics of earlier processors like the Intel 8080 or TMS9900. Assuming each hit to memory costs one clock cycle (the fastest possible Wishbone transaction), then something like an ADD2 instruction would take 8 clock cycles, while a ROT2 would take an estimated 14 clock cycles to complete. Without any additional optimization, I reckon this would put such a design about on par with a Z80 in overall performance. But, with caches and an instruction queue mechanism (or its equivalent), we can: 1. Combine KP := SP and instruction fetch from the queue in a single cycle operation. 2. Pop all necessary parameters in a single cycle if they are resident in the cache. 3. Optionally reset the SP register based on the keep flag while performing the desired operation. 4. Push all results in a single cycle, again if they'll fit in the current stack cache line. This has the potential to reduce a single instruction's runtime to just 4 cycles, assuming the caches are hot and everything fits. These four cycles can be overlapped in whole or in part with instruction fetch to fill the queue; therefore, to an outside observer, it's conceivable that they'll measure only around 3 cycles of latency per instruction. This is a huge performance benefit. (This is Intel's solution to everything: all we need to do is throw transistors at the problem, I guess.)