# A Hyper-pipelined S16X4A Successor?

## How S16X4A is structured now

The diagram below illustrates how the S16X4A's data path is currently
structured.  Things to note:

* X, Y, and Z registers form the 3 element data stack.

* All CPU operations either push values onto the stack, pop them off, or
replace the top value of the stack in-place.

* The ALU pre-computes all possible stack outcomes at once, and the opcode just
selects which one becomes the next cycle's stack state.

                    |     |
                    V     |
+---+    +---+    +---+   |
| X |<-->| Y |<-->| Z |   |
+---+    +---+    +---+   |
           |        |     |
           `---. .--'     |
               | |        |
               V V        |
             +-----+      |
             | ALU |      |
             +-----+      |
                |         |

Such a processor is typically called a MISC (Minimal Instruction Set Computer)
architecture processor, because it represents a degree of simplification that
exceeds even that of RISC.
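To make the "select, don't compute" idea above concrete, here is a rough
Python sketch of one cycle.  All names here are invented for illustration
(the real design is HDL, and what lands in X after a pop is a simplifying
assumption), but the shape matches the bullets: every candidate next stack
state is produced every cycle, and the opcode merely selects one.

```python
def step(x, y, z, opcode, literal=0):
    """Return the next (X, Y, Z) stack state for one S16X4A-style cycle."""
    MASK = 0xFFFF
    # Precompute every possible outcome at once, as the hardware does.
    candidates = {
        "NOP": (x, y, z),
        "LIT": (y, z, literal),         # push: old X falls off the bottom
        "ADD": (x, x, (y + z) & MASK),  # pop two, push sum; new X is a
        "AND": (x, x, y & z),           # don't-care (modeled as old X here)
        "XOR": (x, x, y ^ z),
    }
    # The opcode is nothing but a selector into the precomputed states.
    return candidates[opcode]

stack = (0, 0, 0)
stack = step(*stack, "LIT", literal=0x1FFE)
stack = step(*stack, "LIT", literal=0x0002)
stack = step(*stack, "ADD")
assert stack[2] == 0x2000
```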

## Problems with S16X4A

It's actually quite a zippy little processor for how simple it is.  It can
readily compete with a 65816 processor at most tasks, which implies that it
can also play in the 68000's performance playground.

But, there are tasks where the 65816 and 68000 can easily defeat it,
especially those that involve shifts and rotations.  As it is currently
built, it relies heavily on combinatorial logic for its ALU.  The adder
synthesizes to a 16-bit ripple-carry adder in most (all?) cases.  Barrel
shifts and rotates would add considerable propagation delay if we were to
extend its instruction set to include those operations (which I intend to do).

If synthesized alone, the processor can probably reach upwards of 60MHz in
performance.  However, when synthesized in the context of a larger design,
maximum processor speed drops rapidly.  It typically tops out at or
slightly below 30MHz.  Much of this drop-off comes from the bus interface; when
loading or storing data to/from the Z register, the Wishbone bus appears as a
combinatorial circuit in between two sets of registers.  But, assuming I fix
the bus interface to not use such long wires, that still constrains the
processor core itself to a speed significantly less than what most FPGAs are
capable of attaining.

## IDEA: Hyper-pipeline the S16X4A

I am wondering about the feasibility of hyper-pipelining the S16X4A while
preserving its desirable characteristics.

Hyper-pipelining will introduce more latency to compute a result.  However, if
I'm correct, the ability to run the processor faster will more than make up for
this loss, leading to a net gain in overall performance.

Consider 16-bit addition.  For today's FPGAs, this is rarely a performance
bottleneck; but, hear me out.  If we break the addition up into four separate
nybbles, and pipeline between each, we get a data flow similar to this:

+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
| Z3 | Z2 | Z1 | Z0 |<-->| Y3 | Y2 | Y1 | Y0 |<-->| X3 | X2 | X1 | X0 |
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
                                           |                        |
                                           `-----------. .----------'
                                                       | |
                                       to c0 DFF <---| ADD |<--- c_in
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
| Z3 | Z2 | Z1 | Z0 |<-->| Y3 | Y2 | Y1 | Y0 |<-->| X3 | X2 | X1 | X0 |
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
                                      |                        |
                                      `-----------. .----------'
                                                  | |
                                         c1 <---| ADD |<--- from c0 DFF
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
| Z3 | Z2 | Z1 | Z0 |<-->| Y3 | Y2 | Y1 | Y0 |<-->| X3 | X2 | X1 | X0 |
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
                                 |                        |
                                 `-----------. .----------'
                                             | |
                                    c2 <---| ADD |<--- from c1 DFF
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
| Z3 | Z2 | Z1 | Z0 |<-->| Y3 | Y2 | Y1 | Y0 |<-->| X3 | X2 | X1 | X0 |
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
                            |                        |
                            `-----------. .----------'
                                        | |
                            c_out <---| ADD |<--- from c2 DFF
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+
| Z3 | Z2 | Z1 | Z0 |<-->| Y3 | Y2 | Y1 | Y0 |<-->| X3 | X2 | X1 | X0 |
+----+----+----+----+    +----+----+----+----+    +----+----+----+----+

For example, if Y=$1FFE and Z=$0002, let's see what happens in the Y and Z
registers as the ADD instruction propagates from the least to the most
significant nybble.

Clock 0: Y=$1FFE Z=$0002 (add nybble 0)
Clock 1: Y=$1FFx Z=$0001 (nybble 0 has partial sum; Y partially popped; add nybble 1)
Clock 2: Y=$1Fxx Z=$0001 (nybble 1 has partial sum; Y partially popped; add nybble 2)
Clock 3: Y=$1xxx Z=$0001 (nybble 2 has partial sum; Y partially popped; add nybble 3)
Clock 4: Y=$xxxx Z=$2000 (Y=X and Z=Y+Z)
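The arithmetic can be checked with a small Python model (not the actual HDL)
of the nybble-serial adder: each clock, one 4-bit slice of Y and Z is summed,
and the carry out of that slice is latched, playing the role of the c0/c1/c2
DFFs in the diagram above.

```python
def nybble_serial_add(y, z):
    """Add two 16-bit values one nybble per clock.

    Returns (sum, carry_out, trace), where trace lists each clock's
    (nybble index, partial sum nybble, latched carry)."""
    carry = 0                       # models the carry DFF between stages
    result = 0
    trace = []
    for n in range(4):              # clocks 0..3 handle nybbles 0..3
        ys = (y >> (4 * n)) & 0xF
        zs = (z >> (4 * n)) & 0xF
        s = ys + zs + carry
        carry = s >> 4              # latched for the next clock's stage
        result |= (s & 0xF) << (4 * n)
        trace.append((n, s & 0xF, carry))
    return result & 0xFFFF, carry, trace

total, c_out, _ = nybble_serial_add(0x1FFE, 0x0002)
assert total == 0x2000 and c_out == 0
```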

This addition operation would take four clock cycles to complete.  Alone, this
would cause a 4x performance hit in program execution speed.  However, we can
make up for this by allowing multiple instructions to flow through this
pipeline, sequentially.  We can do this because all instructions work on data
in the same order, and rarely need global access to all bits at once.

Consider the operation #(-1) XOR #(1) ADD (used to compute the 2's complement
value of Z):

Clock 0: Y=$yyyy Z=$0002
Clock 1: Y=$yyy2 Z=$000F  #(-1)
Clock 2: Y=$yy0y Z=$00FD  #(-1) XOR
Clock 3: Y=$y0yD Z=$0FF1  #(-1) XOR #(1)
Clock 4: Y=$0yFy Z=$FF0E  #(-1) XOR #(1) ADD
Clock 5: Y=$yFyy Z=$F0FE        XOR #(1) ADD
Clock 6: Y=$Fyyy Z=$0FFE            #(1) ADD
Clock 7: Y=$yyyy Z=$FFFE                 ADD

In the original S16X4 design, this sequence can complete in five clock cycles
(1.25 cycles per instruction).  Here, we're executing four instructions in
eight clock cycles, for a total throughput of two cycles per instruction.
However, if the resulting design allows me to clock the processor core at a
rate faster than 2x the current design, then we gain in real-world
performance.

## Too Good to be True

It turns out the following S16X4 instructions can run with this kind of data
flow:

* ALU operations (AND, XOR, ADD), as the trace above illustrates.
* LIT (#)

There are many which cannot, however:

* Fetch and store instructions (FWM, SWM, FBM, and SBM).
* Control transfer instructions (GO, ZGO, NZGO, ICALL, LCALL).

These require the state of the stack to be stable before they execute, and so
must wait for all data processing through the hyper-pipeline to complete before
taking action.  As a result, these basically incur an additional n-1 cycles (3
in this case), which I believe negates the benefit of hyper-pipelining.

It is interesting that LIT can be pipelined but fetches and stores cannot be.
Although LIT is implemented literally as a fetch from the program counter, it
can get away with pipelining because the program counter's state is always
available in its entirety.  It doesn't need to wait for its value to stabilize.

Maybe there is a way to redesign some of these instructions to not depend so
heavily on the state of the stack?

For fetches and stores, we could use an address register.  Instructions like A!
and A@ can be used to set and retrieve the value of the address register.  But,
to fetch and store *through* the address register, we use instructions like !A
and @A.  An instruction like FWM would be broken into smaller pieces: A! NOP
NOP NOP @A.  The NOPs would be used for timing; in this case, A! would ripple
across the Z register moving 4-bit parts into the A-register.  Once the A
register was loaded, @A and !A would work as expected.  This would be costly
for random access (wasting three instruction slots every time A would change).
However, for pre-decrement and/or post-increment fetches and stores, this would
make streaming over data as fast as the LIT instruction.  Clever code would
even make use of those delay slots to amortize the cost of setting the
A-register.  Consider the Forth "varA @ varB ! varC @ varD !" sequence:

# A! # A! # @A . !A A! # A! . @A . !A
---- ----   --   -- -- ----   --   --
 |    |     |    |  |   |     |    |
 `----------'    |  `---------'    |
      |          |      |          |
      `----------'      `----------'

This would take 15 cycles to execute, and would move 4 bytes (2 bytes per cell,
times two cells), for a transfer speed of about 1 byte every four cycles.  This
compares favorably with the speed of most other 16-bit CPUs when moving data
between arbitrary addresses.

However, the opportunities for code re-ordering are fairly limited in stack
architectures, so it's not clear that this is a valuable technique overall.
And, with how frequently the A register would be reloaded, the only way to
really make this worthwhile is to have lots of registers and, correspondingly,
lots of fetch and store instructions for them.

## Conclusion -- a.k.a., Going About This the Wrong Way

So far, we've taken the approach of executing instructions.  We've lost sight
of the original design detail: instructions don't do anything; they
merely select the next state from a set of all possible states generated
by different circuits.  These circuits will vary in complexity, and therefore,
may take longer to produce a stable output.  Doing things is a very
top-down approach to implementing a processor.  Selecting things,
however, is very bottom-up.  In this way, instructions always,
unconditionally, take one clock cycle to execute.  Whether the state
selected is valid or not is a software concern.

So, we can still hyper-pipeline the adder, shifter, etc. if we wanted to.  But,
with this approach, there's no need; it can be a sluggish, fully combinatorial
design as well.  However, it now becomes a software concern to wait long enough
for the desired state to settle.  For example, if we break the adder into two
8-bit halves, we can configure it so that the low byte of the sum is valid
immediately, and the high byte a cycle later.  Thus, to perform an 8-bit
addition where you're confident the carry doesn't propagate, you just code the
ADD instruction like normal.  But, if you desire a full 16-bit wide addition,
you'd have to insert a NOP in front of the ADD, so as to give the
adder's output time to settle.  Assuming, of course, that you're clocked so
fast that this is even a concern to begin with.
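Here is a sketch in Python of that split-adder behavior (assumed semantics,
not the real core): the low byte of Y+Z is visible on the cycle ADD issues,
while the high byte only settles one cycle later, which is exactly why
software inserts a NOP when it needs all 16 bits to be valid.

```python
class SplitAdder:
    """Adder split into two 8-bit halves, pipelined by one register."""

    def __init__(self):
        self.hi = 0               # stale high byte held in a register

    def clock(self, y, z):
        """One clock edge: return the 16-bit result visible this cycle."""
        lo = (y + z) & 0xFF               # low byte: valid immediately
        visible = (self.hi << 8) | lo     # high byte is last cycle's value
        carry = ((y & 0xFF) + (z & 0xFF)) >> 8
        self.hi = ((y >> 8) + (z >> 8) + carry) & 0xFF  # settles next cycle
        return visible

adder = SplitAdder()
first = adder.clock(0x00FF, 0x0001)    # ADD with no leading NOP
settled = adder.clock(0x00FF, 0x0001)  # one NOP-equivalent cycle later
assert first & 0xFF == 0x00            # low byte already correct
assert settled == 0x0100               # full 16-bit sum now valid
```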

Shifts and rotates would work the same way.  A barrel shifter is convenient to
have; but, a high performance, 64-bit implementation would require over 4096
7-input gates.  A more compact implementation uses a stack of switches (one
layer either doesn't shift at all or shifts by 32 bits; the next optionally by
16 bits; etc.) that takes four to six propagation delays (for 16-bit to 64-bit
operands, respectively) to compute the final result.  As a result, you might
need to insert four NOPs in front of a multi-bit shift instruction; on the
other hand, that's still significantly faster than 15 single-bit shift
instructions.
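The stack-of-switches shifter can be sketched in a few lines of Python: one
conditional stage per power of two, so a 64-bit shifter needs six stages and
a 16-bit one needs four.  (This models only a left shift; the names and the
width parameter are illustrative, not part of any real design.)

```python
def staged_shift_left(value, amount, width=64):
    """Shift left using log2(width) conditional stages (6 for 64 bits)."""
    mask = (1 << width) - 1
    stage = width // 2            # 32, 16, 8, 4, 2, 1 for width=64
    while stage:
        if amount & stage:        # this stage's switch is "on"
            value = (value << stage) & mask
        stage //= 2
    return value

assert staged_shift_left(1, 37) == (1 << 37)     # stages 32, 4, 1 fire
assert staged_shift_left(0xF, 4, width=16) == 0xF0
```

Each stage costs one mux delay, which is where the four-to-six propagation
delays above come from.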

For variable-latency operations, though, it might be better to introduce the
concept of waiting for something to finish.  For example, memory loads
and stores (again) or division computations.  These can take a significant
amount of time.  It might be a good idea to pop some values off the stack and
into some control registers, which kicks off some background operation.  Then,
have a second opcode push the results of the computation along with a
done-flag or something.  That way, a ZGO or NZGO instruction can be used to
loop until the results are known valid.  The processor can keep working on
other things in the meantime.
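As a toy model of that start-then-poll pattern (all names invented here): one
operation pops the operands into control registers and kicks off a
long-latency divide, while a second pushes the result together with a done
flag that a ZGO/NZGO loop can spin on.

```python
class Divider:
    """A background divider with a fixed, illustrative latency."""

    def __init__(self, latency=8):
        self.latency = latency
        self.countdown = 0
        self.result = 0

    def start(self, dividend, divisor):
        """Pop operands into control registers; begin the background op."""
        self.result = dividend // divisor   # computed "in the background"
        self.countdown = self.latency

    def poll(self):
        """Push (result, done); each poll models one trip around the loop."""
        done = self.countdown == 0
        if not done:
            self.countdown -= 1             # another cycle elapses
        return (self.result if done else None), done

div = Divider()
div.start(1000, 24)
polls = 0
while True:                                 # the NZGO busy-wait, in software
    value, done = div.poll()
    polls += 1
    if done:
        break
assert value == 41 and polls == div.latency + 1
```

Between `start` and the final successful `poll`, the processor is free to run
unrelated instructions, which is the whole point.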

Once again, Chuck Moore was right.