# Dynamic Wait-States for the W65C02, W65C816

## What Exactly Is the Problem?

In most 6502-based designs to date, the CPU clock is dictated by the needs of
the I/O.  A Commodore 64's clock is generated by the VIC-II chip, for example,
and its rate depends on whether the machine targets the NTSC (1.02 MHz) or PAL
(0.98 MHz) video standard.  Access to RAM and other I/O is designed carefully to fit within
these hard real-time requirements.  There is never any need for wait-states,
because every chip knows (basically) what all the other chips are doing.

However, that really only works for extremely well-specified products, or for
products where you have total vertical control over the components used (such
as those from Commodore and Atari, which designed their own chips
specifically to work with the 6502 processor).  Under these conditions, the
very simple bus that the 6502 exposes to the world is an absolute joy to work
with.

## Why Is It a Problem?

Let's put our adult lives on hold and become as children for a bit, and make
believe.  Your mission from your boss at IBM: the 8088 project was an abject
failure, and now you're to build a 6502-based computer that supports up to 8
expansion slots for I/O devices to compete with the likes of Apple and
Commodore.  The mainboard of the computer is to be as independent as possible
of both the peripheral speed (because we can't predict how slow the slowest
add-in card will be) and the CPU clock speed (I think the fastest 65C02 was
4 MHz when the PC came out; we know that it runs up to 20 MHz these days).

More importantly, economics will change such that parallel ROMs (to store the
BIOS) eventually give way to serial flash devices.  How would you get a 14 MHz
CPU with a parallel bus to boot off of a serial flash, and then talk to a
5 MHz VIA chip from Western Design Center?  This is
not an academic exercise; if you work with deep embedded systems today,
especially ARM or RISC-V processors, you might already be familiar with
specialized circuitry to accomplish this exact goal.  If it's not on the CPU
itself, it'll certainly be located on what is colloquially known as its
chipset.

You might think that the simplistic bus interface of the 65C02 is insufficient
to address this level of bus asynchrony.  However, with only a small
amount of external logic, you can actually implement an asynchronous bus on par
with Motorola's 68000.  Granted, the 65C816 is a slightly better
choice for this, but the principles will be the same.

## Asynchronous Bus for 65C02

The first thing we need to realize is that the circuits described below will
only work with the CMOS 6502 parts, not with the NMOS parts.  This is
because the NMOS parts observe the state of the RDY signal only when reading,
not when writing.  Further, NMOS parts only sample RDY during phase-1, while
CMOS parts sample it during the phase-2-to-phase-1 transition, giving address
decoders more time to work.  That said, if you're particularly clever with
write-caching, you can probably take the ideas in this article and apply them
to an NMOS design as well.

What we know is that we want the CPU's RDY signal to drop low when a slow
device is first addressed.  We can use an address decoder's select line to know
when a slow device is addressed; however, we can't always know from this signal
alone when one bus transaction stops and another begins.  Consider a
back-to-back read of LDA #$11 from serial flash.  This two-byte instruction
will cause the CPU to read from serial flash twice, back to back and without
interruption.  The serial flash's select line would assert only once during
this (assuming it wasn't already asserted), not the two times one might expect.
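
The effect is easy to see in a small behavioral sketch.  The address map below
(serial flash at $E000-$FFFF) and the names `flash_select_n` and
`count_select_assertions` are assumptions made up for this illustration; they
are not part of any real design.

```python
# Hypothetical address map: serial flash window at $E000-$FFFF (an assumption).
FLASH_LO, FLASH_HI = 0xE000, 0xFFFF

def flash_select_n(addr):
    """Active-low select: False (low) while the flash window is addressed."""
    return not (FLASH_LO <= addr <= FLASH_HI)

def count_select_assertions(addresses):
    """Count high-to-low transitions of the select line over a bus trace."""
    previous = True   # select assumed deasserted (high) before the trace
    edges = 0
    for addr in addresses:
        current = flash_select_n(addr)
        if previous and not current:
            edges += 1
        previous = current
    return edges

# One zero-page access, then the opcode and operand fetches of LDA #$11.
trace = [0x0010, 0xE000, 0xE001]
print(count_select_assertions(trace))  # prints 1
```

Two flash reads produce a single falling edge on the select line, so the select
alone cannot tell a state machine where the second transaction begins.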

Thankfully, we already know that the 65C02 completes a bus transaction on every
cycle where RDY is asserted.  Therefore, if RDY is asserted during cycle n-1,
then we know that cycle n must be the start of a new bus transaction.  We can
therefore capture this cycle-start as a new signal to be shared with all
devices, which I'll name START, like so:

                 +------+
    RDY o--------|D    Q|--------> START
                 |      |
    PHI2 o------o|>     |
                 +------+
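
As a rough behavioral model (not a hardware description), the flip-flop can be
sketched in Python.  The sketch assumes one list entry per PHI2 cycle, with
each entry giving the level of RDY at the falling edge that ends the cycle.
The name `start_signal` and the `initial` parameter are hypothetical.

```python
def start_signal(rdy_per_cycle, initial=True):
    """Return the START level seen during each cycle.

    rdy_per_cycle[n] is RDY at the falling PHI2 edge that ends cycle n.
    `initial` is the assumed flip-flop state before the first cycle.
    """
    start = []
    q = initial
    for rdy in rdy_per_cycle:
        start.append(q)   # START during this cycle is the previously latched RDY
        q = rdy           # the falling PHI2 edge latches RDY for the next cycle
    return start

# A four-cycle access (three wait-states) followed by single-cycle accesses.
rdy = [False, False, False, True, True, True]
print(start_signal(rdy))  # prints [True, False, False, False, True, True]
```

START is simply RDY delayed by one cycle, so it is high exactly in those cycles
that follow a completed transaction.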

The 65C816 adds a little bit of complexity thanks to the VPA/VDA signals.
These can be used to actually qualify valid versus internal bus cycles,
allowing internal cycles to always run at maximum speed.

                 +------+
    RDY o--------|D    Q|----.
                 |      |    |
    PHI2 o------o|>     |    |    +------+
                 +------+    `----|      |
                                  |  *1  |--------> START
                 +------+    .----|      |
    VPA o--------|      |    |    +------+
                 |  +1  |----'
    VDA o--------|      |
                 +------+

If you don't care about introducing wait-states for internal cycles, then you
can still use the 65C02 circuit.
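
The 65C816 gating can be written out as a truth table.  This is only a sketch
of the combinational part of the diagram above; `rdy_latched` stands for the
flip-flop output, and the function name `start_816` is made up for this
example.

```python
from itertools import product

def start_816(rdy_latched, vpa, vda):
    """START for the 65C816: latched RDY, qualified by a valid bus cycle."""
    valid_cycle = vpa or vda          # the "+1" (OR) gate
    return rdy_latched and valid_cycle  # the "*1" (AND) gate

# Enumerate all eight input combinations.
for rdy_latched, vpa, vda in product([False, True], repeat=3):
    print(rdy_latched, vpa, vda, "->", start_816(rdy_latched, vpa, vda))
```

START is only raised when the previous cycle completed and the current cycle
carries a valid program or data address.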

Once we have this new signal, we can qualify it against an address decoder's
select output to kick off a timer of some sort.  After this timer expires, the
peripheral addressed will drive its own personal RDY signal, which should cause
the CPU to continue.  Basically, we are looking for a timing diagram similar to
the following, depicting two back-to-back hits on a slow device:

                     ____      ____      ____      ____      ____      ____      ____
        PHI2    ____/    \____/    \____/    \____/    \____/    \____/    \____/
                __________ _______________________________________ __________________
        ADDR    __________X_______________________________________X__________________
                ___________
        SLO#               \_________________________________________________________
                _____________________                               _________
        START   _____________/       \_____________________________/         \_______
                ______________                            _________
        SLORDY  ______________\__________________________/___/     \_________________
                ________________                               _______
        RDY     ______________\_\_____________________________/     \_\______________

When the slow device is selected and we know it's the start of a bus cycle,
we can start a state machine that drives SLORDY low until the right time.
With our example above, a divide-by-four circuit would decode count=3 to drive
SLORDY high, while START together with an asserted SLO# (START /\ SLO#) would
reset the counter back to 0.

                   ,-------------------------------------*--------------.
                   |   +------+                          |              |
        START o----*---|      |        +-----+        +-----+        +-----+
                       |  *1  |--------|D   Q|--------|D R Q|--------|D R Q|--------> SLORDY
        SLO#  o-------o|      |        |     |        |     |        |     |
                       +------+    .--o|>    |    .--o|>    |    .--o|>    |
                                   |   +-----+    |   +-----+    |   +-----+
        PHI2  o--------------------*--------------*--------------'


What we have here is not much different from what we'd find in a typical
DTACK-generator for an MC68000-based computer.

(Remember, this circuit is only representative; the precise state machinery
necessary for your peripherals will likely look very different.)
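
To check the idea, here is a cycle-level Python sketch of the counter described
in the text (reset on START with the device selected, SLORDY high at the
terminal count).  It models the description, not the exact flip-flop chain in
the diagram.  It assumes the slow device stays selected throughout, that RDY
at the CPU equals SLORDY, and that RDY was high before the first access.  The
name `simulate_slow_accesses` and the `terminal` parameter are hypothetical.

```python
def simulate_slow_accesses(accesses, terminal=3):
    """Return a list of (START, SLORDY) pairs, one per PHI2 cycle."""
    trace = []
    start = True          # previous cycle assumed to have completed
    count = terminal      # counter assumed idle at its terminal count
    done = 0
    while done < accesses:
        if start:         # START with the device selected resets the counter
            count = 0
        slordy = (count == terminal)
        trace.append((start, slordy))
        if slordy:        # RDY high at cycle end: the CPU completes the access
            done += 1
        start = slordy    # START latch captures RDY on the falling PHI2 edge
        if count < terminal:
            count += 1    # counter advances once per cycle
    return trace

for cycle, (start, slordy) in enumerate(simulate_slow_accesses(2)):
    print(cycle, "START" if start else "     ", "SLORDY" if slordy else "")
```

With `terminal=3`, each access occupies four cycles in this model: one with
START high, followed by three more, the last of which has SLORDY high.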

OK, we've used START and our device-specific select to identify when to start
our RDY-state-machine; but how do we route that signal back to the CPU?  As you
might imagine, just as the device driving the data bus is controlled by the
select, so too is the RDY signal.

                       +------+------+ -
        VPA    o------o|  *1  |  +1  |  |_ for 65C816-based designs only.
        VDA    o------o|      |      |  |
                       +------+      | -
        SLO#   o------o|  *1  |      |
        SLORDY o-------|      |      |
                       +------+      |
        ROM#   o------o|  *1  |      |
        ROMRDY o-------|      |      |
                       +------+      |
        RAM#   o------o|  *1  |      |
        RAMRDY o-------|      |      |
                       +------+      |--------> RDY (to the CPU)
        S0#    o------o|  *1  |      |
        S0RDY  o-------|      |      |
                       +------+      |
        S1#    o------o|  *1  |      |
        S1RDY  o-------|      |      |
                       +------+      |
                      ///    ///    ///
                       +------+      |
        S7#    o------o|  *1  |      |
        S7RDY  o-------|      |      |
                       +------+------+

It's as easy as that.
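
In software terms, the gating above is a sum of products.  The following
Python sketch mirrors the diagram; the function name `cpu_rdy` and the
representation of devices as `(select_n, ready)` pairs are assumptions made
for illustration.  For a 65C02, leave `vpa` and `vda` at their defaults so the
65C816-only term never contributes.

```python
def cpu_rdy(devices, vpa=True, vda=True):
    """Combine per-device RDY signals into the CPU's RDY input.

    devices: iterable of (select_n, ready) pairs, where select_n is the
    active-low select (False means selected) and ready is that device's RDY.
    """
    # 65C816 only: internal cycles (VPA and VDA both low) never wait.
    internal_cycle = (not vpa) and (not vda)
    # Each AND term passes a device's RDY only while that device is selected.
    return internal_cycle or any((not select_n) and ready
                                 for select_n, ready in devices)

slow = (False, False)   # selected, still busy
ram = (True, True)      # not selected, ready
print(cpu_rdy([slow, ram]))  # prints False
```

Note that an unselected device cannot release the CPU, no matter what its own
RDY output is doing.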

What happens if the CPU addresses a block of memory which isn't decoded?  That
seems like it would jam the processor until the next hard reset.  Indeed, that
is the case.  As presented here, I only account for decoded devices.  There are
many ways of handling the case of a bus error, however.  One approach is to
fully decode the address space and include a "default RDY generator" that
applies to all otherwise unused portions of memory.  (For 65C816 devices,
perhaps you might also want to pulse the ABORT# signal too.)  Another approach
is to have a default RDY generator which is OR-ed with the signal above as a
fail-safe.  The START signal acts as a watchdog timer reset for this circuit,
ensuring it never fires spuriously.
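
The fail-safe variant might be modeled as follows.  This is only a sketch under
stated assumptions: no device responds, the default generator is a counter
that START resets, and its output is OR-ed into RDY once the count reaches
`timeout`.  Presumably `timeout` has to exceed the wait count of the slowest
real device, otherwise the fail-safe would cut legitimate accesses short.  The
name `undecoded_access_cycles` is hypothetical.

```python
def undecoded_access_cycles(timeout):
    """Count PHI2 cycles until the default RDY generator frees the CPU."""
    count = 0
    start = True              # previous cycle assumed to have completed
    cycles = 0
    while True:
        cycles += 1
        # START acts as the watchdog reset; otherwise the counter advances.
        count = 0 if start else count + 1
        default_rdy = count >= timeout
        device_rdy = False    # nothing is decoded at this address
        rdy = device_rdy or default_rdy
        if rdy:
            return cycles
        start = rdy           # START latch captures RDY at the end of the cycle

print(undecoded_access_cycles(8))  # prints 9
```

In this model an access to an unmapped address costs `timeout + 1` cycles
instead of hanging the processor.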

## Why Is This Solution Valuable?

Bus asynchrony brings potential compatibility with a wider variety of
peripherals, and/or enables the use of design methods with more favorable
economics.  For example, asynchrony is a vital requirement for compatibility
with the STE-bus specification.

The circuitry costs a handful of D flip-flops plus a bunch of 2-input AND and
OR gates.  Clever engineers might also use 74138-style 1-of-8 decoders to
reduce discrete component counts.

The 65C02 and 65C816 often appear in circuits which are extremely
cost-sensitive and fixed in function.  All of the logic discussed above adds to
the circuit complexity and, therefore, to the overall cost of development.  If
you are working with discrete components, you might want to forgo this
additional complexity and stick with fully synchronous designs.