# Thoughts about Byte Enables

You're building a processor in an FPGA, and you need to move more than 8 bits
of data in a single cycle.  How do you tell the outside world which bytes are
valid?  The obvious way, explicit lane selects, works great but can take up a
lot of pins as bus widths grow wider than 32 bits.  What to do?

In this web-memo, I think way too hard about this problem.

## First, a bit of background...

Let's consider the RC2014 backplane bus, which is essentially a plain-vanilla
Z80 bus interface.  You have 16 address lines, 8 data lines, and a handful of
handshakes and status bits indicating when a bus cycle starts, what kind it is,
how long it'll last, etc.  Details really aren't important here.  What
is important, though, is that you can address 65536 bytes with the
address bus provided.  Conveniently, the bus can transfer exactly one byte at a
time.  Easy peasy, life's good, and everything just works.

Now, suppose someone wants to introduce a CPU card that has a wider data bus.
For grins, let's assume an Intel 8086.  Please note: I'm not talking
about an 8088, its 8-bit little brother.  That would be way too easy.  I'm
talking an honest-to-goodness, dyed-in-the-wool 16-bit chip here.  Well, if you
look at one of the community enhancements to the RC2014 bus, you'll see the
BP80 backplane, which sure enough, offers a 16MB address space and a true
16-bit data bus.  We could use that!  So you try that, and ...

...

You'll soon enough notice that there's no way to identify which half of the bus
is currently transferring data, which is kind of important when your
processor's smallest data unit is an 8-bit byte, yet exposes a 16-bit or wider
bus.  If all you're doing is reading data into the processor, you can cheat.
Even addresses (that is, addresses where A0=0) can almost always safely read
memory either as a full 16-bit word or as an 8-bit byte, as the lower
byte will be the same either way.  Likewise, if the address is odd, you can
safely assume the upper byte holds the data you're looking for, presumably
because address bit A0 will be ignored by whatever memory card you're reading
from.

The problem comes when you're trying to access a device where these assumptions
don't hold.  For example, attempting to write a single byte will result in the
opposite byte becoming trashed.  A read-modify-write transaction would be a
useful fix for this, if performance weren't a criterion.  But, when it comes to
touching I/O devices, even this work-around won't work, since merely reading or
rewriting a device register can have side effects.  You need some means of
telling the hardware which bytes are valid when writing.  You can't get around
it.

So, as defined, the BP80 will work great for processors similar to a TMS9900
(whose smallest unit of memory is indeed a 16-bit word), but not for an 8086 or
80286.  UNLESS, that is, you use one of the currently undefined pins as
a way to expose additional information.  As it happens, both of these
processors expose a pin called Bus High Enable, or BHE#, which I'll treat as a
byte high enable throughout.  (The #-suffix is there to indicate that it is an
active low signal.)

Here's how BHE# works in conjunction with A0:

    | BHE# | A0 | Address | Size | Lane(s)       |
    |:-----|:---|:--------|:-----|:--------------|
    | 0    | 0  | Even    | Word | D15-D8, D7-D0 |
    | 0    | 1  | Odd     | Byte | D15-D8        |
    | 1    | 0  | Even    | Byte | D7-D0         |
    | 1    | 1  | illegal |  --  | none          |

Notice how, for even addresses, A0 is low; that is, the even-address lane is
selected when A0 is zero.  This is why BHE# is also active-low.  If you
use A0 as a lane select signal, then having an active-low BHE# signal lets you
use it as a lane select signal with the same logic circuits.  It would allow
you, for example, to place a Z80 on this 16-bit bus with relative ease:

    A15 |--------------> A15
     :  |
     :  |
     A0 |----*---------> A0
        |    |
        |    |   |\
        |    |   | \
        |    `---|  >o-> BHE#
        |        | /
        |        |/
        |

(Mapping D15-D8 to D7-D0 based on the state of BHE# not shown for simplicity,
but that logic isn't hard either.)
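To make the table and the Z80 adapter concrete, here is a minimal Python sketch.  The function names `decode_lanes` and `z80_to_16bit` are made up for illustration; the only assumptions are the ones stated above, namely that A0 and BHE# both behave as active-low lane selects and that the adapter derives BHE# by inverting A0.

```python
def decode_lanes(bhe_n, a0):
    """Return the data-bus lanes selected by the (BHE#, A0) pair.

    Both signals act as active-low lane selects:
    A0 = 0 enables D7-D0, and BHE# = 0 enables D15-D8.
    """
    lanes = []
    if bhe_n == 0:
        lanes.append("D15-D8")
    if a0 == 0:
        lanes.append("D7-D0")
    return lanes


def z80_to_16bit(address):
    """Derive (BHE#, A0) for an 8-bit CPU: BHE# is simply NOT A0."""
    a0 = address & 1
    bhe_n = a0 ^ 1
    return bhe_n, a0


# An 8-bit CPU only ever touches one lane per cycle.
for address in (0x1234, 0x1235):
    bhe_n, a0 = z80_to_16bit(address)
    print(hex(address), decode_lanes(bhe_n, a0))
```

Because the inverter guarantees that BHE# and A0 always differ, the Z80 can never produce the word row or the illegal row of the table.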

Basically, the 8086's A19-A0 presents the address just as you would expect,
coming from your Z80 or 8080 background.  But, now that there are two bytes
transferred, A0 and BHE# together indicate which halves of the bus are valid.
The table above may look somewhat complicated, though.  It seems like BHE# is
valid only sometimes.  I'm going to argue that's because Intel chose a set of
poor names for these signals.

Let's change the names of the signals to something similar from another family
of microprocessors you might be familiar with, the MC68000:

    | UDS# | LDS# | Address | Size | Lane(s)       |
    |:-----|:-----|:--------|:-----|:--------------|
    | 0    | 0    | Even    | Word | D15-D8, D7-D0 |
    | 0    | 1    | Odd     | Byte | D15-D8        |
    | 1    | 0    | Even    | Byte | D7-D0         |
    | 1    | 1    | illegal |  --  | none          |

Officially, the 68000 does not expose address bit A0.  You have A1-A23,
and two signals, UDS# and LDS#, short for upper and lower data
strobes, respectively.  Now, with these names, it becomes clear how these work
in conjunction with the address bus: address bits A1-A23 select the memory
location at 16-bit granularity, while the UDS# and LDS# signals select which
byte (or both) of the data bus is used to convey data.

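NOTE: Sharp-eyed readers will note that the Address column above still follows
the 8086's convention.  On a real 68000, UDS# strobes D15-D8 and LDS# strobes
D7-D0, just as the names suggest, but the even-addressed byte travels on D15-D8
and the odd-addressed byte on D7-D0, which is the opposite of the arrangement
shown above.  The reason is simple: the 8086 is a little-endian processor,
while the 68000 is a big-endian processor.  This means, on the 68000, UDS#
plays the role of address bit A0 (it selects the even byte), and LDS# plays the
role of BHE# (it selects the odd byte).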

So, as you can see, both Intel and Motorola are actually using the same
mechanism to support 16-bit transfers on their respective chips, endian issues
notwithstanding.  The only cost to the hardware engineer is a single
additional signal, the UDS# or BHE# signal, for a total of 17 bits to express
an address.

## Let's Go Wider!

Now, if we want to use a 32-bit processor, the BP80 backplane with its 16-bit
data bus will not be sufficient.  We'll need to add another 16 data bits, plus
yet more lane selects to indicate if the new lanes hold valid data.  As you can
imagine, there are several ways of doing this.  The most obvious approach is,
again, explicit lane selects.  Your address bus will now consist of A15-A2,
since A1-A0 are decoded internally to drive the four byte lane selects
(BE3-BE0).

Now, if you count the signals used, you'll see that we again can get away with
only a single extra signal: 4 lane selects plus 14 address bits equals 18 bits
total.  So, this looks like it's the most effective approach still.  Great!
But, is there a different way we can get the same functionality without blowing
up our interface?  It turns out, yes there is.  Whether or not it's
advantageous for you will depend on your precise project requirements.

Since we're now dealing with a 32-bit word size, let's examine the A0/BHE
relationship again, but this time using half-word for a 16-bit quantity.

    | BHE# | A0 | Address | Size      | Lane(s)       |
    |:-----|:---|:--------|:----------|:--------------|
    | 0    | 0  | Even    | Half-word | D15-D8, D7-D0 |
    | 0    | 1  | Odd     | Byte      | D15-D8        |
    | 1    | 0  | Even    | Byte      | D7-D0         |
    | 1    | 1  | illegal |  --       | none          |

What if we take this pattern and just scale it up to entire 16-bit halfwords?
In addition to a byte high-enable, we now also have a halfword high-enable.
Its corresponding address line would be the A1 signal, like so:

    | HHE# | A1 | Address | Size      | Lane(s)         |
    |:-----|:---|:--------|:----------|:----------------|
    | 0    | 0  | Even    | Word      | D31-D16, D15-D0 |
    | 0    | 1  | Odd     | Half-word | D31-D16         |
    | 1    | 0  | Even    | Half-word | D15-D0          |
    | 1    | 1  | illegal |  --       | none            |

By distributing the A0/BHE across both halfwords, we can compose these two
masks, which lets us address any byte, any aligned half-word, and/or any word
in memory.

         .---------------------------*----------------------- BHE#
         |             .-------------------------*----------- A0
         |             |             |           |
         |             |             |           |
         |             |             |           |
         | .-------------*----------------------------------- HHE#
         | |           | |           | .-----------*--------- A1
         | |           | |           | |         | |
         o o           o o           o o         o o
        +---+         +---+         +---+       +---+
        | & |         | & |         | & |       | & |
        +---+         +---+         +---+       +---+
          o             o             o           o
          |             |             |           |
          |             |             |           |
         BE3#          BE2#          BE1#        BE0#
          |             |             |           |
          |             `-----------.,'           |
          `------------------------.||,-----------'
                                   ||||
    | HHE# |  A1  | BHE# |  A0  | Lanes | Notes   |
    |:-----|:-----|:-----|:-----|:------|:--------|
    |  0   |  0   |   0  |   0  |  XXXX |         |
    |  0   |  0   |   0  |   1  |  X.X. | illegal |
    |  0   |  0   |   1  |   0  |  .X.X | illegal |
    |  0   |  0   |   1  |   1  |  .... | illegal |
    |  0   |  1   |   0  |   0  |  XX.. |         |
    |  0   |  1   |   0  |   1  |  X... |         |
    |  0   |  1   |   1  |   0  |  .X.. |         |
    |  0   |  1   |   1  |   1  |  .... | illegal |
    |  1   |  0   |   0  |   0  |  ..XX |         |
    |  1   |  0   |   0  |   1  |  ..X. |         |
    |  1   |  0   |   1  |   0  |  ...X |         |
    |  1   |  0   |   1  |   1  |  .... | illegal |
    |  1   |  1   |   0  |   0  |  .... | illegal |
    |  1   |  1   |   0  |   1  |  .... | illegal |
    |  1   |  1   |   1  |   0  |  .... | illegal |
    |  1   |  1   |   1  |   1  |  .... | idle    |

    "Illegal" doesn't necessarily imply wrong or useless; however, such states
    will not be generated by a conventional processor.

    "Idle" means a quiescent bus state, where no data transfers are occurring.

We've established that we can convert one form to another with just a handful
of OR-gates; are there any advantages to selecting byte lanes this way?  For
32-bit or narrower interfaces, I really don't know; in fact, I suspect not, or
we'd see this interface used in more places.
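As a sanity check on the OR-gate conversion, here is a small Python sketch that regenerates the Lanes column of the table above.  The names `lane_enables_32` and `lanes_string` are hypothetical; the gate equations are read straight off the diagram, with every signal treated as active low.

```python
def lane_enables_32(hhe_n, a1, bhe_n, a0):
    """Return (BE3#, BE2#, BE1#, BE0#), all active low.

    A lane is enabled only when both of its controlling signals are low,
    which is an OR gate when working with active-low values.
    """
    be3_n = hhe_n | bhe_n   # upper half-word, upper byte
    be2_n = hhe_n | a0      # upper half-word, lower byte
    be1_n = a1 | bhe_n      # lower half-word, upper byte
    be0_n = a1 | a0         # lower half-word, lower byte
    return be3_n, be2_n, be1_n, be0_n


def lanes_string(enables):
    """Render active-low enables as the X/. notation used in the table."""
    return "".join("X" if enable == 0 else "." for enable in enables)


# Walk all 16 input combinations in the same order as the table.
for code in range(16):
    hhe_n, a1 = (code >> 3) & 1, (code >> 2) & 1
    bhe_n, a0 = (code >> 1) & 1, code & 1
    print(hhe_n, a1, bhe_n, a0, lanes_string(lane_enables_32(hhe_n, a1, bhe_n, a0)))
```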

Still, I claim it is worthy of study.

I do know that it might require fewer logic gates on the part of the
processor, since it can just route all address bits to the memory interface
directly without further interpretation, setting each high-enable to the
inverse of its corresponding address bit.  For multi-byte transfers, just force
one or more high-enables low as appropriate.  This is a complexity win for the
CPU.
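Here is a rough Python sketch of what the CPU side might look like for a 32-bit bus.  The function name `encode_high_enables` and the restriction to naturally aligned 1, 2, and 4 byte accesses are my assumptions for the sake of illustration.

```python
def encode_high_enables(address, size):
    """Return (HHE#, A1, BHE#, A0) for an aligned 1-, 2-, or 4-byte access."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2, or 4 bytes")
    if address % size != 0:
        raise ValueError("access must be naturally aligned")

    # Address bits pass straight through to the bus.
    a0 = address & 1
    a1 = (address >> 1) & 1

    # By default each high-enable is the inverse of its address bit.
    bhe_n = a0 ^ 1
    hhe_n = a1 ^ 1

    # Wider transfers force the relevant high-enables low.
    if size >= 2:
        bhe_n = 0
    if size == 4:
        hhe_n = 0
    return hhe_n, a1, bhe_n, a0


print(encode_high_enables(0x1003, 1))  # byte in the top lane
print(encode_high_enables(0x1002, 2))  # upper half-word
print(encode_high_enables(0x1000, 4))  # full word
```

Note that there's no decoder anywhere in this sketch: just two inverters and a bit of gating controlled by the transfer size.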

However, we still require lane selects to be generated elsewhere as a
consequence of that decision, since memory chips are always individually
enabled.  Additionally, especially since it doesn't save any pins on the
interface, you're kind of just being a jerk if you're forcing a more complex
bus decoding logic on an engineer.

## These Barn Doors Aren't Going to Open Themselves

However, it turns out that the 32-bit data path is something of an inflection
point.  What if we want to now update to a 64-bit bus?  With a conventional
design, we would need to add 32 more data lines, and four more lane
enables, while subtracting only one address line.  So, 13 address bits (A15-A3)
plus a total of 8 lane enables equals...oh dear.  That's 21 bits!  Suddenly,
the simple relationship we had when widening an 8-bit bus to a 32-bit bus no
longer holds!  And, it gets worse with still wider buses; it's not uncommon to
use 128-bit buses to transport entire cache lines, for example.  A 128-bit bus
moves 16 bytes per transfer cycle, so our address bus would only signal A15-A4,
while we would have sixteen byte lane selects.

    | Data Width | Lane Selects | Address Bits |  Pins |
    |:----------:|:------------:|:------------:|:-----:|
    |     8      |      0*      |      16      |   24* |
    |    16      |      2       |      15      |   33  |
    |    32      |      4       |      14      |   50  |
    |    64      |      8       |      13      |   85  |
    |   128      |     16       |      12      |  156  |
    |   256      |     32       |      11      |  299  |

    * - CPUs which have idle bus states need some way of indicating whether or
    not the bus is currently in use.  One could argue such a signal counts as a
    lane select.  If you agree with this argument, add 1.

What can we do?!

Assuming our goal remains minimizing the number of pins on some data port or
connector, then this is the point at which not exposing explicit lane selects
makes a whole lot of sense, for their numbers are growing much faster than the
number of address lines we'd remove to compensate.  It turns out our clever
technique of overlapping high-enable masks comes to the rescue!

With a 16-bit bus, we've doubled up our bytes, so it makes sense to have a Byte
High Enable signal to complement address line A0.  With a 32-bit bus, we've
doubled up our half-words; thus we introduced a Halfword High Enable to
complement A1.  With a 64-bit bus, therefore, we've doubled up our words.
Thus, it follows we need only introduce a word high-enable (WHE#) signal to
serve as a complement to our A2 address line.

With this configuration, instead of 21 addressing bits, you're exposing 19.
It's not much of a savings at this point, but the savings is measurable.  With
a 128-bit data width, the savings is significant.

    | Data Width | High Enables | Address Bits |  Pins | Savings |
    |:----------:|:------------:|:------------:|:-----:|:-------:|
    |     8      |      0       |      16      |   24  |    0    |
    |    16      |      1       |      16      |   33  |    0    |
    |    32      |      2       |      16      |   50  |    0    |
    |    64      |      3       |      16      |   83  |    2    |
    |   128      |      4       |      16      |  148  |    8    |
    |   256      |      5       |      16      |  277  |   22    |
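The arithmetic behind both pin-count tables is simple enough to sketch in Python.  The function names are mine, and the sketch assumes the same 16-bit byte address space used throughout, with the 8-bit bus needing no lane selects at all.

```python
def pins_explicit(data_width, address_bits=16):
    """Pin count with one explicit lane select per byte lane."""
    lanes = data_width // 8
    dropped = lanes.bit_length() - 1        # log2(lanes) low address bits vanish
    selects = lanes if lanes > 1 else 0     # an 8-bit bus needs no lane selects
    return data_width + selects + (address_bits - dropped)


def pins_high_enable(data_width, address_bits=16):
    """Pin count with one high-enable per doubling of the bus width."""
    lanes = data_width // 8
    high_enables = lanes.bit_length() - 1   # log2(lanes)
    return data_width + high_enables + address_bits


for width in (8, 16, 32, 64, 128, 256):
    explicit = pins_explicit(width)
    encoded = pins_high_enable(width)
    print(width, explicit, encoded, explicit - encoded)
```

Explicit lane selects grow linearly with the bus width, while high-enables grow with its logarithm, which is where the savings column comes from.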

Here are the byte-enable patterns a 64-bit interface is capable of generating.
Notice the almost fractal-like nature of valid enables.

    | WHE# |  A2  | HHE# |  A1  | BHE# |  A0  | Lanes     | Notes   |
    |:-----|:-----|:-----|:-----|:-----|:-----|:----------|:--------|
    |  0   |  0   |  0   |  0   |   0  |   0  |  XXXXXXXX |         |
    |  0   |  0   |  0   |  0   |   0  |   1  |  X.X.X.X. | illegal |
    |  0   |  0   |  0   |  0   |   1  |   0  |  .X.X.X.X | illegal |
    |  0   |  0   |  0   |  0   |   1  |   1  |  ........ | illegal |
    |  0   |  0   |  0   |  1   |   0  |   0  |  XX..XX.. | illegal |
    |  0   |  0   |  0   |  1   |   0  |   1  |  X...X... | illegal |
    |  0   |  0   |  0   |  1   |   1  |   0  |  .X...X.. | illegal |
    |  0   |  0   |  0   |  1   |   1  |   1  |  ........ | illegal |
    |  0   |  0   |  1   |  0   |   0  |   0  |  ..XX..XX | illegal |
    |  0   |  0   |  1   |  0   |   0  |   1  |  ..X...X. | illegal |
    |  0   |  0   |  1   |  0   |   1  |   0  |  ...X...X | illegal |
    |  0   |  0   |  1   |  0   |   1  |   1  |  ........ | illegal |
    |  0   |  0   |  1   |  1   |   0  |   0  |  ........ | illegal |
    |  0   |  0   |  1   |  1   |   0  |   1  |  ........ | illegal |
    |  0   |  0   |  1   |  1   |   1  |   0  |  ........ | illegal |
    |  0   |  0   |  1   |  1   |   1  |   1  |  ........ | illegal |
    |  0   |  1   |  0   |  0   |   0  |   0  |  XXXX.... |         |
    |  0   |  1   |  0   |  0   |   0  |   1  |  X.X..... | illegal |
    |  0   |  1   |  0   |  0   |   1  |   0  |  .X.X.... | illegal |
    |  0   |  1   |  0   |  0   |   1  |   1  |  ........ | illegal |
    |  0   |  1   |  0   |  1   |   0  |   0  |  XX...... |         |
    |  0   |  1   |  0   |  1   |   0  |   1  |  X....... |         |
    |  0   |  1   |  0   |  1   |   1  |   0  |  .X...... |         |
    |  0   |  1   |  0   |  1   |   1  |   1  |  ........ | illegal |
    |  0   |  1   |  1   |  0   |   0  |   0  |  ..XX.... |         |
    |  0   |  1   |  1   |  0   |   0  |   1  |  ..X..... |         |
    |  0   |  1   |  1   |  0   |   1  |   0  |  ...X.... |         |
    |  0   |  1   |  1   |  0   |   1  |   1  |  ........ | illegal |
    |  0   |  1   |  1   |  1   |   0  |   0  |  ........ | illegal |
    |  0   |  1   |  1   |  1   |   0  |   1  |  ........ | illegal |
    |  0   |  1   |  1   |  1   |   1  |   0  |  ........ | illegal |
    |  0   |  1   |  1   |  1   |   1  |   1  |  ........ | illegal |
    |  1   |  0   |  0   |  0   |   0  |   0  |  ....XXXX |         |
    |  1   |  0   |  0   |  0   |   0  |   1  |  ....X.X. | illegal |
    |  1   |  0   |  0   |  0   |   1  |   0  |  .....X.X | illegal |
    |  1   |  0   |  0   |  0   |   1  |   1  |  ........ | illegal |
    |  1   |  0   |  0   |  1   |   0  |   0  |  ....XX.. |         |
    |  1   |  0   |  0   |  1   |   0  |   1  |  ....X... |         |
    |  1   |  0   |  0   |  1   |   1  |   0  |  .....X.. |         |
    |  1   |  0   |  0   |  1   |   1  |   1  |  ........ | illegal |
    |  1   |  0   |  1   |  0   |   0  |   0  |  ......XX |         |
    |  1   |  0   |  1   |  0   |   0  |   1  |  ......X. |         |
    |  1   |  0   |  1   |  0   |   1  |   0  |  .......X |         |
    |  1   |  0   |  1   |  0   |   1  |   1  |  ........ | illegal |
    |  1   |  0   |  1   |  1   |   0  |   0  |  ........ | illegal |
    |  1   |  0   |  1   |  1   |   0  |   1  |  ........ | illegal |
    |  1   |  0   |  1   |  1   |   1  |   0  |  ........ | illegal |
    |  1   |  0   |  1   |  1   |   1  |   1  |  ........ | illegal |
    |  1   |  1   |  0   |  0   |   0  |   0  |  ........ | illegal |
    |  1   |  1   |  0   |  0   |   0  |   1  |  ........ | illegal |
    |  1   |  1   |  0   |  0   |   1  |   0  |  ........ | illegal |
    |  1   |  1   |  0   |  0   |   1  |   1  |  ........ | illegal |
    |  1   |  1   |  0   |  1   |   0  |   0  |  ........ | illegal |
    |  1   |  1   |  0   |  1   |   0  |   1  |  ........ | illegal |
    |  1   |  1   |  0   |  1   |   1  |   0  |  ........ | illegal |
    |  1   |  1   |  0   |  1   |   1  |   1  |  ........ | illegal |
    |  1   |  1   |  1   |  0   |   0  |   0  |  ........ | illegal |
    |  1   |  1   |  1   |  0   |   0  |   1  |  ........ | illegal |
    |  1   |  1   |  1   |  0   |   1  |   0  |  ........ | illegal |
    |  1   |  1   |  1   |  0   |   1  |   1  |  ........ | illegal |
    |  1   |  1   |  1   |  1   |   0  |   0  |  ........ | illegal |
    |  1   |  1   |  1   |  1   |   0  |   1  |  ........ | illegal |
    |  1   |  1   |  1   |  1   |   1  |   0  |  ........ | illegal |
    |  1   |  1   |  1   |  1   |   1  |   1  |  ........ | idle    |
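The table above follows mechanically from stacking one high-enable/address pair per doubling of the bus.  Here is a Python sketch that generalizes the decode to any number of levels.  All names are hypothetical, and "conventional" is my shorthand for a single naturally aligned run of 1, 2, 4, or 8 active lanes, which is how I read the unmarked rows.

```python
def lane_pattern(pairs):
    """Expand (high_enable_n, address_bit) pairs, widest level first,
    into a lane string with the highest byte lane on the left."""
    enables = [0]                       # one active-low enable, asserted
    for high_enable_n, address_bit in pairs:
        # Each group splits in two: the upper half is gated by the
        # high-enable, the lower half by the matching address bit.
        enables = [enable | gate
                   for enable in enables
                   for gate in (high_enable_n, address_bit)]
    return "".join("X" if enable == 0 else "." for enable in enables)


def is_conventional(pattern):
    """True for a single aligned run of 1, 2, 4, ... active lanes."""
    count = pattern.count("X")
    if count == 0 or count & (count - 1):
        return False                    # empty, or not a power of two
    if pattern.strip(".") != "X" * count:
        return False                    # active lanes are not contiguous
    lowest_lane = len(pattern) - 1 - pattern.rindex("X")
    return lowest_lane % count == 0     # run must be naturally aligned


def all_patterns(levels):
    """Yield the lane pattern for every combination of 2*levels signals."""
    for code in range(1 << (2 * levels)):
        bits = [(code >> shift) & 1 for shift in reversed(range(2 * levels))]
        yield lane_pattern(list(zip(bits[0::2], bits[1::2])))


legal = [pattern for pattern in all_patterns(3) if is_conventional(pattern)]
print(len(legal))   # 15: eight bytes, four half-words, two words, one 64-bit
```

Passing 2 instead of 3 to `all_patterns` reproduces the 32-bit case, which has seven such patterns.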

    "Illegal" doesn't necessarily imply wrong or useless; however, such states
    will not be generated by a conventional processor.

    "Idle" means a quiescent bus state, where no data transfers are occurring.

## Adapting Narrow to Wide Buses

By now, you've gotten a reasonable understanding of how the high-enable method
works.  Pragmatically, it doesn't really offer that much benefit for conserving
pins on interfaces narrower than 64 bits.  What if we wanted to map, say, an
8-bit device into a 16-bit memory space?

Let's once again put on our magical pretend hat, and imagine you're building
the Commodore VIC-20.  It's a 6502-based system which uses a 6522 chip for I/O.
This chip has 16 registers, and being an 8-bit machine, it makes sense to lay
them out adjacently.  So, you connect the chip to the address bus like so:

    |      |
    |  RS0 |------[ A0
    |  RS1 |------[ A1
    |  RS2 |------[ A2
    |  RS3 |------[ A3
    |      |
    |  CS1 |------o +Vcc
    | CS2# |------[ VIASEL# (asserted when CPU touches memory between $9110..$911F)
    |      |
    |   D7 |------[ D7
    |   D6 |------[ D6
    |   D5 |------[ D5
    |   D4 |------[ D4
    |   D3 |------[ D3
    |   D2 |------[ D2
    |   D1 |------[ D1
    |   D0 |------[ D0
    |      |

This results in the VIA's registers being laid out at consecutive addresses in
memory.

    +------+
    |  R0  |    $9110
    +------+
    |  R1  |    $9111
    +------+
    |  R2  |    $9112
    +------+
    | .... |
    +------+
    | R15  |    $911F
    +------+

Now, some years later, Commodore executives approach you with an assignment to
make a successor to the VIC-20.  Your manager pulls you aside, and says, "You
see, IBM's PC (with its 8-bit bus) was recently made obsolete by the IBM
PC/AT, which has a 16-bit data path to its internal components, doubling its
performance."  In this alternate-universe Commodore, you've been given the
assignment to do to the VIC-20 what IBM did to the PC to yield the PC/AT.

Now, in our own universe, Commodore would have just routed a 16-bit memory
path to the CPU, and left the I/O chips hanging off the bus with either A0 or
BHE# to (directly or indirectly) serve as a chip select, like this:

    |      |
    |  RS0 |------[ A1
    |  RS1 |------[ A2
    |  RS2 |------[ A3
    |  RS3 |------[ A4
    |      |
    |  CS1 |------[ A0
    | CS2# |------[ VIASEL# (qualified by BHE#)
    |      |
    |   D7 |------[ D15
    |   D6 |------[ D14
    |   D5 |------[ D13
    |   D4 |------[ D12
    |   D3 |------[ D11
    |   D2 |------[ D10
    |   D1 |------[ D9
    |   D0 |------[ D8
    |      |

Notice that we have to shift the address pins by one in order to let A0 select
the byte lane.  What once took 16 bytes of address space now takes 32 bytes.
This would have the effect of spreading the registers out in memory, like so:

       +1     +0
    +------+------+
    |  R0  |  ??  | $9110
    +------+------+
    |  R1  |  ??  | $9112
    +------+------+
    |  R2  |  ??  | $9114
    +------+------+
    | .... |  ??  |
    +------+------+
    | R15  |  ??  | $912E
    +------+------+

Now, you could reclaim address space by just putting another VIA on the lower
data pins and gating it off of A0 and BHE# in the opposite manner; but, the
point here is that all the software written for the VIC-20 would now be broken
because the layout of the VIAs has changed.  The machine may be functionally
compatible with its predecessor, but it is no longer binary compatible.
And, that was what made the PC/AT so successful -- binary compatibility.

TRIVIA: This is how the Commodore-Amiga addresses its CIA chips, and why
they're referred to as the "Odd" and "Even" CIA chips.

The goal is to lay the VIA chip out like so:

       +1     +0
    +------+------+
    |  R1  |  R0  | $9110
    +------+------+
    |  R3  |  R2  | $9112
    +------+------+
    |  R5  |  R4  | $9114
    +------+------+
    | .... | .... |
    +------+------+
    | R15  | R14  | $911E
    +------+------+

The data bus must be connected to D7-D0 for even addresses or to D15-D8 for odd
addresses.  We can use bidirectional bus transceivers for this, or 4066
bilateral switches, or whatever is fast enough to do the job.  As long as we
can guarantee that exactly one of A0 or BHE# is asserted low, we can use a
circuit like the following:

    |      |
    |  RS0 |------[ A0
    |  RS1 |------[ A1
    |  RS2 |------[ A2
    |  RS3 |------[ A3            VIABHE#
    |      |                        ---
    |  CS1 |------o +5V              |
    | CS2# |------[ VIASEL#          o
    |      |                   +----------+
    |   D7 |------*------------| A7    B7 |-----[ D15
    |   D6 |------|*-----------| A6    B6 |-----[ D14
    |   D5 |------||*----------| A5    B5 |-----[ D13
    |   D4 |------|||*---------| A4    B4 |-----[ D12
    |   D3 |------||||*--------| A3    B3 |-----[ D11
    |   D2 |------|||||*-------| A2    B2 |-----[ D10
    |   D1 |------||||||*------| A1    B1 |-----[ D9
    |   D0 |------|||||||*-----| A0    B0 |-----[ D8
    |      |      ||||||||     +----------+
    |      |      ||||||||     +----------+
    |      |      `------------| A7    B7 |-----[ D7
    |      |       `-----------| A6    B6 |-----[ D6
    |      |        `----------| A5    B5 |-----[ D5
    |      |         `---------| A4    B4 |-----[ D4
    |      |          `--------| A3    B3 |-----[ D3
    |      |           `-------| A2    B2 |-----[ D2
    |      |            `------| A1    B1 |-----[ D1
    |      |             `-----| A0    B0 |-----[ D0
    |      |                   +----------+
    |      |                         o
    |      |                         |
    |      |                        ---
    |      |                      VIABLE#

    where VIABHE# = VIASEL# OR BHE# and VIABLE# = VIASEL# OR A0

As you can see, it's fairly easy to adapt an 8-bit device to a 16-bit data
path.  We can even generalize this and apply multiple layers of data routing
logic, each controlled by its own pair of address bit and high-enable.  Each
layer takes care of half of the bus at a time.  So, closest to the device, you
have an 8-to-16 bit adapter, then a 16-to-32-bit adapter, etc.
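To check my own reasoning about the 8-to-16-bit steering circuit, here is a small behavioral model in Python.  The name `via_steering` is invented, and the address decode for VIASEL# is an assumption based on the $9110..$911F range mentioned earlier; the two transceiver enables are the OR equations given with the schematic.

```python
VIA_BASE = 0x9110  # assumed: VIASEL# covers $9110..$911F


def via_steering(address, bhe_n):
    """Model the steering circuit for one bus cycle.

    Returns (register, lane) when exactly one transceiver is enabled,
    or None when the VIA is not driven at all.
    """
    a0 = address & 1
    viasel_n = 0 if VIA_BASE <= address < VIA_BASE + 16 else 1
    viabhe_n = viasel_n | bhe_n   # upper transceiver: VIA <-> D15-D8
    viable_n = viasel_n | a0      # lower transceiver: VIA <-> D7-D0
    if viabhe_n and viable_n:
        return None
    if viabhe_n == 0 and viable_n == 0:
        raise ValueError("both transceivers enabled: bus contention")
    register = address & 0xF      # RS3-RS0 are wired straight to A3-A0
    return register, "D15-D8" if viabhe_n == 0 else "D7-D0"


print(via_steering(0x9110, bhe_n=1))  # register 0 on the low lane
print(via_steering(0x9111, bhe_n=0))  # register 1 on the high lane
```

A full 16-bit access (A0 and BHE# both low) raises an error in this model, because the real circuit would have both transceivers fighting over the VIA's data pins.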

Implementing this data steering logic with discrete byte lane selects is less
convenient, in part because you must reconstruct the missing address
bits.  I'll leave it as an exercise to the reader to figure out how you'd build
the VIC-PS/20 32-bit update while still retaining 100% backward compatibility.
(Hint: it's not much more complex than the above circuit; but it does
require more logic.)

The only thing that is missing here is support for accessing the 8-bit VIA
device as though it were a 16-bit (or wider) device.  Yes, we've restored its
memory layout, and legacy 8-bit software should now run fine; but, since you
cannot have both VIABHE# and VIABLE# enabled at the same time without bus
contention at the VIA chip interface, all 16-bit accesses must be broken up
into discrete 8-bit accesses.  This is something that an external state machine
(often called a "bus controller" or "bridge") that sits between the CPU and the
VIA can do on behalf of the CPU.  Indeed, something like this is exactly how
8-bit devices remain supported on 16-bit ISA slots.  If you ever hear of the
PC's "Chip Set", that would refer to, at least in part, its bus controller
logic.
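The bookkeeping such a bridge has to do can be sketched in a few lines of Python.  This is only a model of the sequencing, not of any real chipset; the name `split_access` is hypothetical, and I'm assuming the bridge walks the bytes in ascending address order.

```python
def split_access(address, size):
    """Break a wide access into the byte cycles an 8-bit device can handle.

    Returns a list of (byte_address, BHE#, A0) tuples, one per cycle.
    Each cycle asserts exactly one of A0 or BHE#, so only one lane is live.
    """
    cycles = []
    for offset in range(size):
        byte_address = address + offset
        a0 = byte_address & 1
        bhe_n = a0 ^ 1            # single-byte cycle: BHE# is NOT A0
        cycles.append((byte_address, bhe_n, a0))
    return cycles


# A 16-bit access to $9110 becomes two byte cycles, one per lane.
for cycle in split_access(0x9110, 2):
    print(hex(cycle[0]), cycle[1], cycle[2])
```

A real bridge would also need to hold the CPU off with wait states while the extra cycles run, and latch the first byte of a read until the second one arrives.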

## Conclusion

Is there really a conclusion to this whole brain-dump?  I'm not sure there is.

I can say that I have a new appreciation for Intel's thought process when they
specified the 8086 and 80286 bus interfaces.  I also feel I contributed
something somewhat new-ish when coming up with the idea of applying
high-enables to larger units of the data path, not just bytes.  I really don't
think I'm 100% original though; I find it really hard to think that this is
some novel way of encoding byte lane activity.  Yet, in my Googling and
Duck-Duck-Going around, I haven't found any evidence of this approach being
used before.

This method needs a name.  I don't know what to call it.  Looking at the lane
select charts above, it looks like there's almost a fractal or wavelet nature
to it.  Should we just call it fractal or wavelet addressing?  I'm still not
sure that those terms fit quite right though.  For now, I'll just use the
phrase "high-enable encoding."

High-enable encoding is not the simplest possible approach to solving the
problem of telling external logic where to put data; but, it's not
terrible either.  For memory devices, explicit lane selects will be
needed no matter what.  For narrow peripherals densely mapped onto the wider
data path, you're going to need data steering logic.  At least under
initial scrutiny, it looks like this steering logic is simplified with
high-enable encoding.  And, regardless of whether you use high-enables or
explicit lane selects, addressing narrow devices compactly will require a bus
controller or bridge of some kind in order to break up large transfers into
units the peripheral can handle.  If you don't care about backward
compatibility, though, placing narrow devices in the address space sparsely
hardly uses any logic at all no matter what approach is used.