# On Evolving Channel I/O Concepts Towards Offering System Services

So, I'm kind of thinking that the idea of operating systems with lots of system
calls was an "effective mistake."  It's a natural outgrowth of the subroutine
library, which itself evolved in a time before operating systems even existed.
They work insofar as they do; however, they impose a kind of synchronous
bottleneck on the kinds of operations one can perform.  Asynchronous operations,
which are becoming more and more important as time goes on, tend to have
really complicated semantics, interface requirements, or both.

Not only that, but user/kernel space transitions tend to be more costly than
normal subroutine calls.  Even on modern, high-performance CPUs where the
processing cost of this transition is effectively zero (such as what you'd find
in RISC-V processors or modern x86 CPUs with a SYSCALL instruction), you still
have the software overhead of copying buffers between user space and kernel
space.  Earlier processors had it much worse.  Even a simple subroutine call
can be rather expensive.  For instance, a call and its matching return consume
12 clock cycles on a 6502 (JSR/RTS) and between 21 (RST) and 27 (CALL) clock
cycles on a Z80.  You can tangibly see the effect this has by looking at old
IBM PC software, where console I/O was performed using character-at-a-time
BIOS calls, resulting in visibly slow, line-at-a-time screen updates, whereas
video games, which bypassed the BIOS and wrote to video memory directly, were
significantly faster.

One work-around for this cost is to perform operations in bulk.  Early
operating systems, for example Commodore's KERNAL or the IBM PC's BIOS, tended
to perform I/O operations one character at a time.  Unix-style read/write
operations, which transfer an entire buffer at a time, amortize the cost of a
synchronous I/O operation over the size of the buffer transferred.  But, you
are still restricted to transferring at most one buffer's worth of information
at a time.  And, for devices which produce significantly smaller units of data,
the overhead can still rack up really fast.  Moreover, handling multiple input
devices at a time remains impossible without resorting to some obnoxious
programming techniques.

Another disadvantage of this interface is that it's very much a leaky
abstraction.  A system call is, well, a subroutine call, which means the caller
and callee have to agree on an ABI.  You know that there's some amount of
software running on the same computer your code is running on, designed to
process your request and return the results.  The transition from unprotected
system calls (e.g., Commodore KERNAL) to protected (or, at least,
protected-capable; e.g., MS-DOS or Windows) is frequently enough to break
software as applications are ported from platform to platform.

We can imagine a different way of invoking system services, though.  One which
is independent of microkernel or monolithic kernel architectures, single- or
multi-processor systems, or synchronous or asynchronous I/O requests.  One
which has the potential, at least, to amortize the cost of a system call across
many requests at once, even on inefficient, slower processors.  System service
providers, however they may be implemented, can respond to requests in order if
they want, or they can dole out tasks to other background processors for
eventual completion.  As long as a simple request/response protocol is
understood and obeyed by "the kernel" and application alike, none of these
implementation issues matter.

I am, of course, suggesting we follow the same basic interface as one would use
to talk to an NVMe storage device: a request (or command) queue to submit one
or more operations for the kernel to process, and a result queue to inform the
program when requests have been completed and what the results of those
operations were.

While the flow of control is more complicated than with Unix-style kernel
calls, the process itself is simple nonetheless.

1. The application fills in one or more request queue entry structures.  This
   tells the kernel what operation(s) to perform and in what order, along with any
   data necessary to complete the operation.  The application does this for as
   many requests as are required, or for however many slots remain available in
   the queue, whichever is smaller.  All pointers to ancillary data structures
   are specified in application address space terms.

2. The application then makes a single system call.  This causes the kernel
   to start popping entries off the request queue as services are rendered.
   The kernel
   may optimize processing of these requests however it sees fit.  It might return
   right away after having scheduled asynchronous background operations; or, it
   might process them one by one and not return until each and every one has been
   completed.

3. To check completion, the application then monitors its result queue.  As
   services complete, the kernel will queue corresponding response entries onto
   the result queue.  As long as interrupts are enabled, the program can detect
   when one or more new result queue entries become available by performing a
   simple pointer comparison.  Alternatively, a kernel may be configured to invoke
   a callback routine for one or more requests.

4. The application can then dispatch notifications to various parts of its
   program in response to the service completions in the order received.

Applications and the system software agree to cooperate via a single shared
data structure, which might look something like this:

	struct SystemQueue {
		void *   base;   /* first entry in the queue */
		uint8_t  bound;  /* queue capacity, in entries */
		uint8_t  head;   /* index of the next entry to consume */
		uint8_t  tail;   /* index of the next free slot */
	};

	struct SystemChannel {
		struct SystemQueue request;
		struct SystemQueue result;
	};

Under normal conditions, the application only ever touches request.tail and
result.head, while the system only touches request.head and result.tail.
Both parties follow the same discipline: write (or read) an entry first, and
only then update the head or tail field as appropriate.  For
single-processor, single-threaded environments (which is almost exclusively the
norm with 8-bit CPUs which lack any form of complex locking facilities), this
is sufficient to ensure atomicity.  It also ensures upward compatibility to
multi-processor systems later: if a kernel is updated to become multithreaded,
for example, it may keep a lock on its side transparently, without application
knowledge.  From the app's point of view, a single kernel is updating its side
of the queue.  Likewise, if the application becomes multithreaded later on, it
may incorporate its own lock as well.

As long as applications and the system keep to their own halves of the queue,
an application can be enqueueing a new request while the system is reading a
pending request at the same time.  Likewise, an application could be dequeueing
a result while the system is enqueueing a new result at the same time.  As long
as the head and tail are updated atomically (which is guaranteed since they're
only a single byte wide), there ought to be no need for further synchronization
primitives.
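
To make this discipline concrete, here is a minimal sketch of the
application's side of the protocol, written in C.  It assumes that bound holds
the number of entries in a queue and that entries are fixed-size and stored
contiguously starting at base; the helper names are my own invention, not part
of any real kernel.

	/* Application-side queue helpers: a sketch under the assumptions above. */
	#include <stdint.h>
	#include <string.h>

	/* Enqueue a request: write the entry first, then publish it by bumping
	   tail.  Returns zero if the queue is full. */
	int request_enqueue(struct SystemChannel *ch, const void *entry, uint8_t entry_size) {
		struct SystemQueue *q = &ch->request;
		uint8_t next = (uint8_t)((q->tail + 1) % q->bound);

		if (next == q->head)
			return 0;               /* no free slots */
		memcpy((uint8_t *)q->base + (size_t)q->tail * entry_size, entry, entry_size);
		q->tail = next;                 /* single-byte store, atomic on these targets */
		return 1;
	}

	/* Dequeue a result: a simple head/tail comparison tells us whether the
	   kernel has posted anything new.  Returns zero if the queue is empty. */
	int result_dequeue(struct SystemChannel *ch, void *entry, uint8_t entry_size) {
		struct SystemQueue *q = &ch->result;

		if (q->head == q->tail)
			return 0;               /* nothing pending */
		memcpy(entry, (uint8_t *)q->base + (size_t)q->head * entry_size, entry_size);
		q->head = (uint8_t)((q->head + 1) % q->bound);
		return 1;
	}

The important detail is the ordering: the entry is copied before the index that
exposes it is updated, which is exactly the one-writer-per-field rule described
above.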

Since each request implies one result, it follows that the result queue should
be at least as large as the request queue, in terms of number of elements.  If
your application is diligent about draining its replies, though, you can
probably get away with a smaller result queue.

When a program starts up for the first time, it might be helpful to have a
default set of request/result queues given to it; however, since the queues
are application-managed, wherein all pointers are expressed in application
address space terms, it follows that the application itself can (re-)allocate
and move the queues as required.  For processors with small address spaces
(64KB or less), there might not be enough resources for flexible dynamic memory
management.  Consider an operating system like GEOS on the 6502 processor
family, which only leaves about 24KB of space available for applications.  So,
it's better if the application manages the size of its own queues and just
tells the kernel where they're located and how big they are.  (This is one of
the few synchronous system calls that does exist under this scheme.  Since this
is expected to happen only once or twice during a program's initialization, it
is not expected to incur an onerous performance hit.)
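
As a sketch of that one synchronous call, assuming a hypothetical
SetSystemChannel() entry point (the name, signature, and queue sizes here are
mine, not any real kernel's), queue setup might look like this:

	/* Hypothetical channel setup, performed once at program start. */
	#include <stdint.h>

	#define QUEUE_DEPTH 16     /* entries per queue */
	#define ENTRY_SIZE  16     /* bytes per entry; matches the request layout below */

	static uint8_t request_storage[QUEUE_DEPTH * ENTRY_SIZE];
	static uint8_t result_storage[QUEUE_DEPTH * ENTRY_SIZE];
	static struct SystemChannel channel;

	/* Assumed kernel entry point: registers the channel's location and size. */
	extern void SetSystemChannel(struct SystemChannel *ch);

	void init_channel(void) {
		channel.request.base  = request_storage;
		channel.request.bound = QUEUE_DEPTH;
		channel.request.head  = channel.request.tail = 0;

		channel.result.base  = result_storage;
		channel.result.bound = QUEUE_DEPTH;
		channel.result.head  = channel.result.tail = 0;

		/* All pointers are in application address space terms; the kernel
		   only needs to know where the queues live and how big they are. */
		SetSystemChannel(&channel);
	}

If the application later decides to move or resize its queues, it can allocate
new storage and make the same call again.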

For example, let's consider the case of determining the size of a file.
Pretending for the moment that we have a set of operations to open a file,
close a file, seek to a point in a file, and to report the current file pointer
location, we can synthesize a single "batch" operation composed of these
smaller primitives.  In a Unix-like environment, we might write the following:

1. Call open() to open the file.
2. Check for errors.
3. Call seek() to reposition the file pointer to the end of the file.
4. Check for errors.
5. Call tell() to report the position of the file pointer.
6. Check for errors.
7. Call close() on the file.

None of this is particularly difficult, since the program steps operate
synchronously anyway.  However, if you are determining the sizes of, say,
several thousand files, this can take an appreciable amount of time.
Maintaining a responsive user interface while this is happening can be
difficult or even impossible.  With a queue-based system call interface, on the
other hand, we might write the code like so:

1. Enqueue open file request.
2. Chain (enqueue with the dependency that the previous request succeeds) a seek to end of file request.
3. Chain a report file position request.
4. Chain a close file request.
5. Invoke the kernel once to begin processing these requests.
6. Wait for all four responses in the result queue.  Handle UI-related events as they arrive.
7. Check the 3rd result; make sure it's not an error report.
   a. If it's OK, the position reported will be the file size.
   b. If it is an error, then check the earlier results (starting with the 1st) to see where the chain broke.
8. Dequeue the results.

Each request structure might look like this:

	struct RequestEntry {
		uint16_t args[8];  /* args[0] could hold the provider/opcode; the rest are operation-specific */
	};

The first argument args[0] might, for instance, be broken up into two 8-bit
subfields: a provider and an opcode.  The provider would specify which
subsystem is responsible for handling the request: the DOS/VFS for disk I/O,
the GUI for graphics output, etc.  The opcode is, then, provider-specific.
This is similar to how MacOS toolboxes and Tock:OS capsules work, for example.
Metadata and data buffer references can be packed into subsequent arguments as
required for the function requested.

Result entries would have a similar structure layout, but their interpretation
is adapted to reporting the results of an operation.
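
Putting the pieces together, the file-size batch from earlier might be
assembled like so.  Everything beyond the structures above is made up for the
sake of illustration: the provider and opcode numbers, the FLAG_CHAIN bit, and
the InvokeSystemChannel() call that kicks the kernel (step 2 of the flow
described earlier) are all hypothetical, and the sketch reuses the
request_enqueue() helper from before.

	#include <stdint.h>
	#include <string.h>

	/* Hypothetical opcode assignments for the DOS/VFS provider. */
	#define PROVIDER_DOS  0x01
	#define OP_OPEN       0x01
	#define OP_SEEK_END   0x02
	#define OP_TELL       0x03
	#define OP_CLOSE      0x04
	#define FLAG_CHAIN    0x0001   /* only run if the previous request succeeded */

	/* Assumed kernel entry point: begin processing whatever has been enqueued. */
	extern void InvokeSystemChannel(void);

	static void fill(struct RequestEntry *r, uint8_t provider, uint8_t opcode, uint16_t flags) {
		memset(r, 0, sizeof *r);
		r->args[0] = (uint16_t)((provider << 8) | opcode);  /* two 8-bit subfields */
		r->args[1] = flags;
	}

	void request_file_size(struct SystemChannel *ch, const char *name) {
		struct RequestEntry r;

		fill(&r, PROVIDER_DOS, OP_OPEN, 0);
		r.args[2] = (uint16_t)(uintptr_t)name;  /* pointer, in application address space terms */
		request_enqueue(ch, &r, sizeof r);

		fill(&r, PROVIDER_DOS, OP_SEEK_END, FLAG_CHAIN);
		request_enqueue(ch, &r, sizeof r);

		fill(&r, PROVIDER_DOS, OP_TELL, FLAG_CHAIN);
		request_enqueue(ch, &r, sizeof r);

		fill(&r, PROVIDER_DOS, OP_CLOSE, FLAG_CHAIN);
		request_enqueue(ch, &r, sizeof r);

		/* The single system call from step 2: everything else is just memory. */
		InvokeSystemChannel();
	}

Waiting for the completions is then just a matter of draining the result
queue, handling other work whenever nothing new has arrived:

	/* A hypothetical result entry mirroring the request layout, as described
	   above, plus a simple wait loop. */
	struct ResultEntry {
		uint16_t args[8];
	};

	void wait_for_results(struct SystemChannel *ch, int expected) {
		struct ResultEntry res;

		while (expected > 0) {
			if (result_dequeue(ch, &res, sizeof res)) {
				/* dispatch to whichever part of the program cares */
				expected--;
			} else {
				/* nothing yet: handle UI events, or idle until an interrupt */
			}
		}
	}

The application would then pull the file size out of the third result, exactly
as in the numbered steps above.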

Another advantage of this approach to implementing a computing system is
portability.  Consider a kernel for the 6502 processor, and an application that
runs under it.  Later, a new computer is released with the 65816 processor,
necessitating a 16-bit OS to take advantage of its features.  If you run the
original 8-bit system software with a call-based interface, then invoking
services from the 65816's native mode requires a lot of work!  You basically
need to implement a task switcher to make it happen.  Well, if you're going to
go to that much trouble, you might as well write a more complete kernel
along the way.  But, now you've broken compatibility with legacy 6502 code that
you may still want to run (you need to provide an 8-to-16-bit shim).  By
maintaining a request queue abstraction, you are completely free to alter the
backing implementation without concern whatsoever for the CPU's current
operating mode.

You might think I'm being hyperbolic or overly hypothetical.  However, this
kind of interaction has evolved several times throughout computing's history.
We see it in the form of "channel I/O" on IBM mainframes, I/O processors in CDC
mainframes (where the CDC 6600's operating system is literally executed by one
of its 10 I/O processors!), and, today, in NVMe interfaces and in the VirtIO
driver interfaces of the Linux kernel.  Remember what I said about 6502 vs
65816 kernel implementations above?  This is not hypothetical; IBM already
figured this out years ago, when they replaced the channel I/O hardware on
mainframes with PowerPC-based channel emulator cards.  To this day, software
written for IBM mainframes still thinks it's talking to channel controller
hardware built in 1964 for the original IBM System/360 mainframe, only much,
much, much faster.

But, these interfaces are optimized for hardware; are there any software
equivalents?  Yes!

AmigaOS and Tripos are both fully functional operating systems almost entirely
predicated on asynchronous API requests (ARexx ports, low-level device I/O,
installable filesystem "handlers", GUI events, and so on are all asynchronous).
However, the problem with AmigaOS and Tripos is that they both make use of
dedicated system calls for enqueueing and dequeueing messages.  While
lightweight and so convenient to use you'll be spoiled forever, the message
passing overhead is still higher than simply touching memory.  This was such a
concern with AmigaOS that some of its message handling functions are
additionally implemented as C and assembly macros in amigalib.  Also, both
AmigaOS and Tripos still tend to hide the asynchronous logic behind a
synchronous library API at the systems level.

If you squint a little bit harder, you can also see traces of this in
"cache-kernel" systems which use message-based signalling (read up on how the
V++ kernel works).  Indeed, the only reason we even still have a system call at
all is because most computers lack hardware to transparently trap to a message
handler after updating a software-managed request queue.  (Debugger support,
like hardware watchpoints, might be useful here.)  Cache kernel systems with
hardware-assisted messaging features can get away without ever
explicitly "calling" into a kernel.

Closer to what was discussed above, Tock:OS's system call API maintains a
remarkably small set of system calls (smaller than most microkernels!).  The
set of primitives offered by a Tock:OS "capsule" obviously depends on the
capsule in question, but most have interfaces which amount to "enqueue this
request," while Tock:OS itself has a system call which waits until an event
happens, invoking application-specific callbacks for specific events.  The
callback model is actually quite close to how the homebrew virtual machine Uxn
works.

There are many articles calling for POSIX to be relegated to legacy status.
If we ever want to move beyond the
POSIX API and the bottlenecks it imposes on us as application developers, I
think "channel systems" are the way forward.  While not as easy to use as a
synchronous API, they are vastly simpler than existing alternatives.  The fewer
system calls there are, the better.  As far as I can tell, nothing else
delivers the bang for the simplicity-vs-scalability buck that channel-based I/O
and service requests do.