
Gripes With Pipes (My non-problems with the shell)

2021-09-09

A key feature of the POSIX shell is the pipe, allowing the output of one process to be "piped" into the input of another. This is extremely powerful, but sadly has several flaws that led me to start writing my own alternative shell-alike. Rather typically, while researching for this project I found that almost every one of my "gripes with pipes" was totally unfounded, so I have documented my discoveries here in the hope that I can save others the hours I wasted trying to best the POSIX shell.

This post will involve a fair amount of POSIX shell scripting and C programming, but I've explained all of it as best I can so if you have at least a surface level understanding of these things then you should be fine.

shtanton's 4 gripes with pipes

Gripe 1: I can only pass newline delimited strings

Every stdio-based Unix utility I've ever used both takes as input and gives as output a series of lines. I think this is because if nothing is being piped into or out of a process, then it uses the tty, which takes lines of input from the user and displays lines of output. In reality, since stdin and stdout are both byte streams, any data that can be represented as bytes (so any data at all) can be passed through a pipe. The example below demonstrates this. Note that the 0 and 1 in its calls to read and write are file descriptors referring to stdin and stdout respectively. A file descriptor is just a number that a process uses to refer to one of its open files in POSIX.

main.c: Define a person "john" and write the data to stdout (file descriptor 1)

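#include <unistd.h>
#include <string.h>
#include "types.h"

int main(void) {
	/* Zero the struct first so no uninitialised bytes end up in the pipe */
	struct Person john = {0};
	strcpy(john.first_name, "John");
	strcpy(john.last_name, "Smith");
	john.age = 21;
	/* Write the raw bytes of the struct to stdout (fd 1) */
	if (write(1, &john, sizeof(struct Person)) != (ssize_t)sizeof(struct Person)) {
		return 1;
	}
	return 0;
}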

receiver.c: Read a person from stdin (file descriptor 0) and display their full name and age

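#include <unistd.h>
#include <stdio.h>
#include "types.h"

int main(void) {
	struct Person person;
	/* Read the raw bytes of one struct from stdin (fd 0) */
	if (read(0, &person, sizeof(struct Person)) != (ssize_t)sizeof(struct Person)) {
		fprintf(stderr, "receiver: did not get a whole Person\n");
		return 1;
	}
	printf("%s %s %d\n", person.first_name, person.last_name, person.age);
	return 0;
}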

types.h: A person has a first name, last name and age

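#ifndef TYPES_H
#define TYPES_H

/* Fixed-size fields, so the struct can be copied around as raw bytes */
struct Person {
	char first_name[30];
	char last_name[30];
	signed char age;
};

#endif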

Then in my shell I can compile these and run:

$ ./main | ./receiver
John Smith 21

This is an example of a C struct containing 2 strings and a byte integer being passed between processes using a pipe, and there's no reason we couldn't go further. There are already utilities that communicate with JSON over stdio, but we could send video/audio streams, cap'n proto data or anything else we want. These probably exist so email me (shtanton at shtanton.xyz) if you know of any :)

Gripe 2: Each process can only have 1 input stream and 2 output streams

Every process has stdin (fd 0), stdout (fd 1) and stderr (fd 2). So those are your 3 options for io, right? Well, other file descriptors are available, but we need some more shell syntax to use them effectively:

command 5> file

This opens file and gives it file descriptor 5, which means when we write to 5 in command, it will write to file. This gives us an additional 7 inputs/outputs that our processes can make use of (since that shell syntax only goes up to 9).

main.c: We have two people, their first names get written to stdout, while their last names get written to file descriptor 3

#include <unistd.h>
#include <string.h>

#include "types.h"

int main() {
	struct Person people[2];

	strcpy(people[0].first_name, "Joe");
	strcpy(people[0].last_name, "Bloggs");
	people[0].age = 21;

	strcpy(people[1].first_name, "John");
	strcpy(people[1].last_name, "Smith");
	people[1].age = 40;

	for (int i = 0; i < 2; i++) {
		write(1, people[i].first_name, strlen(people[i].first_name));
		write(1, "\n", 1);
		write(3, people[i].last_name, strlen(people[i].last_name));
		write(3, "\n", 1);
	}
	return 0;
}

$ ./main > first_names 3> last_names
$ tee < first_names
Joe
John
$ tee < last_names
Bloggs
Smith

We can also use the extra file descriptors as inputs with 3< and the like. This lets us input/output to several files, and one of them we could pipe into a further command with the help of redirections (that 3>&1 monstrosity is a redirection, which I won't explain here):

$ ./main 3>&1 > first_names | sort
Bloggs
Smith
$ tee < first_names
Joe
John

But how can both outputs become inputs for other commands?

First In First Out (fifo)

A fifo does what it says on the tin. You write bytes to it, and the bytes are read in the order they were written. They are also known as named pipes which hints at what we can use them for:

$ mkfifo /tmp/first_names
$ mkfifo /tmp/last_names
$ ./main > /tmp/first_names 3> /tmp/last_names & sort < /tmp/first_names > first_names & sort < /tmp/last_names > last_names

These commands create two fifos in the /tmp directory and use one for each output of main, then assign each to be the input of an instance of sort. This achieves our goal of a process with 2 outputs, each going into a different process. Unfortunately, it's quite a lot of typing to go down this route, and while some shells have added syntax to alleviate this (process substitution in bash, zsh and ksh, for example), POSIX doesn't define a better method.

At this point all of our reads and writes are to files. This is why most programs that need more than one input or output take files as arguments, for example tee, which I'll explain in a moment. I should also add that since stderr is generally for debugging, I would not recommend using it for serious data output, and certainly not for anything that isn't a string.

Gripe 3: Cyclic data dependency is impossible

Now that we have fifos in our toolbelt, cyclic data dependency is easy:

mkfifo /tmp/fifo
echo yes fifo > /tmp/fifo & tee /tmp/fifo < /tmp/fifo

This makes a fun version of the yes utility (which loops forever, repeatedly printing the same string). The tee command copies its input to stdout and also to every file named in its arguments. Here it takes its input from /tmp/fifo and writes both to the tty and back into /tmp/fifo, giving us the cycle we wanted.

Gripe 4: Byte streams are the only form of data passing

POSIX and System V (two sets of standards that many operating systems adhere to) both define 3 types of IPC (inter-process communication): message queues, shared memory and semaphores.

But in the shell we only have access to byte streams. How come? Well...

Message queues

Message queues and byte streams are very similar; in fact a byte stream could easily act as a message queue. If the messages are all the same size, they can be sent and received without any encoding at all. If the length varies, each message can be prefixed with an 8 byte length, which the receiver reads first to learn how big the rest of the message is.

Shared memory

Shared memory is more performant and more natural than message passing for some tasks, but in my experience there aren't that many of these and they can often be very easily translated into a message passing style. For example, if a small amount of data is being shared, then the data could be resent every time it changes; if it is a larger amount, then a message could indicate a part of the data and what to set it to. Shared memory opens up a lot of potential for data races without careful consideration, so I don't think it belongs on the shell.

Semaphores

Semaphores can be very easily simulated by sending a single byte one way for up and the other way for down. This is less performant but that's not really a problem on the shell.

Byte streams aren't always the most natural or the fastest solution to a problem, but they work very well most of the time and they are intuitive, easy to reason with, versatile and robust. I don't think the shell needs anything else.

Conclusion

Make fifos less verbose and I would consider the shell perfect for data passing!

Feel free to email shtanton at shtanton.xyz with any feedback or if you just want to start an email conversation with someone very opinionated about programming and operating systems.