Stred Part 2: No More Tokens

2023-12-09

This post is a follow up to How not to adapt sed for JSON. It won't make loads of sense without reading that post first, but might still be interesting. In that post, I evaluated a version of sed that split up JSON into lexical tokens to process it in a stream but concluded that a token-based approach has too many drawbacks.

In this post, I'll consider how to change that design to be based around values instead of around tokens.

Traversing JSON with a stream of values

JSON is easy to turn into a stream of tokens, because it already is just a list of tokens, that's the point of tokens. Unfortunately, the same is not true for values. An array is a value but it doesn't really go before or after the values it contains, it goes around them. This makes turning JSON into a stream of values a bit tricky.

Simple values (null, booleans, numbers and strings) are also tokens so the approach from the last post still works. The difficulty lies with arrays and objects.

Arrays

The input stream reaches an array in the JSON input, what does it do? What list of values does it pass into the program?

One option is to only pass the elements of the array in, relying on the path to indicate arrays with an index. For example, if the path is "people" 0 then the value is the first element in the array in the "people" property of the root object. This works well for the input side, but is a pain for outputting as the job of tracking the current index of the array is put on the user. Operations like splitting arrays into parts would be really annoying with this system as for each element in the original array we'd have to calculate an index for the part it should go into, and its index within that part. What's more, empty arrays would be completely invisible!

I think there might be ways for the above idea to work well, but for now, I chose to be inspired by gron. The approach is to pass on an empty array value at the start of each array, and then go through each array element in turn. This means that when we output, we don't need to care what the index is. The outputting value just goes onto the end of whatever array is currently "open" and if we want a split between two arrays at the same level of nesting, we output an empty array value between them.

This still has a fairly serious disadvantage which I'll go over at the end, but for now, it's pretty decent.

Objects

Objects are pretty similar to arrays, but have keys. Fortunately, those keys are already part of the path, so they can be treated identically: read in an empty object, then each of the values in turn.

Example

Let's put this together by traversing an example JSON input

JSON:
{
	"name": "John Doe",
	"friends": ["Jack", "Jo", "Jim"]
}

Path-value pairs:
: {}
"name": "John Doe"
"friends": []
"friends" 0: "Jack"
"friends" 1: "Jo"
"friends" 2: "Jim"

Reassembling written out values

I haven't yet answered the question of how values will be reassembled on the other side. Consider this example:

JSON:
{
	"sequences": {
		"naturals": [0, 1, 2, 3, 4]
	},
	"colour": "purple"
}

Path-value pairs:
...
"sequences" "naturals" 4: 4
"colour": "purple"

There are no outputs between 4 and "purple" to indicate that an array and an object need to end. Fortunately, this can be determined by using the whole path instead of just the final element as the previous design did. After the 4 is written out, the overall JSON output looks like this:

{
	"sequences": {
		"naturals": [0, 1, 2, 3, 4

This is a very non-committal state, in which stred is ready to either append to the array or end it. When it receives the "purple", it looks at the path of the previous value and that of the new one, and computes the characters it needs to output to position itself to output this new value. In this case, it computes that it needs to end an array, end an object, then output the "colour" key. After those come the value of "purple", after which the overall JSON output looks as follows:

{
	"sequences": {
		"naturals": [0, 1, 2, 3, 4]
	},
	"colours": "purple"

Again, stred doesn't commit to the continuation or the end of the object. At this point, no more values are output, and so stred computes that it needs to close one more object before it exits.

A nice side effect of this is that the values to start arrays and objects become mostly optional. If the path of a value includes an array or object that doesn't yet exist, stred can begin it automatically.

Values and subex

This sorts out how stred can stream values in and out, but what about subexes, which previously relied on the token-based approach? Like regular expressions, substitute expressions take a finite array of simple elements as input. Tokens were great for this, but it turned out they also made it hard to work with more complex values. Not a good trade-off!

It would be nice to treat values as elements, but this is problematic for data structures. Arrays and objects are a single value as well as being many values. How can a subex process that?

Non-determinism to the rescue

Already, subexes heavily use non-determinism (I highly recommend reading more about this if it interests you!), and this problem can be solved by leaning on it a bit more. Arrays are sumultaneouslly a single value and many values? Then stred literally processes them as both at the same time.

For this to work, a subex for an array must be able to match either the array as a value, or its parts as a list of values (or both). There is already plenty of syntax for dealing with values, so it makes more sense to introduce new syntax to indicate deconstructing the array, and then reconstructing it again afterwards.

The @(.)@ uses @( to deconstruct the array into its elements, and runs the contained subex on each of the elements. In this case . is used to do nothing with the elements. Finally, the elements are reconstructed back into the array with )@.

Let's see some more examples:

Increment all numbers in an array

@(.`1`+)@

Here, @( )@ is used to indicate we are deconstructing (left hand @) and then reconstructing (right hand @) an array.

Sum all numbers in an array

@(.)-+

Here, @( )- is used to indicate we are deconstructing an array (left hand @) but not constructing a new data structure, just leaving the parts in place of the original array.

Get an array of the keys of an object

#(.(.$_))@

Here, #( )@ is used to indicate we are deconstructing an object (left hand #) and constructing an array. When breaking down an object the inner part has two values, the key and the value.

Increment the number at path `"people" 3 "age"`

#(~{people}~@{(.{2})#(~{age}~.`1`+)#(.{-0})}@)#

Here, we use curly brackets so the contents of the destructuring expression match the whole contents of the array, instead of individual elements. We also use ~{ }~ to break strings into individual characters. The syntax could probably do with some improving still, but the structure of the expression matches closely the structure of the value we are transforming.

More general example

Extract the JSON value at path `"people" 0`

stred -n 'S/~{people}~0$_(.{-0})/p'

With this command, the -n flag tells stred not to automatically output every value after the commands finish. S// matches and substitutes the path. If the path starts with "people" 0 then it removes that prefix, if not the next command is skipped. That next command is p which outputs the value, with a path missing that prefix. This means any value passing through that is part of the desired value is allowed through, but only after it's path is changed as if our chosen object were the root.

For this example, the difference to the previous implementation is subtle. The subex is fairly different, but mostly only in syntax. The semantic difference is that we now need to remove the prefix from the path. If we didn't, stred would output the larger object structure, but only containing this one value.

Advantages

Working with values instead of tokens is much more intuitive.
Transforming large/complex values is much easier.
The parser knows which subexes are operating on individual characters. This removes the need for syntax to distinguish a 0 character from a 0 number, and also makes it possible to use whitespace outside of strings in subexes.

Disadvantages

Considerable new syntax

Syntax for terminals of data structures is no longer required, however, syntax has had to be introduced for the start and end of each structure. Furthermore, syntax is needed to indicate if we are iterating or looking at all the content in one go, and it might be necessary/desired to introduce even more down the line. This new syntax adds complexity as well as making subexes harder to read and write.

Running commands at the end of a data structure is extremely difficult

The new design has the empty object/array at the beginning of a structure, but nothing at its end. This means operations like appending to an array are just painful. We'd need to set some boolean when we enter an array and then if that boolean is set and we are not in the array we must have just left it. Then we have to unset the boolean, backup the path and value somewhere so we can use the path and value for the end of the array, print, then bring back whatever we backed up. Yuck!

Subexes dealing with dynamic levels of nesting are now impossible to write

Consider an array of arrays of arrays etc. We don't know how deep the nesting goes and want to flatten it. Before, we could just remove all the array terminals, but now the level of nesting we are working with is static within a single subex, so we'd need a loop running a subex several times. Honestly, having to use a loop isn't much worse and needing to do this at all is fairly rare. I don't think I'll fix it.

Existing problems that are still unresolved

In my last post, I listed quite a few problems with the design. I reiterate here those that still need addressing. For more detail on these issues, have a look at my last post.

Subex may be unnecessarily complex.
The N command.

Conclusion

I think these changes to the design are a move in the right direction, and being able to work more easily with values instead of tokens opens up new opportunities to address other problems as well. I did partially implement these changes in the go prototype around commit bed0e712deda5038f52e495bacae003098df7a55, so feel free to have a look there for how an implementation of these ideas might look. Do note that I didn't finish the implementation so you can't really use it for anything.

If you've found this interesting and want to discuss it with me, you can reach me at: shtanton at shtanton dot xyz

Next I'd like to look at either finding a way for the N command to work well, or maybe changing the way numbers work.