Stred Part 3: No More Path Space
2024-05-06This post is the third part of a series starting here. It might not make much sense without reading the earlier parts first. In those parts, I developed an idea for a version of sed that worked on JSON by iterating over a stream of path-value pairs.
One significant drawback of the previous design was that there wasn't a practical way to implement the N command (read in the next value and append to the current work space). This is because it would really need to keep the paths of the original and the new values available, which just wasn't practical. Solving this problem will be the driving force of the changes I make in this post.
Abandoning the path space
All previous designs have had a path space which stores the path of the value being worked on. For example, with this input:
{ "numbers": [5, 3, 7] }
If the value 3
had just been read into the work space, then "numbers" 1
will have been read into the path space.
When that value is written out, stred would make sure that it ended up at the path specified
(see the last post for details).
This is an issue if we want to load in the next value (which would be 7
in the above example)
while keeping 3
still around.
Now the path needs to be both "numbers" 1
and "numbers" 2
.
I experimented briefly with ways of keeping several paths in the path space, but it is very awkward to use. What happens if there are 3 paths and only two values in the work space? Or vice versa?
I eventually decided to do away with the path space entirely.
Combining paths and values
We can solve this by storing the path inside the value itself,
so paths and values must always stay in pairs.
The way we'll do this is by keeping the relevant parts of the surrounding structures.
For example, instead of just 3
the work space would contain
{"numbers": [1: 3]}
This gives access to both the path and the value in one unit. One interesting note here is that arrays have keys, as without them there's no way to check the index of the value in the overall array.
This makes the outputting tricky though, continuing the example above, the values that pass through the program, in order, are:
{} {"numbers": []} {"numbers": [0: 5]} {"numbers": [1: 3]} {"numbers": [2: 7]} {"numbers": []} {}
Parts of the input are repeated many times so they need to be "merged" back together as they are output. Each time a value is output, it gets merged onto the end of what has already been output. This is fairly similar to what stred was already doing, but can also handle cases like the following:
{"numbers": [3: 10, 4: 11, 5: 12], "other": null}
In this case, a lot more data gets written out.
Were this to be output after another value in the "numbers"
array,
the arrays would be merged.
If not then the array would be created.
Either way, the "other"
value would come after.
With all this in place, we can get rid of the path space and add the N command!
Start and end commands
You may have noticed that structures now have empty values read in at the beginning and end,
not just at the beginning as they did before.
To be able to tell them apart, we have the new a
and e
commands.
The a
command will only run the command after it if the most recently read value
was the start of an overal input value, such as the []
at the start of reading
an array.
Similarly, e
checks if it is an end.
For simple values that are read all at once (null, booleans, numbers and strings),
they are neither a start or an end.
These new commands make it easy to append or prepend to structures.
The commands below prepend the number 10 to the "numbers"
array.
It uses a
to only touch the empty array read in at the start of the array,
and insert a new value to it.
as/#( "numbers" @( `0 10` )@ )#/
Yes, I accidentally changed the destructuring syntax since the last post.
They work by checking if the value in the work space is the start of a structure,
and if it is, use a subex to insert a new item if that structure is the
"numbers"
array.
We also introduce A
and E
, which check if the previous value was
the start of a value or if the next will be the end of a value respectively.
We will see later what these can be helpful for.
The merge command
If we were to use N to load another value into the work space, we'd find ourselves with something like this:
{"numbers": [1: 3]} {"numbers": [2: 7]}
A more convenient way of storing basically the same information would be:
{"numbers": [1: 3, 2: 7]}
This is what the merge command (m) does. It merges the last two values in the workspace together in place (going from the first snippet above, to the second).
It also finally unlocks the ability to load a whole value into the workspace, even if it is a structure. This will be extremely useful, as we'll see later. Here's how we do it:
- Is the current value the start of the target structure? If not, skip remaining steps
- Use N to load the next value and append it to the work space
- Is the newly loaded value the end of the target structure? If yes, skip to step 6
- Merge the new value with the existing value
- Go to step 2
- Merge the new value with the existing value
Effectively, when we detect the start of the desired structure, we use N and m to load
more of it until we reach the end.
Below are the stred commands we'd use to load the array at the path "numbers"
.
as/#( "numbers" @()@ )#/
:l N
es/.{-0} #( "numbers" @()@ )#/be
m
bl
:e m
All together we have
as/#( "numbers" @()@ )#/{ :l N es/.{-0} #( "numbers" @()@ )#/be m bl :e m }
This is quite long and repetitive, and loading a structure into the workspace is useful enough that I added a shorthand:
M/#( "numbers" @()@ )#/
Which does (pretty much) the same thing as the earlier commands. It will also only run the command after it if it matches (similar to the s command).
When loading a full value,
we might not know which structure to use or even if it is a structure at all.
To make things easier, the ,
(comma) subex matches non-structures and empty structures.
This means that M/#( "numbers" , )#/{ ... }
will run the commands in curly brackets on
the single value if it is not a structure, or will load a whole structure and then run the commands
if it is.
Example: Delete the last element of an array
This is an interesting problem for a utility like this.
Deleting a structure means deleting all of the parts, starting with the first,
but to know if it is the last element, we need to check the value read just after the last part.
These can be arbitrarily far apart, and currently we only check one value into the future (using E
).
We can solve this using M
to first make sure we have the whole value loaded,
that way if the next element is the end of a structure, the whole final value can be discarded.
The below will delete the last element in the "people"
array.
stred 'M/#( "people" @( . , )@ )#/{ Ed }'
If you want to play around with this, the above command (and all others in this post, to my knowledge) will actually execute correctly with the latest commit at the time of writing (commit b434fe4e14f6dcc8d1d7433a29351b8e8ea77d37).
Example: Compute full names
This example accepts input in the form listed below and will produce an array of all the full names by joining the first and last names together with a space in between.
{ "people": [ { "first_name": "...", "last_name": "...", "age": ..., ... }, ... ], ... }
To start with, for each person structure, we load the whole person object into the memory space:
M/#( "people" @( . , )@ )#/
Then we need a subex that iterates the fields of the map and stores the contents of the
"first_name"
and "last_name"
strings into slots.
Then it should output the joined string.
We use #( "people"$_ ... )-
to destruct the overall map into it's elements and discard
the "people"
key.
The value of that key is an array with a single element, the person object.
We want to keep the same index when we output the name, so we use
@( . ... )@
to access the element of the array without changing anything else about it.
Now, considering just the person map itself, we want to iterate the fields, storing the first name and
last name strings into slots.
We actually store the contents of the strings (a list of characters, which is different to a string).
Square brackets let us iterate the contents of a data structure, and we use $
to store
the first name in a
and the last name in b
.
#[ ("first_name" ".{-0}$a") | ("last_name" ".{-0}$b") | (..) ]#
We then want to discard the person map and in its place create the full name string.
We use `"$a $b"`
to output the two parts of the name with a space in between
and $_
to discard the rest of the person map.
Once we are done processing values, we use p to output the value. The full stred command is:
stred -n 'M/#( "people" @( . , )@ )#/{ s/#( "people"$_ @( . (#[ ("first_name" ".{-0}$a") | ("last_name" ".{-0}$b") | (..) ]#$_) `"$a $b"` )@ )-/p }'
Advantages over the previous design
- Working with several values with different paths at the same time is now possible.
- Loading a whole data structure from the input is now possible.
- Running commands at the end of a data structure is now much easier.
Disadvantages
Slightly more cumbersome syntax
Including all the brackets wrapping the layers of structures is quite a lot,
and the need for brackets to be balanced is awkward.
To illustrate this, consider getting a value at the path "first" "second" "third"
,
shown below with the previous version of stred first and the new one second.
S/("first" "second" "third")$_/p s/#( "first"$_ #( "second"$_ #( "third"$_ . )- )- )-/p
Existing problems that remain unresolved
There are still some issues from previous posts that haven't been addressed, I list them here to give an indication of progress but for detail on these check the previous two posts.
- Subex may be unnecessarily complex.
- Excessive syntax
Conclusion
These are smaller changes than I've written about previously, and my list of outstanding problems continues to shrink. The go prototype at commit b434fe4e14f6dcc8d1d7433a29351b8e8ea77d37 is in a pretty good state. It has all of the above implemented and is decently usable (but totally undocumented). Send me an email if you want to experiment with it and I'll write some docs.
If you've found this interesting and want to discuss it further, you can reach me at: shtanton at shtanton dot xyz
My next plan is to totally overhaul the way numbers are processed in a way that
a) fits better with the style of subex and
b) will remove the need for all the syntax dedicated to arithmetic operations.
That second point will hopefully give me a lot more flexibility to improve the syntax once characters like
+
, *
and -
aren't dedicated to arithmetic.