<- Back to shtanton's homepage
aboutsummaryrefslogtreecommitdiff
path: root/README.md
blob: 79cfa8e4c5669111a35a17c03e1ce9e63f34c863 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
# stred
## Streaming Tree Editor

This README is a work in progress. It is not very clear and is also probably out of date

Stred is like sed but for tree data structures (currently only JSON).
It has a lot in common with the pattern: `gron | sed | ungron`.

An input JSON file is decoded into lexical tokens, and then a program is run on each of these tokens.
These programs are simply a list of commands.

## Why?

jq is alright, but it's a full on language, it's massive.
stred aims to be much more like a tool than a language, whether it succeeds in that aim, we'll see.
I'm hoping it will be capable of nearly as much as jq, with comparable or better performance, but far easier to learn and use.

## Usage

Below is something between a tutorial and a reference guide, but before that let's go through an example.
Suppose that we want to count how many children the JSON root node has.
For this example we'll assume that the JSON root is either an array or an object.

First of all, our overall input will be a JSON value and our output will be a number, so it makes sense to use the `-n` flag which disables the automatic outputting of every token.
This flag is really handy when we aren't keeping most of the data the same and only making a few changes.

Now we know we want a stred script that outputs the number of children of the root node, so we decompose the problem and think about how exactly we want to count that?
Knowing that every stred script is executed by running it on each token of the overall JSON value, the logical steps are as follows:

* 1 - When the program starts, initialise some counter to 0
* 2 - Each time we pass a token that is a child of the root node, increment the counter
* 3 - When the program ends, output the counter

Starting with step 1, we decide that out counter is going to be a single numeric atom stored in the X register.
We then break down step 1 into smaller steps.

* 1.1 - Are we at the start of execution?
* 1.2 - If we are, set the content of the X register to 0

Focusing on step 1.1, to test if we are at the beginning of the program, we can test for the very first `{` or `[` of the JSON input.
Those tokens have an empty path (as they are part of the root node), so we test for them with two steps.

* 1.1.1 - Is the path empty?
* 1.1.2 - Is the value either `{` or `[`?

We want to test that *both* of these conditions are true and if they are we will have some number of commands we want to run.

For step 1.1.1, we want to substitute the path for itself if it is empty, so we just use an empty path substitution: `S//`.

Whatever command comes after it will be executed only if it succeeds, so we then check the value register with a substitution that doesn't change the value register but checks that it is the start of an array or object: ``s/[`{[`]/``.

We will want to put some commands after these (that will only be run if both succeed) so we put some curly brackets `{}` to encapsulate the commands we'll use for step 1.2 so that the whole group of them are only run at the start of the program.

This gives us our step 1.1 commands: ``S//s/[`{[`]/{}``.

Now for step 1.2, we want to set the X register to 0. We'll use a substitution to introduce the 0 but since we can't substitute on the X register directly, we break it down into smaller steps.

* 1.2.1 - Delete the contents of the value register
* 1.2.2 - Set the value register to 0
* 1.2.3 - Swap the value register with the X register

Step 1.2.1 is easy, just a `d` command.

Step 1.2.2 uses a substitution on the value register, this time we want to use the output `=<stuff>=` syntax to output a 0 to the value register.
We could use ``s/=`0`=/`` for this, but it's too much effort and there is a shorthand that we can use instead of ``=`0`=`` so we have `~0~` instead.
Even better, when we use `~` as the delimiter for a substitution it is still included in the subex so we can also leave out the `/`s.
This means for step 1.2.2 our command is simply `s~0~`.

Step 1.2.3 is also easy, just a `x` command.

This means our command for step 1.2 is `ds~0~x` and combining it with step 1.1: ``S//s/[`{[`]/{ds~0~x}``.

Now we carry on, let's look at step 2

* 2.1 - Is the token a child of the root node?
* 2.2 - If it is, increment the number in the X register

Before we break down 2.1 into commands, we need to make an observation.
If a child value is an array, then it has two tokens that are children of the root node, a start and an end.
Objects have the same problem.
We want to make sure we only count one of these, so we'll only be counting the starting tokens.

* 2.1.1 - Does the path have a length of 1?
* 2.1.2 - Is the value something other than an array or object closer?

For step 2.1.1, we again just use a path substitution, with `,` that represents any single JSON value (null, boolean, number or string): `S/,/`.

For step 2.1.2, we want to check that it is either a value `,` or the beginning of an array or object so we use ``s/,|[`{[`]/``.
The `|` subex operator will try the subexes on both sides of it and `[]` are used to recognise one of several options, in this case the options are the start of an object and the start of an array.

Once again we will want to run all of the step 2.2 commands only if both of these checks pass, so we add some more curly brackets for the step 2.2 commands after them: ``S/,/s/,|[`{[`]/{}``.

Breaking down step 2.2, we know that our count is in the X register but we can't substitute there so we'll need to do all the work in the value register.

* 2.2.1 - Swap the value register and X register
* 2.2.2 - Increment the value register
* 2.2.3 - Swap the value register and X register

Steps 2.2.1 and 2.2.3 are easy and we've seen them before: `x`.

Step 2.2.2 is interesting as we will need a subex that increments a value.
Subexes can add the outputs of other subexes with the `+` operator.
We want to add the input to 1, so we use `%` to copy across the input and then `~1~` to output a 1.
Both of these outputs get captured by the `+` which goes after them and adds them, giving us the increment we desire: `s/%~1~+/`

With this, we have `xs/%~1~+/x` for step 2.2 and ``S/,/s/,|[`{[`]/{xs/%~1~+/x}`` for the whole of step 2.

Finally, we break down step 3

* 3.1 - Are we at the end of the JSON input?
* 3.2 - If we are, output the content of the X register

Step 3.1 is very similar to step 1.1, so I won't break it down any further, hopefully it's clear that what we want is ``S//s/[`}]`]/{}`` so step 3.2 can go inside the curly brackets.

Step 3.2 is very simple.
stred has a command, `p`, to output the contents of the value register so to output what's in the X register, we use `xp`.

This means the commands for step 3 are: ``S//s/[`]}`]/{xp}``.

Because the three steps don't overlap on which parts of the JSON input they work with, the order we write them in is actually arbitrary.
Since we developed our solution going step 1, then 2, then 3 I'll write it that way, but personally why I needed this script I found it easier to write the increment part first, so I did.
stred is very good at letting you write your commands in the order you think of them, which is handy because although writing stred scripts is manageable, reading them is painful.

Our final script is: ``` S//s/[`{[`]/{ds~0~x} S/,/s/,|[`{[`]/{xs/%~1~+/x} S//s/[`]}`]/{xp} ```

### Registers and Atoms
Commands operate on data that is store in 'registers'. There are 5 registers: path, value, X, Y and Z.

The path register stores the global path from the root node to the current token, the value register stores the token itself and the X, Y and Z registers are general purpose registers, good for keeping things for later.

With the following JSON being read:
```
{
  "nest1": {
    "nest2": "data"
  }
}
```
When the `"data"` token is read, the path register will be set to `"nest1""nest2"` and the value register will be set to `"data"`.

Other than path and value being automatically updated with each value, they all function identically. Each of them stores a list of 'atoms'.
For simple tokens (null, booleans, numbers and the starts and ends of objects and arrays) the tokens are themselves atoms. To better deal with strings though, they are split into smaller atoms.
A string is comprised of a StringTerminal atom, usually written `"`, then an atom for each unicode codepoint (character) in the string, and finally another `"` atom.
This is why the path I showed earlier was `"nest1""nest2"` instead of something like `["nest1", "nest2"]`.

### Substitutions

The substitution commands are the most important by far.
The substitution command `s` is always followed by a 'subex', usually wrapped in `/`.
For example, to substitute a single atom for itself, we would use the command `s/./`.
`.` is the subex for read in an atom and output it unchanged, so it's all we need.
The `s` command makes changes to the value register based on the subex, but as well as transforming the input into an output, a subex will also either accept or reject just like a traditional regex.
The command immediately after a substitution will only be run if the subex accepts.
`p` is the print command, and the `-n` flag will disable the default behaviour of stred to print by default, so if we want to only print tokens with an empty path (i.e. that are part of the root node), we use `S//p`.
Notice uppercase `S` is used to apply the substitution to the path register instead of the value register.
Running `stred -n 'S//p'` on the JSON above would simply output `{}`, as the root tokens are just beginning and ending the object.

### Substitute Expressions

The power of stred comes from substitute expressions, or subexes for short.
They operate on a list of atoms, and will either produce a new list of atoms, or reject.

#### Basic subexes

The simplest subexes are literals. These just copy directly from the input to the output.

| Syntax | Description |
| --- | --- |
| `.` | Copy across any single atom unchanged |
| `,` | Copy across any single JSON value (not `{`, `}`, `[` or `]` tokens) unchanged (will copy a whole string). Equivalent to `` `null`\|?\|%\|# `` |
| `?` | Copy across any single boolean atom |
| `%` | Copy across any single number |
| `_` | Copy across a single unicode codepoint inside a string |
| `"` | Copy across a string terminal (start or end of a string) |
| `#` | Copy across a string value. Equivalent to `"_{-0}"` |
| `` `<null\|bool\|number\|terminal>` `` | Copy across a specific boolean (true/false), a specific number or a specific object/array terminal |
| `<character>` | Copy across the specific character |
| ``[a-z`null`]`` | Will match and copy across any of the specified literals. A `-` can be used to create a range of characters |

If the input doesn't match then the subex rejects.
For example, `a` copies across specifically the character `a` inside a string, anything else will reject.
Literals like `%` will copy across any number but will reject an atom like `a` that is not a number.

#### Concatenation and Alternation

Writing a series of subexes one after another will concatenate them.
This means that the first reads some atoms in and writes some out, then the second etc.
The first subex effectively splits the input into a chunk that it will transform, and a chunk for everything else to transform which gets passed over to the second subex.
If either reject, the whole subex rejects.

Putting `|` between two subexes will try two options, so `"yes"|"no"` will accept and copy across either the string `"yes"` or the string `"no"` but nothing else.

#### Replacement subexes

The above are good for matching, but don't allow for transformation, so we can also add and remove with subexes.

| Syntax | Description |
| --- | --- |
| `=<content>=` | Accepts no input but outputs `content`, which is made up of specific literals |
| `<drop>:<content>:` | Runs `drop`, accepting input as it does, then discards any output and outputs `content` instead |
| `[a-z=A-Z]` | Accepts one of the characters from left of `=` as input and outputs a corresponding character from the right as output. If there are more input options than output, the output range is looped |

A useful construct is ``` [`{``}`=`[``]`] ``` which will replace an object terminal with the corresponding array terminal.
To make using these literals easier, they can be listed inside a single pair of `` ` ``.
`` `1 2 {}null` `` is equivalent to ``` `1``2``{``}``null` ```.

There a few other helpful shortcuts for the output syntax

| Syntax | Equivalent |
| --- | --- |
| `~true~` | ``=`true`=`` |
| `^hello^` | `="hello"=` |

#### Slots

To allow for rearranging data, there is a way to store the output of a subex into a 'slot', which can then be referenced later.

`<will_be_stored>$a` will store the output of `will_be_stored` into the slot `a`.
These slots can be referenced in the output syntax so to output the contents of `a` we would use `=$a=`.
As an example, to swap two atoms, we can use `.$a.=$a=`.
Read: read the first atom and store it in slot `a`, then copy across the second atom, then output the content of slot `a` (which is the first atom).

#### Repetition

An extremely versatile feature of regexes and subexes is repetition.
In subexes this is done with a postfix operator `{...}` where `...` defines the acceptable number of repetitions.
Regex has an operator that 'repeats' either once or zero times, and is greedy so if once and zero times are both valid it will use once.
In subex, this would look like `{1,0}`. The priority is ordered left to right so if we didn't want it to be greedy we would use `{0,1}`.
If we wanted to repeat something as many times, up to a maximum of 5, we would use a range `{5-0}`.
If we want unbounded repetition, we just leave out the maximum `{-0}`.

#### Arithmetic

Subexes can also do arithmetic by operating on the output of the subex before.

| Syntax | Description |
| --- | --- |
| `+` | Sum |
| `*` | Product |
| `-` | Negate all atoms individually |
| `/` | Take the reciprocal of all atoms individually |
| `!` | Boolean NOT all atoms individually |

These will all try to cast types to work as best they can.

For example, to sum all atoms in the input and output the result, we use `.{-0}+`.
We read in and copy out as many atoms as possible (all of them), and then we sum all that output.

### Commands

With an understanding of subexes, we can look at the stred commands

| Command | Description |
| --- | --- |
| `s/<subex>/` | Run `subex` on the value register. If it accepts, replace the value register with the output. If not skip the next command |
| `S/<subex>/` | Same as `s` but on the path register |
| `f/<subex>/` | Shorthand for `S/<subex>.{-0}/` |
| `F/<subex>/` | Shorthand for `S/<subex>(.{-0}::)/` |
| `l/<subex>/` | Shorthand for `S/.{0-}<subex>/` |
| `L/<subex>/` | Shorthand for `S/.{0-}::<subex>/` |
| `a/<subex>/` | Shorthand for `s/<subex>\|.{-0}` |
| `A/<subex>/` | Shorthand for `S/<subex>\|.{-0}` |
| `p` | Print whatever is in the value register |
| `d` | Delete whatever is in the value register |
| `D` | Delete whatever is in the path register |
| `n` | Skip to the next token |
| `N` | Load the next token and append it's value to the value register and replace the path register with the new token's path |
| `o` | Do nothing |
| `x` | Swap the value register with the X register |
| `X` | Append the contents of the value register to the X register |
| `y` | Swap the value register with the Y register |
| `Y` | Append the contents of the value register to the Y register |
| `z` | Swap the value register with the Z register |
| `Z` | Append the contents of the value register to the Z register |
| `k` | Swap the value register with the path register |
| `K` | Append the contents of the value register to the path register |