# stred
## Streaming Tree Editor

This README is a work in progress. It is not very clear and is also probably out of date.

Stred is like sed but for tree data structures (currently only JSON).
It has a lot in common with the pattern: `gron | sed | ungron`.

An input JSON file is decoded into lexical tokens, and then a program is run on each of these tokens.
These programs are simply a list of commands.

## Why?

jq is alright, but it's a full-on language and it's massive.
stred aims to be much more like a tool than a language; whether it succeeds in that aim, we'll see.
I'm hoping it will be capable of nearly as much as jq, with comparable or better performance, while being far easier to learn and use.

## Usage

### Registers and Atoms
Commands operate on data that is stored in 'registers'. There are 3 registers: path, value and X.

The path register stores the global path from the root node to the current token, the value register stores the token itself and the X register is a general purpose register, good for keeping things for later.

With the following JSON being read:
```
{
  "nest1": {
    "nest2": "data"
  }
}
```
When the `"data"` token is read, the path register will be set to `"nest1""nest2"` and the value register will be set to `"data"`.
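
To make the path and value registers a bit more concrete, here is a rough Python sketch of walking a JSON document and producing a (path, token) pair for each token. It is only an illustration, not how stred is implemented. The `walk` function is a made-up name, and the paths given to nested `{`/`}` tokens and the use of indices for array elements are my assumptions.

```python
import json

def walk(node, path=()):
    """Yield (path, token) pairs in document order (hypothetical helper)."""
    if isinstance(node, dict):
        yield path, "{"
        for key, child in node.items():
            yield from walk(child, path + (key,))
        yield path, "}"
    elif isinstance(node, list):
        yield path, "["
        # Assumption: array elements are addressed by their index.
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
        yield path, "]"
    else:
        # null, booleans, numbers and strings are single tokens.
        yield path, node

doc = json.loads('{"nest1": {"nest2": "data"}}')
tokens = list(walk(doc))
for path, token in tokens:
    print(path, token)
```

The `"data"` token comes out with the path `('nest1', 'nest2')`, and the outermost `{` and `}` come out with an empty path.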

Other than path and value being automatically updated with each token, they all function identically. Each of them stores a list of 'atoms'.
For simple tokens (null, booleans, numbers and the starts and ends of objects and arrays) the tokens are themselves atoms. To better deal with strings though, they are split into smaller atoms.
A string is made up of a StringTerminal atom, usually written `"`, then an atom for each unicode codepoint (character) in the string, and finally another `"` atom.
This is why the path I showed earlier was `"nest1""nest2"` instead of something like `["nest1", "nest2"]`.
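
As a small sketch of that splitting, assuming we model the StringTerminal atom as a plain `"` character (stred itself treats it as a distinct kind of atom), the helper names below are made up for illustration:

```python
def string_atoms(s):
    """Split one string into atoms: terminal, one atom per codepoint, terminal."""
    return ['"'] + list(s) + ['"']

def path_atoms(keys):
    """Concatenate the atoms of each key to form the path register contents."""
    atoms = []
    for key in keys:
        atoms.extend(string_atoms(key))
    return atoms

path = path_atoms(["nest1", "nest2"])
print(len(path))       # 14 atoms: 2 terminals + 5 characters, twice
print("".join(path))   # "nest1""nest2"
```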

### Substitutions

The substitution commands are the most important by far.
The substitution command `s` is always followed by a 'subex', usually wrapped in `/`.
For example, to substitute a single atom for itself, we would use the command `s/./`.
`.` is the subex that reads in an atom and outputs it unchanged, so it's all we need.
The `s` command makes changes to the value register based on the subex, but as well as transforming the input into an output, a subex will also either accept or reject just like a traditional regex.
The command immediately after a substitution will only be run if the subex accepts.
`p` is the print command, and the `-n` flag disables stred's default behaviour of printing, so if we want to only print tokens with an empty path (i.e. that are part of the root node), we use `S//p`.
Notice uppercase `S` is used to apply the substitution to the path register instead of the value register.
Running `stred -n 'S//p'` on the JSON above would simply output `{}`, as the root tokens are just beginning and ending the object.
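
The accept/reject behaviour can be sketched in Python as below. This is a toy model of `stred -n 'S//p'` only, not stred's real code. The token list is written out by hand, and the paths for the nested `{` and `}` are my assumption.

```python
def run_program(tokens):
    """Toy model of `S//p` with -n: print a token only if its path is empty."""
    printed = []
    for path_atoms, value in tokens:
        # The empty subex reads nothing, so it only accepts an empty path.
        accepted = len(path_atoms) == 0
        # `p` is the command after the substitution, so it only runs on accept.
        if accepted:
            printed.append(value)
    return "".join(printed)

tokens = [
    ([], "{"),
    (list('"nest1"'), "{"),
    (list('"nest1""nest2"'), '"data"'),
    (list('"nest1"'), "}"),
    ([], "}"),
]
result = run_program(tokens)
print(result)  # {}
```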

### Substitute Expressions

The power of stred comes from substitute expressions, or subexes for short.
They operate on a list of atoms, and will either produce a new list of atoms, or reject.

#### Basic subexes

The simplest subexes are literals. These just copy directly from the input to the output.

| Syntax | Description |
| --- | --- |
| `.` | Copy across any single atom unchanged |
| `,` | Copy across any single JSON value (not `{`, `}`, `[` or `]` tokens) unchanged (will copy a whole string). Equivalent to `` `null`\|?\|%\|# `` |
| `?` | Copy across any single boolean atom |
| `%` | Copy across any single number |
| `_` | Copy across a single unicode codepoint inside a string |
| `"` | Copy across a string terminal (start or end of a string) |
| `#` | Copy across a string value. Equivalent to `"_{-0}"` |
| `` `<null\|bool\|number\|terminal>` `` | Copy across a specific boolean (true/false), a specific number or a specific object/array terminal |
| `<character>` | Copy across the specific character |
| ``[a-z`null`]`` | Will match and copy across any of the specified literals. A `-` can be used to create a range of characters |

If the input doesn't match then the subex rejects.
For example, `a` copies across specifically the character `a` inside a string, anything else will reject.
Literals like `%` will copy across any number but will reject an atom like `a` that is not a number.

#### Concatenation and Alternation

Writing a series of subexes one after another will concatenate them.
This means that the first reads some atoms in and writes some out, then the second etc.
The first subex effectively splits the input into a chunk that it will transform, and a chunk for everything else to transform which gets passed over to the second subex.
If either rejects, the whole subex rejects.

Putting `|` between two subexes will try two options, so `"yes"|"no"` will accept and copy across either the string `"yes"` or the string `"no"` but nothing else.
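
One way to picture concatenation and alternation is as small functions that each take a list of atoms and offer (output, leftover input) pairs in priority order. The Python below is a sketch of that idea under my own assumptions, not stred's implementation: `lit`, `concat`, `alt`, `apply` and `string` are hypothetical names, and string terminals are modelled as plain `"` characters.

```python
def lit(atom):
    """Copy one specific atom across; offer nothing (reject) otherwise."""
    def run(atoms):
        if atoms and atoms[0] == atom:
            yield [atom], atoms[1:]
    return run

def concat(*parts):
    """Run each part on whatever input the previous part left over."""
    def run(atoms):
        if not parts:
            yield [], atoms
            return
        first, rest = parts[0], concat(*parts[1:])
        for out1, remaining in first(atoms):
            for out2, leftover in rest(remaining):
                yield out1 + out2, leftover
    return run

def alt(left, right):
    """Try the left option first, then the right."""
    def run(atoms):
        yield from left(atoms)
        yield from right(atoms)
    return run

def apply(subex, atoms):
    """Return the output if the whole input is consumed, else None (reject)."""
    for out, leftover in subex(atoms):
        if not leftover:
            return out
    return None

def string(s):
    """Concatenation of literals for a whole string, terminals included."""
    return concat(*[lit(c) for c in '"' + s + '"'])

yes_or_no = alt(string("yes"), string("no"))
print(apply(yes_or_no, list('"yes"')))    # the atoms of "yes", copied across
print(apply(yes_or_no, list('"maybe"')))  # None, i.e. reject
```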

#### Replacement subexes

The above are good for matching, but don't allow for transformation, so we can also add and remove with subexes.

| Syntax | Description |
| --- | --- |
| `=<content>=` | Accepts no input but outputs `content`, which is made up of specific literals |
| `<drop>:<content>:` | Runs `drop`, accepting input as it does, then discards any output and outputs `content` instead |
| `[a-z=A-Z]` | Accepts one of the characters from left of `=` as input and outputs a corresponding character from the right as output. If there are more input options than output, the output range is looped |

A useful construct is ``` [`{``}`=`[``]`] ``` which will replace an object terminal with the corresponding array terminal.
To make using these literals easier, they can be listed inside a single pair of `` ` ``.
`` `1 2 {}null` `` is equivalent to ``` `1``2``{``}``null` ```.
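
The `[a-z=A-Z]` form from the table can be thought of as a lookup table from input characters to output characters. The sketch below handles plain characters and ranges only (no backtick literals), and it assumes that "looped" means the output list wraps around once it runs out. `expand` and `mapping` are made-up helper names.

```python
def expand(spec):
    """Expand a spec like 'a-z' or 'xy' into a list of characters."""
    chars = []
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            # A range such as a-z: include both endpoints.
            chars.extend(chr(c) for c in range(ord(spec[i]), ord(spec[i + 2]) + 1))
            i += 3
        else:
            chars.append(spec[i])
            i += 1
    return chars

def mapping(spec):
    """Build the input -> output table for a spec like 'a-z=A-Z'."""
    inputs, outputs = (expand(part) for part in spec.split("="))
    # Assumption: when inputs outnumber outputs, the outputs wrap around.
    return {src: outputs[i % len(outputs)] for i, src in enumerate(inputs)}

upper = mapping("a-z=A-Z")
print(upper["q"])          # Q
print(mapping("a-d=xy"))   # {'a': 'x', 'b': 'y', 'c': 'x', 'd': 'y'}
```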

There are a few other helpful shortcuts for the output syntax:

| Syntax | Equivalent |
| --- | --- |
| `~true~` | ``=`true`=`` |
| `^hello^` | `="hello"=` |

#### Slots

To allow for rearranging data, there is a way to store the output of a subex into a 'slot', which can then be referenced later.

`<will_be_stored>$a` will store the output of `will_be_stored` into the slot `a`.
These slots can be referenced in the output syntax so to output the contents of `a` we would use `=$a=`.
As an example, to swap two atoms, we can use `.$a.=$a=`.
Read: read the first atom and store it in slot `a`, then copy across the second atom, then output the content of slot `a` (which is the first atom).
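
Spelled out step by step in Python, a rough model of `.$a.=$a=` might look like this. `swap_two` is a hypothetical name, and the sketch hard-codes this one subex rather than parsing anything.

```python
def swap_two(atoms):
    """Toy model of `.$a.=$a=`; returns None to signal reject."""
    if len(atoms) != 2:
        return None           # the two `.` must consume exactly the whole input
    slots = {}
    output = []
    slots["a"] = [atoms[0]]   # `.$a`: read the first atom and store it in slot a
    output += [atoms[1]]      # `.`: copy the second atom across
    output += slots["a"]      # `=$a=`: output the contents of slot a
    return output

print(swap_two([1, 2]))  # [2, 1]
```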

#### Repetition

An extremely versatile feature of regexes and subexes is repetition.
In subexes this is done with a postfix operator `{...}` where `...` defines the acceptable number of repetitions.
Regex has an operator that 'repeats' either once or zero times, and is greedy so if once and zero times are both valid it will use once.
In subex, this would look like `{1,0}`. The priority is ordered left to right so if we didn't want it to be greedy we would use `{0,1}`.
If we wanted to repeat something as many times as possible, up to a maximum of 5, we would use a range `{5-0}`.
If we want unbounded repetition, we just leave out the maximum `{-0}`.
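
Since the priority is read left to right, a repetition spec can be pictured as an ordered list of repeat counts to try. The Python sketch below expands a spec into that list. It is only my reading of the syntax: `repetition_counts` is a made-up name, `limit` stands in for "unbounded" (for example the number of atoms left in the input), and the handling of an open right-hand end such as `0-` is an assumption.

```python
def repetition_counts(spec, limit):
    """Expand e.g. '1,0', '5-0' or '-0' into the ordered counts to try."""
    counts = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            # A missing end of the range means unbounded, capped at `limit`.
            start = limit if start == "" else int(start)
            end = limit if end == "" else int(end)
            step = 1 if end >= start else -1
            counts.extend(range(start, end + step, step))
        else:
            counts.append(int(part))
    return counts

print(repetition_counts("1,0", 3))  # [1, 0]   greedy optional
print(repetition_counts("0,1", 3))  # [0, 1]   non-greedy optional
print(repetition_counts("5-0", 3))  # [5, 4, 3, 2, 1, 0]
print(repetition_counts("-0", 3))   # [3, 2, 1, 0]
```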

#### Arithmetic

Subexes can also do arithmetic by operating on the output of the subex before.

| Syntax | Description |
| --- | --- |
| `+` | Sum |
| `*` | Product |
| `-` | Negate all atoms individually |
| `/` | Take the reciprocal of all atoms individually |
| `!` | Boolean NOT all atoms individually |

These will all try to cast types to work as best they can.

For example, to sum all atoms in the input and output the result, we use `.{-0}+`.
We read in and copy out as many atoms as possible (all of them), and then we sum all that output.
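
As a rough Python picture of the table above, each operator takes the list of atoms output by the subex before it and produces a new list. The function names are made up, and the type casting is left out entirely since I haven't described the rules here, so this sketch only handles numbers and booleans.

```python
import math

def op_sum(atoms):          # `+`: collapse all atoms into their sum
    return [sum(atoms)]

def op_product(atoms):      # `*`: collapse all atoms into their product
    return [math.prod(atoms)]

def op_negate(atoms):       # `-`: negate every atom individually
    return [-a for a in atoms]

def op_reciprocal(atoms):   # `/`: reciprocal of every atom individually
    return [1 / a for a in atoms]

def op_not(atoms):          # `!`: boolean NOT of every atom individually
    return [not a for a in atoms]

# `.{-0}+`: copy across every atom, then sum that output.
copied = [1, 2, 3]
print(op_sum(copied))  # [6]
```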

### Commands

With an understanding of subexes, we can look at the stred commands:

| Command | Description |
| --- | --- |
| `s/<subex>/` | Run `subex` on the value register. If it accepts, replace the value register with the output. If not, skip the next command |
| `S/<subex>/` | Same as `s` but on the path register |
| `f/<subex>/` | Shorthand for `S/<subex>.{-0}/` |
| `F/<subex>/` | Shorthand for `S/<subex>(.{-0}::)/` |
| `l/<subex>/` | Shorthand for `S/.{0-}<subex>/` |
| `L/<subex>/` | Shorthand for `S/.{0-}::<subex>/` |
| `a/<subex>/` | Shorthand for `s/<subex>|.{-0}/` |
| `A/<subex>/` | Shorthand for `S/<subex>|.{-0}/` |
| `p` | Print whatever is in the value register |
| `d` | Delete whatever is in the value register |
| `D` | Delete whatever is in the path register |
| `n` | Skip to the next token |
| `N` | Load the next token and append its value to the value register and replace the path register with the new token's path |
| `o` | Do nothing |
| `x` | Swap the value register with the X register |
| `X` | Append the contents of the value register to the X register |
| `k` | Swap the value register with the path register |
| `K` | Append the contents of the value register to the path register |
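
The register commands at the bottom of the table can be modelled with a small Python class. This is just a sketch for intuition: the class and method names are made up, and I'm assuming "delete" means emptying the register.

```python
class Registers:
    """Toy model of the three registers; each holds a list of atoms."""

    def __init__(self, path, value):
        self.path = list(path)
        self.value = list(value)
        self.x = []

    def d(self):          # `d`: delete whatever is in the value register
        self.value = []

    def D(self):          # `D`: delete whatever is in the path register
        self.path = []

    def x_swap(self):     # `x`: swap the value register with the X register
        self.value, self.x = self.x, self.value

    def X_append(self):   # `X`: append the value register to the X register
        self.x = self.x + self.value

    def k_swap(self):     # `k`: swap the value register with the path register
        self.value, self.path = self.path, self.value

    def K_append(self):   # `K`: append the value register to the path register
        self.path = self.path + self.value

regs = Registers(path=list('"nest1""nest2"'), value=list('"data"'))
regs.X_append()   # keep a copy of the value for later
regs.d()          # value register is now empty
regs.x_swap()     # bring the saved copy back
print("".join(regs.value))  # "data"
```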