stred-go

Stred: Streaming Tree Editor. Like sed but for JSON. This is the go implementation
git clone https://shtanton.xyz/git/stred-go.git
Log | Files | Refs | README

README.md (16070B)


      1 # stred
      2 ## Streaming Tree Editor
      3 
      4 This README is a work in progress. It is not very clear and is also probably out of date
      5 
      6 Stred is like sed but for tree data structures (currently only JSON).
      7 It has a lot in common with the pattern: `gron | sed | ungron`.
      8 
      9 An input JSON file is decoded into lexical tokens, and then a program is run on each of these tokens.
     10 These programs are simply a list of commands.
     11 
     12 ## Why?
     13 
     14 jq is alright, but it's a full on language, it's massive.
     15 stred aims to be much more like a tool than a language, whether it succeeds in that aim, we'll see.
     16 I'm hoping it will be capable of nearly as much as jq, with comparable or better performance, but far easier to learn and use.
     17 
     18 ## Usage
     19 
     20 Below is something between a tutorial and a reference guide, but before that let's go through an example.
     21 Suppose that we want to count how many children the JSON root node has.
     22 For this example we'll assume that the JSON root is either an array or an object.
     23 
     24 First of all, our overall input will be a JSON value and our output will be a number, so it makes sense to use the `-n` flag which disables the automatic outputting of every token.
     25 This flag is really handy when we aren't keeping most of the data the same and only making a few changes.
     26 
     27 Now we know we want a stred script that outputs the number of children of the root node, so we decompose the problem and think about how exactly we want to count that?
     28 Knowing that every stred script is executed by running it on each token of the overall JSON value, the logical steps are as follows:
     29 
     30 * 1 - When the program starts, initialise some counter to 0
     31 * 2 - Each time we pass a token that is a child of the root node, increment the counter
     32 * 3 - When the program ends, output the counter
     33 
     34 Starting with step 1, we decide that out counter is going to be a single numeric atom stored in the X register.
     35 We then break down step 1 into smaller steps.
     36 
     37 * 1.1 - Are we at the start of execution?
     38 * 1.2 - If we are, set the content of the X register to 0
     39 
     40 Focusing on step 1.1, to test if we are at the beginning of the program, we can test for the very first `{` or `[` of the JSON input.
     41 Those tokens have an empty path (as they are part of the root node), so we test for them with two steps.
     42 
     43 * 1.1.1 - Is the path empty?
     44 * 1.1.2 - Is the value either `{` or `[`?
     45 
     46 We want to test that *both* of these conditions are true and if they are we will have some number of commands we want to run.
     47 
     48 For step 1.1.1, we want to substitute the path for itself if it is empty, so we just use an empty path substitution: `S//`.
     49 
     50 Whatever command comes after it will be executed only if it succeeds, so we then check the value register with a substitution that doesn't change the value register but checks that it is the start of an array or object: ``s/[`{[`]/``.
     51 
     52 We will want to put some commands after these (that will only be run if both succeed) so we put some curly brackets `{}` to encapsulate the commands we'll use for step 1.2 so that the whole group of them are only run at the start of the program.
     53 
     54 This gives us our step 1.1 commands: ``S//s/[`{[`]/{}``.
     55 
     56 Now for step 1.2, we want to set the X register to 0. We'll use a substitution to introduce the 0 but since we can't substitute on the X register directly, we break it down into smaller steps.
     57 
     58 * 1.2.1 - Delete the contents of the value register
     59 * 1.2.2 - Set the value register to 0
     60 * 1.2.3 - Swap the value register with the X register
     61 
     62 Step 1.2.1 is easy, just a `d` command.
     63 
     64 Step 1.2.2 uses a substitution on the value register, this time we want to use the output `=<stuff>=` syntax to output a 0 to the value register.
     65 We could use ``s/=`0`=/`` for this, but it's too much effort and there is a shorthand that we can use instead of ``=`0`=`` so we have `~0~` instead.
     66 Even better, when we use `~` as the delimiter for a substitution it is still included in the subex so we can also leave out the `/`s.
     67 This means for step 1.2.2 our command is simply `s~0~`.
     68 
     69 Step 1.2.3 is also easy, just a `x` command.
     70 
     71 This means our command for step 1.2 is `ds~0~x` and combining it with step 1.1: ``S//s/[`{[`]/{ds~0~x}``.
     72 
     73 Now we carry on, let's look at step 2
     74 
     75 * 2.1 - Is the token a child of the root node?
     76 * 2.2 - If it is, increment the number in the X register
     77 
     78 Before we break down 2.1 into commands, we need to make an observation.
     79 If a child value is an array, then it has two tokens that are children of the root node, a start and an end.
     80 Objects have the same problem.
     81 We want to make sure we only count one of these, so we'll only be counting the starting tokens.
     82 
     83 * 2.1.1 - Does the path have a length of 1?
     84 * 2.1.2 - Is the value something other than an array or object closer?
     85 
     86 For step 2.1.1, we again just use a path substitution, with `,` that represents any single JSON value (null, boolean, number or string): `S/,/`.
     87 
     88 For step 2.1.2, we want to check that it is either a value `,` or the beginning of an array or object so we use ``s/,|[`{[`]/``.
     89 The `|` subex operator will try the subexes on both sides of it and `[]` are used to recognise one of several options, in this case the options are the start of an object and the start of an array.
     90 
     91 Once again we will want to run all of the step 2.2 commands only if both of these checks pass, so we add some more curly brackets for the step 2.2 commands after them: ``S/,/s/,|[`{[`]/{}``.
     92 
     93 Breaking down step 2.2, we know that our count is in the X register but we can't substitute there so we'll need to do all the work in the value register.
     94 
     95 * 2.2.1 - Swap the value register and X register
     96 * 2.2.2 - Increment the value register
     97 * 2.2.3 - Swap the value register and X register
     98 
     99 Steps 2.2.1 and 2.2.3 are easy and we've seen them before: `x`.
    100 
    101 Step 2.2.2 is interesting as we will need a subex that increments a value.
    102 Subexes can add the outputs of other subexes with the `+` operator.
    103 We want to add the input to 1, so we use `%` to copy across the input and then `~1~` to output a 1.
    104 Both of these outputs get captured by the `+` which goes after them and adds them, giving us the increment we desire: `s/%~1~+/`
    105 
    106 With this, we have `xs/%~1~+/x` for step 2.2 and ``S/,/s/,|[`{[`]/{xs/%~1~+/x}`` for the whole of step 2.
    107 
    108 Finally, we break down step 3
    109 
    110 * 3.1 - Are we at the end of the JSON input?
    111 * 3.2 - If we are, output the content of the X register
    112 
    113 Step 3.1 is very similar to step 1.1, so I won't break it down any further, hopefully it's clear that what we want is ``S//s/[`}]`]/{}`` so step 3.2 can go inside the curly brackets.
    114 
    115 Step 3.2 is very simple.
    116 stred has a command, `p`, to output the contents of the value register so to output what's in the X register, we use `xp`.
    117 
    118 This means the commands for step 3 are: ``S//s/[`]}`]/{xp}``.
    119 
    120 Because the three steps don't overlap on which parts of the JSON input they work with, the order we write them in is actually arbitrary.
    121 Since we developed our solution going step 1, then 2, then 3 I'll write it that way, but personally why I needed this script I found it easier to write the increment part first, so I did.
    122 stred is very good at letting you write your commands in the order you think of them, which is handy because although writing stred scripts is manageable, reading them is painful.
    123 
    124 Our final script is: ``` S//s/[`{[`]/{ds~0~x} S/,/s/,|[`{[`]/{xs/%~1~+/x} S//s/[`]}`]/{xp} ```
    125 
    126 ### Registers and Atoms
    127 Commands operate on data that is store in 'registers'. There are 5 registers: path, value, X, Y and Z.
    128 
    129 The path register stores the global path from the root node to the current token, the value register stores the token itself and the X, Y and Z registers are general purpose registers, good for keeping things for later.
    130 
    131 With the following JSON being read:
    132 ```
    133 {
    134   "nest1": {
    135     "nest2": "data"
    136   }
    137 }
    138 ```
    139 When the `"data"` token is read, the path register will be set to `"nest1""nest2"` and the value register will be set to `"data"`.
    140 
    141 Other than path and value being automatically updated with each value, they all function identically. Each of them stores a list of 'atoms'.
    142 For simple tokens (null, booleans, numbers and the starts and ends of objects and arrays) the tokens are themselves atoms. To better deal with strings though, they are split into smaller atoms.
    143 A string is comprised of a StringTerminal atom, usually written `"`, then an atom for each unicode codepoint (character) in the string, and finally another `"` atom.
    144 This is why the path I showed earlier was `"nest1""nest2"` instead of something like `["nest1", "nest2"]`.
    145 
    146 ### Substitutions
    147 
    148 The substitution commands are the most important by far.
    149 The substitution command `s` is always followed by a 'subex', usually wrapped in `/`.
    150 For example, to substitute a single atom for itself, we would use the command `s/./`.
    151 `.` is the subex for read in an atom and output it unchanged, so it's all we need.
    152 The `s` command makes changes to the value register based on the subex, but as well as transforming the input into an output, a subex will also either accept or reject just like a traditional regex.
    153 The command immediately after a substitution will only be run if the subex accepts.
    154 `p` is the print command, and the `-n` flag will disable the default behaviour of stred to print by default, so if we want to only print tokens with an empty path (i.e. that are part of the root node), we use `S//p`.
    155 Notice uppercase `S` is used to apply the substitution to the path register instead of the value register.
    156 Running `stred -n 'S//p'` on the JSON above would simply output `{}`, as the root tokens are just beginning and ending the object.
    157 
    158 ### Substitute Expressions
    159 
    160 The power of stred comes from substitute expressions, or subexes for short.
    161 They operate on a list of atoms, and will either produce a new list of atoms, or reject.
    162 
    163 #### Basic subexes
    164 
    165 The simplest subexes are literals. These just copy directly from the input to the output.
    166 
    167 | Syntax | Description |
    168 | --- | --- |
    169 | `.` | Copy across any single atom unchanged |
    170 | `,` | Copy across any single JSON value (not `{`, `}`, `[` or `]` tokens) unchanged (will copy a whole string). Equivalent to `` `null`\|?\|%\|# `` |
    171 | `?` | Copy across any single boolean atom |
    172 | `%` | Copy across any single number |
    173 | `_` | Copy across a single unicode codepoint inside a string |
    174 | `"` | Copy across a string terminal (start or end of a string) |
    175 | `#` | Copy across a string value. Equivalent to `"_{-0}"` |
    176 | `` `<null\|bool\|number\|terminal>` `` | Copy across a specific boolean (true/false), a specific number or a specific object/array terminal |
    177 | `<character>` | Copy across the specific character |
    178 | ``[a-z`null`]`` | Will match and copy across any of the specified literals. A `-` can be used to create a range of characters |
    179 
    180 If the input doesn't match then the subex rejects.
    181 For example, `a` copies across specifically the character `a` inside a string, anything else will reject.
    182 Literals like `%` will copy across any number but will reject an atom like `a` that is not a number.
    183 
    184 #### Concatenation and Alternation
    185 
    186 Writing a series of subexes one after another will concatenate them.
    187 This means that the first reads some atoms in and writes some out, then the second etc.
    188 The first subex effectively splits the input into a chunk that it will transform, and a chunk for everything else to transform which gets passed over to the second subex.
    189 If either reject, the whole subex rejects.
    190 
    191 Putting `|` between two subexes will try two options, so `"yes"|"no"` will accept and copy across either the string `"yes"` or the string `"no"` but nothing else.
    192 
    193 #### Replacement subexes
    194 
    195 The above are good for matching, but don't allow for transformation, so we can also add and remove with subexes.
    196 
    197 | Syntax | Description |
    198 | --- | --- |
    199 | `=<content>=` | Accepts no input but outputs `content`, which is made up of specific literals |
    200 | `<drop>:<content>:` | Runs `drop`, accepting input as it does, then discards any output and outputs `content` instead |
    201 | `[a-z=A-Z]` | Accepts one of the characters from left of `=` as input and outputs a corresponding character from the right as output. If there are more input options than output, the output range is looped |
    202 
    203 A useful construct is ``` [`{``}`=`[``]`] ``` which will replace an object terminal with the corresponding array terminal.
    204 To make using these literals easier, they can be listed inside a single pair of `` ` ``.
    205 `` `1 2 {}null` `` is equivalent to ``` `1``2``{``}``null` ```.
    206 
    207 There a few other helpful shortcuts for the output syntax
    208 
    209 | Syntax | Equivalent |
    210 | --- | --- |
    211 | `~true~` | ``=`true`=`` |
    212 | `^hello^` | `="hello"=` |
    213 
    214 #### Slots
    215 
    216 To allow for rearranging data, there is a way to store the output of a subex into a 'slot', which can then be referenced later.
    217 
    218 `<will_be_stored>$a` will store the output of `will_be_stored` into the slot `a`.
    219 These slots can be referenced in the output syntax so to output the contents of `a` we would use `=$a=`.
    220 As an example, to swap two atoms, we can use `.$a.=$a=`.
    221 Read: read the first atom and store it in slot `a`, then copy across the second atom, then output the content of slot `a` (which is the first atom).
    222 
    223 #### Repetition
    224 
    225 An extremely versatile feature of regexes and subexes is repetition.
    226 In subexes this is done with a postfix operator `{...}` where `...` defines the acceptable number of repetitions.
    227 Regex has an operator that 'repeats' either once or zero times, and is greedy so if once and zero times are both valid it will use once.
    228 In subex, this would look like `{1,0}`. The priority is ordered left to right so if we didn't want it to be greedy we would use `{0,1}`.
    229 If we wanted to repeat something as many times, up to a maximum of 5, we would use a range `{5-0}`.
    230 If we want unbounded repetition, we just leave out the maximum `{-0}`.
    231 
    232 #### Arithmetic
    233 
    234 Subexes can also do arithmetic by operating on the output of the subex before.
    235 
    236 | Syntax | Description |
    237 | --- | --- |
    238 | `+` | Sum |
    239 | `*` | Product |
    240 | `-` | Negate all atoms individually |
    241 | `/` | Take the reciprocal of all atoms individually |
    242 | `!` | Boolean NOT all atoms individually |
    243 
    244 These will all try to cast types to work as best they can.
    245 
    246 For example, to sum all atoms in the input and output the result, we use `.{-0}+`.
    247 We read in and copy out as many atoms as possible (all of them), and then we sum all that output.
    248 
    249 ### Commands
    250 
    251 With an understanding of subexes, we can look at the stred commands
    252 
    253 | Command | Description |
    254 | --- | --- |
    255 | `s/<subex>/` | Run `subex` on the value register. If it accepts, replace the value register with the output. If not skip the next command |
    256 | `S/<subex>/` | Same as `s` but on the path register |
    257 | `f/<subex>/` | Shorthand for `S/<subex>.{-0}/` |
    258 | `F/<subex>/` | Shorthand for `S/<subex>(.{-0}::)/` |
    259 | `l/<subex>/` | Shorthand for `S/.{0-}<subex>/` |
    260 | `L/<subex>/` | Shorthand for `S/.{0-}::<subex>/` |
    261 | `a/<subex>/` | Shorthand for `s/<subex>\|.{-0}/` |
    262 | `A/<subex>/` | Shorthand for `S/<subex>\|.{-0}/` |
    263 | `p` | Print whatever is in the value register |
    264 | `d` | Delete whatever is in the value register |
    265 | `D` | Delete whatever is in the path register |
    266 | `n` | Skip to the next token |
    267 | `N` | Load the next token and append it's value to the value register and replace the path register with the new token's path |
    268 | `o` | Do nothing |
    269 | `x` | Swap the value register with the X register |
    270 | `X` | Append the contents of the value register to the X register |
    271 | `y` | Swap the value register with the Y register |
    272 | `Y` | Append the contents of the value register to the Y register |
    273 | `z` | Swap the value register with the Z register |
    274 | `Z` | Append the contents of the value register to the Z register |
    275 | `k` | Swap the value register with the path register |
    276 | `K` | Append the contents of the value register to the path register |
    277 | `:` | Create a label. The character after `:` is used to label this point in the program, e.g. `:a` creates a label called `a` |
    278 | `b` | Branch to a label. The character after `b` is the label to branch to, e.g. `ba` branches to label `a` |