From ed32e06578bdc390a2c37a9e915565adc9a70cba Mon Sep 17 00:00:00 2001
From: Charlie Stanton
Date: Tue, 25 Apr 2023 13:18:35 +0100
Subject: Adds a draft README

---
 README.md | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 166 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..57fd3a3
--- /dev/null
+++ b/README.md
@@ -0,0 +1,166 @@
+# stred
+## Streaming Tree Editor
+
+This README is a work in progress. It is not very clear and is also probably out of date.
+
+Stred is like sed but for tree data structures (currently only JSON).
+It has a lot in common with the pattern: `gron | sed | ungron`.
+
+An input JSON file is decoded into lexical tokens, and then a program is run on each of these tokens.
+These programs are simply a list of commands.
+
+## Why?
+
+jq is alright, but it's a full-on language; it's massive.
+stred aims to be much more like a tool than a language; whether it succeeds in that aim, we'll see.
+I'm hoping it will be capable of nearly as much as jq, with comparable or better performance, but far easier to learn and use.
+
+## Usage
+
+### Registers and Atoms
+Commands operate on data that is stored in 'registers'. There are 3 registers: path, value and X.
+
+The path register stores the global path from the root node to the current token, the value register stores the token itself and the X register is a general purpose register, good for keeping things for later.
+
+With the following JSON being read:
+```
+{
+    "nest1": {
+        "nest2": "data"
+    }
+}
+```
+When the `"data"` token is read, the path register will be set to `"nest1""nest2"` and the value register will be set to `"data"`.
+
+Other than path and value being automatically updated with each token, they all function identically. Each of them stores a list of 'atoms'.
+For simple tokens (null, booleans, numbers and the starts and ends of objects and arrays) the tokens are themselves atoms.
+To better deal with strings though, they are split into smaller atoms.
+A string is made up of a StringTerminal atom, usually written `"`, then an atom for each Unicode codepoint (character) in the string, and finally another `"` atom.
+This is why the path I showed earlier was `"nest1""nest2"` instead of something like `["nest1", "nest2"]`.
+
+### Substitutions
+
+The substitution commands are the most important by far.
+The substitution command `s` is always followed by a 'subex', usually wrapped in `/`.
+For example, to substitute a single atom for itself, we would use the command `s/./`.
+`.` is the subex that reads in an atom and outputs it unchanged, so it's all we need.
+The `s` command makes changes to the value register based on the subex. As well as transforming the input into an output, a subex will also either accept or reject, just like a traditional regex.
+The command immediately after a substitution will only be run if the subex accepts.
+`p` is the print command, and the `-n` flag disables stred's default behaviour of printing, so if we want to only print tokens with an empty path (i.e. that are part of the root node), we use `S//p`.
+Notice that uppercase `S` is used to apply the substitution to the path register instead of the value register.
+Running `stred -n 'S//p'` on the JSON above would simply output `{}`, as the only root tokens are the beginning and end of the object.
+
+### Substitute Expressions
+
+The power of stred comes from substitute expressions, or subexes for short.
+They operate on a list of atoms, and will either produce a new list of atoms, or reject.
+
+#### Basic subexes
+
+The simplest subexes are literals. These just copy directly from the input to the output.
+
+| Syntax | Description |
+| --- | --- |
+| `.` | Copy across any single atom unchanged |
+| `,` | Copy across any single value unchanged (will copy a whole string) |
+| `?` | Copy across any single boolean atom |
+| `%` | Copy across any single number |
+| `_` | Copy across a single Unicode codepoint inside a string |
+| `"` | Copy across a string terminal (start or end of a string) |
+| `#` | Copy across a string value. Equivalent to `"_{-0}"` |
+| `\`\`` | Copy across a specific boolean (true/false), a specific number or a specific object/array terminal |
+| `` | Copy across the specific character |
+| `[a-z\`null\`]` | Will match and copy across any of the specified literals. A `-` can be used to create a range of characters |
+
+If the input doesn't match then the subex rejects.
+For example, `a` copies across specifically the character `a` inside a string; anything else will reject.
+Literals like `%` will copy across any number but will reject an atom like `a` that is not a number.
+
+#### Concatenation and Alternation
+
+Writing a series of subexes one after another will concatenate them.
+This means that the first reads some atoms in and writes some out, then the second does the same, and so on.
+The first subex effectively splits the input into a chunk that it will transform, and a chunk for everything else to transform which gets passed over to the second subex.
+If either rejects, the whole subex rejects.
+
+Putting `|` between two subexes will try two options, so `"yes"|"no"` will accept and copy across either the string `"yes"` or the string `"no"` but nothing else.
+
+#### Replacement subexes
+
+The above are good for matching, but don't allow for transformation, so we can also add and remove atoms with subexes.
+
+| Syntax | Description |
+| --- | --- |
+| `==` | Accepts no input but outputs `content`, which is made up of specific literals |
+| `::` | Runs `drop`, accepting input as it does, then discards any output and outputs `content` instead |
+| `[a-z=A-Z]` | Accepts one of the characters to the left of `=` as input and outputs the corresponding character from the right. If there are more input options than output, the output range is looped |
+
+A useful construct is `[\`{\`\`}\`=\`[\`\`]\`]`, which will replace an object terminal with the corresponding array terminal.
+To make using these literals easier, they can be listed inside a single pair of `\``.
+`\`1 2 {}null\`` is equivalent to `\`1\`\`2\`\`{\`\`}\`\`null\``.
+
+There are a few other helpful shortcuts for the output syntax:
+
+| Syntax | Equivalent |
+| --- | --- |
+| `~true~` | `=\`true\`=` |
+| `^hello^` | `="hello"=` |
+
+#### Slots
+
+To allow for rearranging data, there is a way to store the output of a subex into a 'slot', which can then be referenced later.
+
+`$a` will store the output of `will_be_stored` into the slot `a`.
+These slots can be referenced in the output syntax, so to output the contents of `a` we would use `=$a=`.
+As an example, to swap two atoms, we can use `.$a.=$a=`.
+That is: read the first atom and store it in slot `a`, then copy across the second atom, then output the content of slot `a` (which is the first atom).
+
+#### Repetition
+
+An extremely versatile feature of regexes and subexes is repetition.
+In subexes this is done with a postfix operator `{...}` where `...` defines the acceptable number of repetitions.
+Regex has an operator that 'repeats' either once or zero times, and is greedy, so if once and zero times are both valid it will use once.
+In subex, this would look like `{1,0}`. The priority is ordered left to right, so if we didn't want it to be greedy we would use `{0,1}`.
+If we wanted to repeat something as many times as possible, up to a maximum of 5, we would use a range `{5-0}`.
+If we want unbounded repetition, we just leave out the maximum: `{-0}`.
+
+#### Arithmetic
+
+Subexes can also do arithmetic by operating on the output of the preceding subex.
+
+| Syntax | Description |
+| --- | --- |
+| `+` | Sum |
+| `*` | Product |
+| `-` | Negate all atoms individually |
+| `/` | Take the reciprocal of all atoms individually |
+| `!` | Boolean NOT all atoms individually |
+
+These will all try to cast types to work as best they can.
+
+For example, to sum all atoms in the input and output the result, we use `.{-0}+`.
+We read in and copy out as many atoms as possible (all of them), and then we sum all that output.
+
+### Commands
+
+With an understanding of subexes, we can look at the stred commands:
+
+| Command | Description |
+| --- | --- |
+| `s//` | Run `subex` on the value register. If it accepts, replace the value register with the output. If not, skip the next command |
+| `S//` | Same as `s` but on the path register |
+| `f//` | Shorthand for `S/.{-0}/` |
+| `F//` | Shorthand for `S/(.{-0}::)/` |
+| `l//` | Shorthand for `S/.{0-}/` |
+| `L//` | Shorthand for `S/.{0-}::/` |
+| `a//` | Shorthand for `s/|.{-0}` |
+| `A//` | Shorthand for `S/|.{-0}` |
+| `p` | Print whatever is in the value register |
+| `d` | Delete whatever is in the value register |
+| `D` | Delete whatever is in the path register |
+| `n` | Skip to the next token |
+| `N` | Load the next token and append its value to the value register and replace the path register with the new token's path |
+| `o` | Do nothing |
+| `x` | Swap the value register with the X register |
+| `X` | Append the contents of the value register to the X register |
+| `k` | Swap the value register with the path register |
+| `K` | Append the contents of the value register to the path register |
--
cgit v1.2.3