Parsing

Parsing steps

The goal of the Lawtext parser is to extract the law content that the original Lawtext expresses and obtain it as StdEL. The parser processes the Lawtext in three steps to achieve that goal:

  1. Parse the Lawtext into a CST (Concrete Syntax Tree). The CST is a sequence of Lines.
  2. Analyze the indentation of Lines and insert indent (one increase of indent level) / dedent (one decrease of indent level) markers between the lines. The output of this step is a sequence of VirtualLines.
  3. Parse the VirtualLines into a StdEL tree.

You can find the entry point of the Lawtext parser at (core/src/parser/)lawtext.parse().

Syntax trees: Lines and VirtualLines

The Lawtext CST (Concrete Syntax Tree) is a sequence of Lines (library documentation here). A Line represents a physical line in the Lawtext string and contains all detailed features in the original string (not included in the resulting StdEL tree), such as verbose blank lines, meaningless whitespaces, and physical position (line and character index).

Instead of directly parsing the Lawtext into StdEL, we adopted the intermediate CST representation (Lines) for the following reasons.

  • We can simplify the following process of parsing Lines into a StdEL tree because the Lawtext syntax has a line-based structure.
  • We can use the same CST to render the law into Lawtext and format Lawtext.
  • We need detailed position information to provide a language server in source code editors such as Visual Studio Code.

During parsing the CST into a StdEL tree, the parser first analyzes the structure of indented blocks (indent levels) and inserts indent (one increase of indent level) / dedent (one decrease of indent level) markers between the lines. The resulting structure is a sequence of VirtualLines (library documentation here), and the main part of parsing is conducted on the sequence of VirtualLines.

We made the indentation analysis separate from building CST because the virtual (logical) indent levels in Lawtext do not always match the physical (visual) indent levels. For example, the article caption line in the Lawtext sample above (␣␣(目的等)) has one full-width (two ASCII-width) whitespace indentation. This whitespace complies with the layout rule of traditional Japanese law texts for visual compatibility. However, the article caption line does not make a new indentation level, and we have to consider the context of the lines to understand the line is an article caption, not a part of indented lines. Therefore, the Lawtext parser first parses the physical indent levels (and output Lines) and then detects the virtual indent levels (and output VirtualLines).

Parsing algorithm

For steps 1 and 3, the main process of parsing is recursive descent parsing of a string (a sequence of characters) and VirtualLines, respectively. We utilize a generic recursive descent parser generic-parser (opens in a new tab) for both steps.

How to try

Try it out here

  1. On this page, open the browser console (for Chrome, press Ctrl+Shift+J (Windows/Linux) or Cmd+Opt+J (Mac)).
  2. Run the following command in the browser console:
    lawtext.run({
        input: { elaws: "405AC0000000088" },
        outtypes: ["lawtext"],
    })
        .then(r => {
            console.log("Lawtext:");
            console.log(r.lawtext);
            return lawtext.run({
                input: { lawtext: r.lawtext },
                outtypes: ["json"],
                controlel: true,
            });
        })
        .then(r => {
            console.log("\u{2705} Parsed VirtualLines:");
            console.log(r.virtuallines);
            console.table(r.virtuallines.map(v => ({
                "VirtualLine.type": v.type,
                ...Object.fromEntries(v.line?.rangeTexts().map(([, t, d], i) => [`rangeTexts()[${i}]`, t]) ?? []),
            })));
            console.log("\u{2705} Parsed StdEL (JsonEL):");
            console.log(r.json);
            console.log(JSON.stringify(r.json, undefined, 2));
        });

Hint: run console.log(lawtext.run.help) to show the help.

Try using the CLI

Please see CLI usage and run CLI with a lawtext input and json output with controlel option (and virtuallinesout option when running in Node.js) to get the parsed VirtualLines and the parsed StdEL (JsonEL) representation.

Try using the Visual Studio Code extension