2019-12-10 J parsing using a sequential machine

In a comment on my last post about the J programming language, Peter Kotrčka mentioned a flaw in the simple parsing trick I had used. I was basically assuming that the numbers in my input were separated by exactly one newline (or other J “word”).

=> J programming language | Peter Kotrčka

I had spent some time trying to implement a parser that just looks for digits and ignores everything else, but failed. There’s an example of a sequential machine in their help pages, but I couldn’t get it to simply emit a list of numbers.

=> sequential machine

Peter made me return to that parser and I think I finally understood how it works! Thanks. 🙂

First, create an array mapping every input byte to a code. In this case, we create an array with 256 zeroes, and then we set the code to 1 for every digit.

m=: 256$0
m=: 1 (a.i.'0123456789')}m

The result looks something like this: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... – as you can see there are a bunch of ones in there. 🙂

Next we need a state transition table. It has 3 dimensions.

the first axis is the current state we’re in: 0 means we’re skipping stuff, 1 means we’re reading a number – we don’t need any more states

the second axis is the current code we’re looking at: 0 means we’re looking at some byte that’s not a digit, 1 means we’re looking at a digit (this is where m is used)

the third axis is simply two integers: the first one is the new state (we only have 0 or 1 to choose from), and the second one is what to do (the output code: 0 means doing nothing, 1 means start a word, 3 means emit a word and reset)

OK, so here’s how I built it:

NB.        state 0  state 1
s=: 2 2 2$ 0 0 1 1  0 3 1 0

Or graphically:

<"1 s
┌───┬───┐
│0 0│1 1│
├───┼───┤
│0 3│1 0│
└───┴───┘

Rows are the current state, columns are the input code we’re looking at.

Can you see it? I started thinking about it like this:

given a string such as 42 16, starting at index 0, in state 0, with j being the beginning of a word pointing at -1 (in other words, no word is started) ...

we look at ’4’ and get the new state from our m: 1 (as we’re looking at a digit)

given this information, we need to look at the cell [1 1] (current state 0, input code 1)

the new state is set to 1, and output code 1 means we begin a new word at the current index 0

we look at ’2’ and get the new state from our m: still 1

now we’re looking at the cell [1 0] (current state 1, input code 1)

the new state continues set to 1, and output code 0 means nothing happens

we look at the space character and get the new state from our m: 0

now we’re looking at the cell [0 3] (current state 1, input code 0)

the new state changes to 0, and output code 3 means we emit a word, starting from index 0 (that’s when we started it) up to the current position ("42") and then we set j to -1 again

And so on.

The output code 3 means we emit a word and we do not begin a new word. This is important at the end: we could have used the output code 2 but that means we emit a word and begin a new one (j remains set), and the result is that any trailing garbage at the end is turned into a word.

And now the parser works for everything:

(0;s;m) ;: '42x16xx1'
┌──┬──┬─┐
│42│16│1│
└──┴──┴─┘

You might be wondering about the 0 at the beginning of that parser definition. The answer is that this tells the parser what to do with the words it finds. 0 means that the words end up in boxes. 2 means the word index and length, for example:

(2;s;m) ;: '42x16xx1'
0 2
3 2
7 1

At last I understand it!

#J #Programming

Comments

(Please contact me if you want to remove your comment.)

⁂

I installed j701 on the App Store.

=> j701

I was also happy to learn:

J is written in portable C and is available for Windows, Linux, Mac, iOS, Android and Raspberry Pi. J can be installed and distributed for free. The source is provided under both commercial and GPL 3 licenses.

=> source

– Alex Schroeder 2019-12-13 00:31 UTC

When installing j901 on Debian, I had to install the following:

libqt5core5a
libqt5webkit5
libqt5websockets5
libqt5multimediawidgets5

I don’t know what I should have installed instead.

– Alex Schroeder 2019-12-13 17:55 UTC

Proxy Information

Original URL: gemini://alexschroeder.ch/2019-12-10_J_parsing_using_a_sequential_machine
Status Code: Success (20)
Meta: text/gemini
Capsule Response Time: 165.295322 milliseconds
Gemini-to-HTML Time: 1.065545 milliseconds

This content has been proxied by September (3851b).