Discussion:
"Undershoot: parsing theory in 1965"
Jeffrey Kegler
2018-07-09 11:11:43 UTC
Permalink
"Undershoot: parsing theory in 1965"
<http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2018/07/knuth_1965_2.html>
is my newest blog post. It revisits the question of why parsing is
considered by many, especially in academia, to be a solved problem. A lot
of the background that I had to omit for brevity in previous posts is
included.
Rocky
2018-07-11 15:00:30 UTC
Permalink
This and the Timeline miss an aspect that drives how parsing evolves: its
applications, and advances and changes in technology.

For example, with the advent of PEGs and their implementations, it became
possible, or at least easier, to embed grammars into applications. These
grammars tend to be smaller than those typically used in, say, 1965.
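
To make that concrete, here is a minimal sketch of the kind of grammar
that is small enough to embed directly in an application. These are my
own toy parsing-expression combinators, invented for this example, not
any particular PEG library's API:

# Toy PEG-style combinators (hypothetical, for illustration only).
# Each parser takes the remaining input and returns
# (matched?, remaining-input).

def lit(s):
    return lambda t: (True, t[len(s):]) if t.startswith(s) else (False, t)

def chrcls(chars):                   # match one character from a class
    return lambda t: (True, t[1:]) if t and t[0] in chars else (False, t)

def seq(*ps):                        # match all parsers in order
    def run(t):
        rest = t
        for p in ps:
            ok, rest = p(rest)
            if not ok:
                return False, t      # fail without consuming input
        return True, rest
    return run

def choice(*ps):                     # PEG ordered choice: first match wins
    def run(t):
        for p in ps:
            ok, rest = p(t)
            if ok:
                return True, rest
        return False, t
    return run

def many(p):                         # zero or more repetitions
    def run(t):
        while True:
            ok, rest = p(t)
            if not ok:
                return True, t
            t = rest
    return run

def plus(p):                         # one or more repetitions
    return seq(p, many(p))

# A "little language": comma-separated key=value settings, e.g. "a=1,b=on".
letter = chrcls("abcdefghijklmnopqrstuvwxyz")
digit  = chrcls("0123456789")
ident  = plus(letter)
value  = choice(plus(digit), lit("on"), lit("off"))
pair   = seq(ident, lit("="), value)
pairs  = seq(pair, many(seq(lit(","), pair)))

print(pairs("a=1,b=on"))             # -> (True, '') : whole input matched

The point is only the scale: the whole "grammar" is a dozen lines living
inside the host program.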

We have more "dynamic" languages than in 1965, when the category consisted
of perhaps just LISP and APL (and possibly SNOBOL). The ability to hook in
modules and packages via packaging systems is, if not greatly expanded, at
least in much wider use. So there is more need for "little languages",
which means more need for grammars for them.

For those who haven't read Russ Cox's articles on regular expressions
<https://swtch.com/~rsc/regexp>, I heartily recommend them. I think there
is an analogue in the parsing arena. With the advent of LEX (a decade
after 1965), it was taken for granted that you'd spend the up-front
preprocessing time to convert an NFA into a DFA, even if in pathological
cases that meant exponential time and space. That makes sense for a
compiler, but one of Russ's observations is that it might not make sense
for current technology or current uses of regular expressions.
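
The pathological case is easy to exhibit. The classic example is the
language of strings over {a,b} whose n-th character from the end is 'a':
an NFA needs only n+1 states, but the subset construction produces about
2^n DFA states, because a DFA must remember the last n characters. A toy
demonstration (my own encoding of the NFA, assumed for illustration):

# NFA for "(a|b)* a (a|b)^(n-1)": n-th character from the end is 'a'.
def nfa(n):
    # State 0 loops on both symbols; on 'a' it also steps to state 1;
    # states 1..n-1 step forward on either symbol; state n accepts.
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n):
        delta[(i, 'a')] = {i + 1}
        delta[(i, 'b')] = {i + 1}
    return delta

# Standard subset construction: explore all reachable sets of NFA states.
def subset_construction(delta):
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        S = todo.pop()
        for c in 'ab':
            T = frozenset(q for s in S for q in delta.get((s, c), ()))
            if T and T not in seen:
                seen.add(T)
                todo.append(T)
    return seen

for n in (4, 8, 12, 16):
    print(n, len(subset_construction(nfa(n))))
# 4 16, 8 256, 12 4096, 16 65536 -- the DFA doubles with each NFA state.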

And this is related to the mention of Knuth and big-O notation in
Undershoot: big-O notation is too coarse, and sometimes you should look
into the details a bit more.

One of the things you can't do with a DFA that is easily done with an NFA
is parallelize it. Nowadays, with multi-core processors and languages that
support multiple threads, it becomes more advantageous to use an NFA,
skipping the complexity and preprocessing time needed to build a DFA.
Leaving threading aside: if, in contrast to a compiler, the parsing is to
be performed on a small amount of text, it also might not make sense to
spend the setup time needed to build a DFA.
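
For contrast, here is what skipping the DFA looks like: a Thompson-style
simulation that just carries the set of live NFA states across the input,
with no setup at all (same toy NFA encoding as above, inlined for n = 3;
the per-state work in the inner step is the part that could in principle
be farmed out to threads):

# Direct NFA simulation: no DFA construction, worst case
# O(len(text) * number-of-NFA-states).  This toy NFA accepts strings
# over {a,b} whose 3rd character from the end is 'a'.
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0},
         (1, 'a'): {2},    (1, 'b'): {2},
         (2, 'a'): {3},    (2, 'b'): {3}}
accept = {3}

def nfa_match(delta, accept, text):
    live = {0}                                   # start state
    for c in text:
        # Each source state's step is independent of the others,
        # which is what makes this amenable to parallel execution.
        live = {q for s in live for q in delta.get((s, c), ())}
        if not live:                             # no state survives
            return False
    return bool(live & accept)

print(nfa_match(delta, accept, "abaab"))   # True:  3rd from the end is 'a'
print(nfa_match(delta, accept, "babba"))   # False: 3rd from the end is 'b'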

The analogous situation holds in parsing. Compare the simple, even
simple-minded, Earley parser with the many {S,LA}LR parsers, such as YACC.
The latter also spend time tracking down follow sets and figuring out how
to compact the resulting sparse tables.

As with DFAs, LALR parsing is not as easily parallelizable as Earley
parsing. And in "little language" applications, where again large strings
might not be parsed, it probably isn't worth the setup time to "optimize"
or compact the tables; better to leave things simple and go with brute
force.
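
To show how little setup that brute force needs, here is a minimal Earley
recognizer (a toy sketch of my own, nothing like Marpa's internals): no
follow sets, no table compaction, just prediction, scanning, and
completion run to a fixed point over item sets:

# Minimal Earley recognizer (toy sketch).  An item is
# (lhs, rhs, dot, origin); nonterminals are the grammar's keys.
def earley_recognize(grammar, start, tokens):
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        added = True
        while added:                              # iterate to a fixed point
            added = False
            for item in list(chart[i]):
                lhs, rhs, dot, origin = item
                if dot < len(rhs) and rhs[dot] in grammar:
                    # Predict: expand the nonterminal after the dot.
                    for prod in grammar[rhs[dot]]:
                        new = (rhs[dot], prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); added = True
                elif dot == len(rhs):
                    # Complete: advance items waiting on this nonterminal.
                    for (l2, r2, d2, o2) in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            new = (l2, r2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); added = True
        if i < n:
            # Scan: match the next input token against terminals.
            for (lhs, rhs, dot, origin) in chart[i]:
                if dot < len(rhs) and rhs[dot] == tokens[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return any((start, rhs, len(rhs), 0) in chart[n]
               for rhs in grammar[start])

# Toy grammar: E -> E '+' E | 'n'  (ambiguous, and Earley doesn't care).
grammar = {'E': [('E', '+', 'E'), ('n',)]}
print(earley_recognize(grammar, 'E', ['n', '+', 'n', '+', 'n']))  # True

Note that there is no grammar preprocessing at all, which is exactly the
point.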

Finally, the last thing I feel is missing from the timeline is, of course,
the aspect I have most recently been interested in: decompilation. Here
grammars are used in an old way, sort of, but with a new twist.

The Earley algorithm was originally proposed for human languages. I think
it is fair to say that nowadays it has largely been supplanted there by
machine learning and Markov-chain kinds of approaches, so deep grammars
often haven't been developed for those languages.

But after a bit of experience and reflection, I find the decompilation
problem is really more akin to the older natural-language problem. Here
what you are trying to do is cover, with a grammar, the set of strings you
actually see. It is okay if the grammar accepts instruction sequences
you'll never see; it just needs to be correct on the instruction sequences
that *do* appear. In this scenario the Earley algorithm is helpful in that
there is minimal fuss in dealing with ambiguity, and it can show you the
several possibilities that exist for a given sequence of instructions.
With this, what you typically do is disallow derivations that would be
wrong and let the parser choose for itself among the remaining
possibilities.
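
A toy illustration of that workflow (the "instruction" grammar here is
invented for the example, not taken from any real decompiler): enumerate
every derivation the grammar allows for one token sequence, then prune
the readings you know are wrong:

# Hypothetical instruction grammar with two readings of load,load,op:
# either load followed by a binary op, or a load-pair idiom plus an op.
grammar = {
    'stmt':     [('load', 'binop'), ('loadpair', 'op')],
    'binop':    [('load', 'op')],
    'loadpair': [('load', 'load')],
}
tokens = ('load', 'load', 'op')

def parses(sym, i, j):
    """All derivation trees for tokens[i:j] from symbol sym."""
    if sym not in grammar:                       # terminal symbol
        return [sym] if (j == i + 1 and tokens[i] == sym) else []
    return [(sym,) + tuple(kids)
            for rhs in grammar[sym]
            for kids in splits(rhs, i, j)]

def splits(rhs, i, j):
    """All ways to derive tokens[i:j] from the symbols of rhs, in order."""
    if not rhs:
        return [[]] if i == j else []
    return [[left] + rest
            for k in range(i + 1, j + 1)
            for left in parses(rhs[0], i, k)
            for rest in splits(rhs[1:], k, j)]

for tree in parses('stmt', 0, len(tokens)):
    print(tree)
# ('stmt', 'load', ('binop', 'load', 'op'))
# ('stmt', ('loadpair', 'load', 'load'), 'op')

# "Disallow derivations that would be wrong": if this architecture has
# no load-pair idiom, reject that reading and the other one wins.
good = [t for t in parses('stmt', 0, len(tokens))
        if 'loadpair' not in str(t)]
print(good)   # [('stmt', 'load', ('binop', 'load', 'op'))]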

In the last paragraph I have tried to justify why there is a reemergence
of the need for the Earley algorithm, even though its use is a little
different from what it had largely been used for, if not developed for, in
the past. If you buy this need, then making improvements to the algorithm
via memoization or parallelization is desirable.