Discussion:
Handling long lists
Rocky
2018-06-22 11:58:44 UTC
Permalink
Here's a parsing oddity that came up in doing this Python decompiler,
and I wonder if there's a good solution for it.

Python bytecode, like a number of other bytecodes, is reverse Polish
(postfix), so the operator comes after all of the operands are seen.
To handle long lists of data in Python you can write

x = [1, 2, 3, 10, 20, ... ]


and the bytecode in some versions of Python might be

LOAD_CONST 1
LOAD_CONST 2
LOAD_CONST 10
LOAD_CONST 20
...
BUILD_LIST_1000

In pathological situations these lists can get arbitrarily long.

The build-list opcode has a count of how many operands it expects, which
above I fold into the nonterminal. A grammar rule gets created at
runtime:

expr ::= build_list_1000
build_list_1000 ::= expr expr expr ... BUILD_LIST_1000
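For concreteness, the runtime rule generation described above can be sketched like this (the function name is hypothetical; the real decompiler's API differs):

```python
def build_list_rule(n):
    """Generate a grammar rule for a BUILD_LIST opcode seen with n operands.

    For example, n=3 yields:
        build_list_3 ::= expr expr expr BUILD_LIST_3
    """
    operands = " ".join(["expr"] * n)
    return "build_list_%d ::= %s BUILD_LIST_%d" % (n, operands, n)
```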

Finally I get to the problem: in an Earley algorithm parser, to recognize
this there would be 1000 states, one for each place where the dot could be
in parsing build_list_1000. How would one reduce the number of intermediate
states needed in parsing?


In theory there's always a solution using procedural parsing, but this can
get quite involved, I think. Above, the expressions were simple constants,
but in practice an operand could be an arbitrary expression, which is
something grammars are good at capturing and that I'd not want to redo by
writing procedural code.
--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Rocky
2018-06-22 12:09:25 UTC
Permalink
To be clear on one thing: it's not simply that there are 1,000 states
before the reduction, but that at each scan step all of them get updated.
So we have quadratic behaviour: after seeing the second expr we update 1
prior state, then on seeing another expr we update 2 prior states, then 3
prior states, and so on. So in general we'll have n*(n-1)/2 updates. Ick.
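(That update count is the triangular number 1 + 2 + ... + (n-1); a quick sanity check in Python:)

```python
def earley_updates(n):
    # One update per earlier Earley item each time an expr completes:
    # 1 + 2 + ... + (n-1), which sums to n*(n-1)/2.
    return sum(range(1, n))
```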
Lukas Atkinson
2018-06-22 12:30:11 UTC
Permalink
Hi Rocky,

I don't think there's a way to avoid all those parser states.

However, trying to parse a BUILD_LIST_1000 rule seems very unusual to me,
since you'd have to create a BUILD_LIST<N> rule for all N. It's clearly not
workable to spell out all those rules in the grammar. But you can't
generalize to a pseudo-BNF rule like build_list(N) ::= <expr>^N
"BUILD_LIST_"<N> because that is not a context-free grammar.

Trying to directly parse an assembly-like notation back into an AST-like
structure is not a normal use for a parser, and you will have difficulties
with things like jumps, loops, and dup instructions. This is a case where
not using Marpa and hand-rolling a program will be much easier (possibly
using Marpa to parse the disassembled opcodes into a data structure, but
that sounds like overkill). This program would mostly manipulate AST
fragments on a stack, and can easily deal with variable-arity operations
like BUILD_LIST_N (by popping the top n expressions from the output stack).
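The stack-manipulating approach above can be sketched as a toy interpreter over (opcode, argument) pairs (names and the AST tuple shape here are made up; real bytecode handling is far more involved):

```python
def opcodes_to_ast(opcodes):
    # Toy illustration of the stack-based approach: constants push AST
    # leaves, and a BUILD_LIST_<n> opcode pops its top n fragments and
    # pushes a single list node built from them.
    stack = []
    for op, arg in opcodes:
        if op == "LOAD_CONST":
            stack.append(("const", arg))
        elif op.startswith("BUILD_LIST_"):
            n = int(op.rsplit("_", 1)[1])
            items = stack[len(stack) - n:]
            del stack[len(stack) - n:]
            stack.append(("list", items))
    return stack
```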
Rocky Bernstein
2018-06-22 12:47:02 UTC
Permalink
Post by Lukas Atkinson
Hi Rocky,
I don't think there's a way to avoid all those parser states.
However, trying to parse a BUILD_LIST_1000 rule seems very unusual to me,
since you'd have to create a BUILD_LIST<N> rule for all N.
I first scan opcodes for those that list the number of operands they have;
then custom grammar rules are created for those.

Post by Lukas Atkinson
It's clearly not workable to spell out all those rules in the grammar. But
you can't generalize to a pseudo-BNF rule like build_list(N) ::= <expr>^N
"BUILD_LIST_"<N> because that is not a context-free grammar.
You're probably thinking about this from a different but historically more
prevalent perspective. I'm building a custom grammar based on the input
that happens to appear. I'm taking advantage of one of the beautiful things
about PEG grammars - that they can be created and run at run time and
therefore tailored to specific use cases.
Post by Lukas Atkinson
Trying to directly parse an assembly-like notation back into an AST-like
structure is not a normal use for a parser, and you will have difficulties
with things like jumps, loops, and dup instructions.
Although I need to redo how control flow is done, in fact there has been a
decompiler for Python for a decade and a half which works off of an Earley
algorithm parser. The deparser that I've taken over currently works for
Python from version 1.3, circa 1995, to the present 3.7. If one were always
to stay within "normal use", progress would be seriously impeded. I'm
imagining someone telling Benjamin Franklin that flying a kite in a
thunderstorm is not the "normal use" of electricity.

At any rate you can read about this novel use of deparsing in
http://rocky.github.io/Deparsing-Paper.pdf


Post by Lukas Atkinson
This is a case where not using Marpa and hand-rolling a program will be
much easier (possibly using Marpa to parse the disassembled opcodes into a
data structure, but that sounds like overkill). This program would mostly
manipulate AST fragments on a stack, and can easily deal with
variable-arity operations like BUILD_LIST_N (by popping the top n
expressions from the output stack).
Actually, that's what the competing deparsers do. They are not as good
though :-)
Jeffrey Kegler
2018-06-22 13:11:44 UTC
Permalink
@Rocky: This is interesting to me. I've printed out your article and will
read it.

Lukas is right that natively Marpa (and just about any other parser) has
trouble with the "reverse Hollerith code" for lists. It actually makes the
grammar context-sensitive. But I think I might have some interesting ideas
about how to deal with this using Marpa.

Post by Rocky
I'm taking advantage of one of the beautiful things about PEG grammars -
that they can be created and run at run time and therefore tailored to
specific use cases.
Marpa can do this and is actually much better at it. Any grammar you enter
is parsed correctly, and any LR-regular grammar is parsed in linear time.

I get a good deal of heat (and I suspect many reddit downvotes) for
pointing out that PEG only correctly implements a grammar if it is LL(1).
Go beyond that, and you get a subset of the language that the BNF describes.

Congratulations btw on seeing the power of 2nd order languages. Perhaps
you'll try them out with Marpa and will find them much easier and safer.
Rocky
2018-06-24 00:20:13 UTC
Permalink
Post by Jeffrey Kegler
@Rocky: This is interesting to me. I've printed out your article and will
read it.
Lukas is right that natively Marpa (and just about any other parser) has
trouble with the "reverse Hollerith code" for lists. It actually makes the
grammar context-sensitive. But I think I might have some interesting ideas
about how to deal with this using Marpa,
Ah yes... I had forgotten about Hollerith codes in FORTRAN and how they
have long been used as an example of something that is not context-free.
Post by Jeffrey Kegler
I'm taking advantage of one of the beautiful things about PEG grammars -
that they can be created and run at run time and therefore tailored to
specific use cases.
Marpa can do this and is actually much better at it. Any grammar you
enter you get is parsed correctly, and any LR-regular grammar is parsed in
linear time.
One weird thing about my particular use case: because I'm handling
basically 20 different grammars in a hierarchical OO way (to reduce
duplication), many of which are only slightly different, the parsing tool I
have also has a way of *removing* grammar rules. And it has a way of
profiling grammar coverage.

How does Marpa do here? (Providing grammar profiling is in fact pretty
trivial to add).
Post by Jeffrey Kegler
I get a good deal of heat (and I suspect many reddit downvotes) for
pointing out that PEG only correctly implements a grammar if it is LL(1).
Go beyond that, and you get a subset of the language that the BNF describes.
Congratulations btw on seeing the power of 2nd order languages. Perhaps
you'll try them out with Marpa and will find them much easier and safer.
Actually this reminds me of a big favor I'd like to ask of someone who uses
Marpa.

Just a little background first to motivate this....

I'll be giving a talk at the Glasgow YAPC 2018, really to promote this idea
of decompiling at runtime to provide better position information.

I somewhat arbitrarily broke the topic up into two parts: describing
B::DeparseTree <https://metacpan.org/pod/B::DeparseTree> and showing how it
can be used in a gdb-like debugger I wrote for Perl, Devel::Trepan
<https://metacpan.org/pod/Devel::Trepan>.

As an incentive to have a talk presented, I often beef up the code for
whatever I'm going to talk about. For Devel::Trepan I thought a cool
addition would be to replace a lot of hacky code I have for certain
debugger commands and use Marpa to simplify some of that. But given the
amount of time I have spent and will have to spend on B::DeparseTree (see
Introducing B::DeparseTree and Rewriting B::Deparse, part 1
<http://blogs.perl.org/users/rockyb/2018/06/introducing-bdeparsetree-and-rewriting-bdeparse-part-1.html>),
I don't think I'll have time to work much on Devel::Trepan. Here's what I
had been planning to do if I had more time, and I hope someone might help
me out here.

A gdb-like debugger such as Devel::Trepan has (or should have) a notion of
a "location"; these are used in gdb's "list" and "breakpoint" commands.
Building on that are the concepts of a list range (in the "list" command)
and a breakpoint location (in the "break" command, which adds an optional
condition). These are all well defined in gdb.

In the corresponding Python gdb-like debugger, I have already done this. So
I already have grammars for these
<https://github.com/rocky/python3-trepan/blob/master/trepan/processor/parse/parser.py#L104-L146>
.

There is one deviation I make from gdb parsing. Rather than use some sort
of semantic lookup to distinguish between a function name and a file path,
I require that the former be suffixed with "()". Having used this for a
while, it really isn't a burden to add the () when needed, and it makes
things simpler and cleaner looking.
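A minimal sketch of that "()" convention (the function name and the example inputs are made up):

```python
import re

def classify_location(spec):
    # Per the convention above: a trailing "()" marks a function name;
    # anything without that suffix is treated as a file path.
    m = re.match(r"(.+)\(\)$", spec)
    if m:
        return ("function", m.group(1))
    return ("file", spec)
```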

I'd be eternally grateful if someone would volunteer to write the Marpa
equivalent grammar and enough of the semantics to capture the resulting
information in some sort of structure. To make things easier, you don't
have to integrate this into Devel::Trepan if you don't want to; it can just
be a standalone program. But if you want to add it, that's fine too ;-)


Having this done would cement Marpa and Devel::Trepan together, and I can
make a plug for it at YAPC 2018. (I was going to make a plug for it anyway,
as something I had intended to do. Of course it is better to show that it's
been done than to mention it as a todo item.)

Thanks everyone.
Jeffrey Kegler
2018-06-24 13:46:12 UTC
Permalink
Is anyone looking at this? At first glance it seems straightforward.

If nobody else looks at it, I might take time away from looking at
Marpa-driven combinator parsing and tackle it.
Post by Rocky
Post by Jeffrey Kegler
@Rocky: This is interesting to me. I've printed out your article and
will read it.
Lukas is right that natively Marpa (and just about any other parser) has
trouble with the "reverse Hollerith code" for lists. It actually makes the
grammar context-sensitive. But I think I might have some interesting ideas
about how to deal with this using Marpa,
Ah yes,... I had forgotten about Hollerith codes in FORTRAN and how it has
long been used as an example of something that is not context free. .
Post by Jeffrey Kegler
I'm taking advantage of one of the beautiful things about PEG grammars
- that they can be created and run at run time and therefore tailored to
specific use cases.
Marpa can do this and is actually much better at it. Any grammar you
enter you get is parsed correctly, and any LR-regular grammar is parsed in
linear time.
One thing weird about my particular use case is that because I'm handling
basically 20 different grammars in a hierarchically OO way (to reduce
duplication), many of which are only slightly different, the parsing tool I
have also has a way of *removing *grammar rules. And it has a way of
profiling grammar coverage.
How does Marpa do here? (Providing grammar profiling is in fact pretty
trivial to add).
Post by Jeffrey Kegler
I get a good deal of heat (and I suspect many reddit downvotes) for
pointing out that PEG only correctly implements a grammar if it is LL(1).
Go beyond that, and you get a subset of the language that the BNF describes.
Congratulations btw on seeing the power of 2nd order languages. Perhaps
you'll try them out with Marpa and will find them much easier and safer.
Actually this reminds me of a big favor I'd like to ask of someone who
uses Marpa.
Just a little background first to motivate this....
I'll be giving a talk at the Glasgow YAPC 2018, really to promote this
idea of decompiling at runtime to provide better position information.
I somewhat arbitrarily broke the topic up into two parts: describing B::DeparseTree
<https://metacpan.org/pod/B::DeparseTree>and showing how it can be used
in a gdb-like debugger I wrote for Perl Devel::Trepan
<https://metacpan.org/pod/Devel::Trepan>.
As part of an incentive to have a talk presented, I often beef up the code
for whatever I'm going to talk about. And for Devel::Trepan I thought that
a cool addition would be to
replace a lot of hacky code I have certain debugger commands and use Marpa
to simplify some of that. But given the amount of time I have spent and
will have to spend on B::DeparseTree see Rewriting B:Deparse and
Reintroducing B::DeparseTree and (part 1)
<http://blogs.perl.org/users/rockyb/2018/06/introducing-bdeparsetree-and-rewriting-bdeparse-part-1.html>,
I don't think I'll have time to work much on Devel::Trepan. Here's what I
had been
planning to do if I had more time, and I hope someone might help me out
here..
A gdb debugger like Devel::Trepan has (or should have) a notion of a
"location"; these are used in gdb "list" and "breakpoint" commands. But
building on that, are the concepts of a list range (in the "list"
command), and a breakpoint location (in the "break" command - it adds an
optional condition). These are all well defined in gdb.
In the corresponding Python gdb-like debugger, I have already done this.
So I already have grammars for these
<https://github.com/rocky/python3-trepan/blob/master/trepan/processor/parse/parser.py#L104-L146>
.
There is one deviation I make from gdb parsing. Rather than use some sort
of semantic lookup to distinguish between a function name and a file path,
I require the former be suffixed with "()". Having used this for a while it
really isn't a burden to add the () when needed and it makes things simpler
and cleaner looking.
I'd be eternally grateful if someone would volunteer to write the Marpa
equivalent grammar and enough of the semantics to capture the resulting
information in some sort of structure. To make things easier, you don't have
to integrate this into Devel::Trepan if you don't want to; it can just be a
standalone program. But if you want to add it, that's fine too ;-)
Having this done would cement Marpa and Devel::Trepan together, and I can
make a plug for it at YAPC 2018. (I was going to make a plug for it anyway,
as something I had intended to do. Of course it is better to show that it's
been done rather than mention it as a todo item.)
Thanks everyone.
Post by Jeffrey Kegler
On Fri, Jun 22, 2018 at 8:30 AM, Lukas Atkinson <
Post by Lukas Atkinson
Hi Rocky,
I don't think there's a way to avoid all those parser states.
However, trying to parse a BUILD_LIST_1000 rule seems very unusual to
me, since you'd have to create a BUILD_LIST<N> rule for all N.
I first scan the opcodes for those that carry the number of operands they
take; custom grammar rules are then
created for those.
It's clearly not workable to spell out all those rules in the grammar.
Post by Lukas Atkinson
But you can't generalize to a pseudo-BNF rule like build_list(N) ::=
<expr>^N "BUILD_LIST_"<N> because that is not a context-free grammar.
You're thinking about this probably from a different but historically
more prevalent perspective. I'm building a custom grammar based on the
input that happens to appear. I'm taking advantage of one of the beautiful
things about PEG grammars - that they can be created and run at run time
and therefore tailored to specific use cases.
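The run-time rule creation Rocky describes can be sketched in a few lines. This is illustrative only, not the actual decompiler code: scan the opcode stream once, and emit a build_list_<n> rule only for each arity n that actually appears.

```python
# Sketch of creating grammar rules at run time, tailored to the input:
# only the BUILD_LIST arities actually seen get a rule.

def build_list_rule(n: int) -> str:
    """Return a BNF-style rule string for a list-build of arity n."""
    rhs = " ".join(["expr"] * n)
    return f"build_list_{n} ::= {rhs} BUILD_LIST_{n}"

opcodes = ["LOAD_CONST", "LOAD_CONST", "BUILD_LIST_2",
           "LOAD_CONST", "LOAD_CONST", "LOAD_CONST", "BUILD_LIST_3"]
arities = sorted({int(op.rsplit("_", 1)[1])
                  for op in opcodes if op.startswith("BUILD_LIST_")})
rules = [build_list_rule(n) for n in arities]
# rules[0] == "build_list_2 ::= expr expr BUILD_LIST_2"
```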
Post by Lukas Atkinson
Trying to directly parse an assembly-like notation back into an
AST-like structure is not a normal use for a parser, and you will have
difficulties with things like jumps, loops, and dup instructions.
Although I need to redo how control flow is done, in fact there has been
a decompiler for Python for a decade and a half which works off of an Earley
algorithm parser. The deparser that I've taken over currently works for
Python from version 1.3, circa 1995, to the present 3.7. If one were always
to stay within "normal use", progress would be seriously impeded. I'm
imagining someone telling Benjamin Franklin that flying a kite in a
thunderstorm is not the "normal use" of electricity.
At any rate, you can read about this novel use of deparsing in
http://rocky.github.io/Deparsing-Paper.pdf
Post by Lukas Atkinson
This is a case where not using Marpa and hand-rolling a program will be
much easier (possibly using Marpa to parse the disassembled opcodes into a
data structure, but that sounds like overkill). This program would mostly
manipulate AST fragments on a stack, and can easily deal with
variable-arity operations like BUILD_LIST_N (by popping the top n
expressions from the output stack).
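The stack-based alternative Lukas describes can be sketched in a few lines. This is an illustrative toy, not any existing deparser's code: push a source-text fragment per load, and on a variable-arity BUILD_LIST_<n> pop the top n fragments and push the combined list expression.

```python
# Toy stack-machine deparser: handles LOAD_CONST and BUILD_LIST_<n>.
# Fragments are source-text strings rather than real AST nodes.

def deparse(opcodes):
    stack = []
    for op, arg in opcodes:
        if op == "LOAD_CONST":
            stack.append(repr(arg))
        elif op.startswith("BUILD_LIST_"):
            n = int(op.rsplit("_", 1)[1])
            items = stack[len(stack) - n:]   # top n fragments, in order
            del stack[len(stack) - n:]
            stack.append("[" + ", ".join(items) + "]")
    return stack

result = deparse([("LOAD_CONST", 1), ("LOAD_CONST", 2),
                  ("LOAD_CONST", 3), ("BUILD_LIST_3", None)])
# result == ["[1, 2, 3]"]
```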
Actually, that's what the competing deparsers do. They are not as good
though :-)
Post by Lukas Atkinson
Post by Rocky
To be clear on one thing: it's not simply that there are 1,000 states
before the reduction, but that at each scan step, all of them get
updated. So we have quadratic behaviour: after seeing the second expr we update
1 prior state, then on seeing another expr we update 2 prior states, and
then 3 prior states and so on. So in general we'll have n*(n-1)/2 updates.
Ick.
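A quick sanity check on the arithmetic: the updates described above form the sum 1 + 2 + ... + (n-1), whose closed form is n*(n-1)/2, i.e. quadratic in n.

```python
# For an n-operand build_list rule, the k-th expr advances the dot in
# k earlier Earley items, so total work is sum(1..n-1) = n*(n-1)/2.

def total_updates(n: int) -> int:
    return sum(range(1, n))

assert total_updates(1000) == 1000 * 999 // 2  # 499,500 updates for n = 1000
```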
Post by Rocky
Here's a parsing oddity that came up in doing this Python
decompiler, and I wonder if there's a good solution for it.
Python bytecode, like a number of other bytecodes, is postfix
(reverse Polish), so the operator comes after all of the operands are seen.
To handle long lists of data in Python you can write
x = [1, 2, 3, 10, 20, ... ]
and the bytecode in some versions of Python might be
LOAD_CONST 1
LOAD_CONST 2
LOAD_CONST 10
LOAD_CONST 20
...
BUILD_LIST_1000
In pathological situations these lists can get arbitrarily long.
The build-list opcode has a count of how many operands it expects, which
above I fold into the nonterminal. There is a grammar rule that gets
created at runtime:
expr ::= build_list_1000
build_list_1000 ::= expr expr expr ... BUILD_LIST_1000
Finally I get to the problem: in an Earley algorithm parser, to
recognize this there would be 1000 states just for each place where the dot
could be in parsing build_list_1000. How would one reduce the number of
intermediate states needed in parsing?
In theory there's always a solution using procedural parsing, but
this can get quite involved, I think. Above the expressions were
simple constants, but in practice an expression could be an arbitrary
expression, which is something grammars are good at capturing and that I'd
not want to have to redo writing procedural code.
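The bytecode shape described above can be checked with CPython's dis module. This sketch uses name loads rather than constants, because some CPython versions may fold a constant list into a tuple plus LIST_EXTEND; with names, the compiler must emit one load per element followed by a single BUILD_LIST carrying the element count.

```python
# Verify that a list display compiles to per-element loads followed by
# one BUILD_LIST whose argument is the operand count.
import dis

code = compile("x = [a, b, c]", "<example>", "exec")
build = next(i for i in dis.get_instructions(code)
             if i.opname == "BUILD_LIST")
assert build.arg == 3  # the operand count is carried on the opcode
```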
--
You received this message because you are subscribed to the Google
Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send
For more options, visit https://groups.google.com/d/optout.
Ron Savage
2018-06-24 06:57:58 UTC
Permalink
Hi Rocky

Your Python grammar rules don't define NUMBER, OFFSET, etc. What are they?
Rocky
2018-06-24 09:42:43 UTC
Permalink
Post by Ron Savage
Hi Rocky
Your Python grammar rules don't define NUMBER, OFFSET, etc. What are they?
Anything in all capitals is a terminal symbol and is defined in the scanner.

<https://github.com/rocky/python3-trepan/blob/master/trepan/processor/parse/scanner.py#L110-L129>

Sorry for not being clear on this previously.
Ron Savage
2018-06-24 23:26:37 UTC
Permalink
Rocky

Thanx.