Quantcast

Lexical/Syntactical parsing vs PEG's combined parsing

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Lexical/Syntactical parsing vs PEG's combined parsing

Eliah
Dear Mathias,

Are there any performance advantages in parsing source code of C-like languages(Java, for example) in two phases(lexical parsing first, then syntactical) instead of parsing in one phase, which is ordinary for PEG grammar.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Lexical/Syntactical parsing vs PEG's combined parsing

mathias
Administrator
Hi Eliah,

this question is not that easy to answer.
It all depends on the amount of backtracking a PEG parser needs to do for "normal/ordinary" input of the language.

Java, for example, is quite PEG-friendly in that normal Java code requires very little backtracking. The parboiled JavaParser for example does about 10% backtracking, i.e. about 10% of all rule evaluations are re-matches of rules that very already evaluated previously.
This means that the PEG parser runs about 10% longer than with perfect memory and no backtracking.

The markdown parser used in pegdown for example does much more backtracking (about 50%). This means that a perfect-memory parser that would not need to backtrack would run about twice as fast.

From what I read about other peoples experiences with parsing C with PEG, C is very PEG unfriendly and can cause an enormous amount of backtracking, causing the parser to run many, many times longer than without backtracking, sometimes even not really finishing at all (because of exponential runtime).

If you do separate parsing in two phases (Lexing and Token parsing), this gives you two advantages:
1. Improved performance for all backtracking cases, since the lexer never backtracks and always creates the tokens in one go, whereas as normal PEG parser would have to reparse on the character level in all backtracking cases.
2. Increased look-ahead range. This is irrelevant for PEGs since they dont rely on look-ahead, however, "normal" LL(x) parsers need look-ahead for their operation, and look-ahead on the character level is quite short-sighted.

IMHO it is the second point that caused separation of lexing and parsing to appear in the first place many, many years ago when computing resources were limited and non-backtracking LL(x) parsers where the only way to parse input with reasonable speed.

Of cause the separation of lexing and token parsing does come with a cost: complexity. You need to specify a lexer in addition to a parser. Both work inherently differently. There are now token objects that need to be managed and so on. Your parser is separated from the actual input by an additional layer. Error recovery is more difficult and somewhat limited (e.g. you cannot correct typos in keywords).

So, as you see, there is no short answer... :)

Cheers,
Mathias


---
[hidden email]
http://www.parboiled.org

On 13.04.2011, at 11:24, Eliah [via parboiled users] wrote:

> Dear Mathias,
>
> Are there any performance advantages in parsing source code of C-like languages(Java, for example) in two phases(lexical parsing first, then syntactical) instead of parsing in one phase, which is ordinary for PEG grammar.
>
> If you reply to this email, your message will be added to the discussion below:
> http://users.parboiled.org/Lexical-Syntactical-parsing-vs-PEG-s-combined-parsing-tp2815281p2815281.html
> To start a new topic under parboiled users, email [hidden email]
> To unsubscribe from parboiled users, click here.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Lexical/Syntactical parsing vs PEG's combined parsing

Gavin
Mathias,

Pardon me for hijacking the question, but I want to get some clarifications regarding your reply.

You mention that Java is PEG-friendly while C is not so. What attributes of a grammar would make it
unsuitable for parboiled ?

From my study, the term "Packrat parsing" is an implementation technique for parsing "PEG" grammars.
When you say "Java is PEG-friendly", are you using PEG and the term "packrat parsing" interchangeably ?
Since packrat parsing guarantees that the parsing will terminate in linear time;  and if you say that
some languages can cause parboiled to run in exponential time, doesn't that imply that parboiled is
not using packrat parsing ? I'm confused & hope you can clarify.

Thanks for a high quality project.

Regards,

Gavin
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Lexical/Syntactical parsing vs PEG's combined parsing

mathias
Administrator
> You mention that Java is PEG-friendly while C is not so. What attributes of a grammar would make it
> unsuitable for parboiled ?

Most real-world grammars do not cause a PEG parser to exhibit a large amount of backtracking on regular input.
If you have a language that has a lot of elements that look identical from the "left side", i.e. where there are long common prefixes between them and the decision whether the element is of type A or of type B is only possible by parsing them to the right end, this will cause a lot of backtracking thereby making PEGs less suitable for the language.

> From my study, the term "Packrat parsing" is an implementation technique for parsing "PEG" grammars.

Yes, PEGs do not require packrat parsing. A PEG parser may use packratting or may not. Apart from the runtime and memory consumption characteristics there is no difference between the two.

> When you say "Java is PEG-friendly", are you using PEG and the term "packrat parsing" interchangeably ?

No, by PEG-friendly I mean that the amount of backtracking in a non-packratting parser is limited, i.e. the appeal of using packrat parsing is small or even negative.

> Since packrat parsing guarantees that the parsing will terminate in linear time;  and if you say that
> some languages can cause parboiled to run in exponential time, doesn't that imply that parboiled is
> not using packrat parsing ?

Correct, parboiled does not do packrat parsing. It can use a simple memoization technique (check the last paragraph of https://github.com/sirthias/parboiled/wiki/Parsing-Performance-Tuning), but it chooses to concentrate on properly supporting the large majority of real-world languages for which packrat parsing does not have any benefits.

Hopefully this clarifies things somewhat.

Cheers,
Mathias

---
[hidden email]
http://www.parboiled.org

On 14.04.2011, at 15:11, Gavin [via parboiled users] wrote:

> Mathias,
>
> Pardon me for hijacking the question, but I want to get some clarifications regarding your reply.
>
> You mention that Java is PEG-friendly while C is not so. What attributes of a grammar would make it
> unsuitable for parboiled ?
>
> From my study, the term "Packrat parsing" is an implementation technique for parsing "PEG" grammars.
> When you say "Java is PEG-friendly", are you using PEG and the term "packrat parsing" interchangeably ?
> Since packrat parsing guarantees that the parsing will terminate in linear time;  and if you say that
> some languages can cause parboiled to run in exponential time, doesn't that imply that parboiled is
> not using packrat parsing ? I'm confused & hope you can clarify.
>
> Thanks for a high quality project.
>
> Regards,
>
> Gavin
>
>
> If you reply to this email, your message will be added to the discussion below:
> http://users.parboiled.org/Lexical-Syntactical-parsing-vs-PEG-s-combined-parsing-tp2815281p2820232.html
> To start a new topic under parboiled users, email [hidden email]
> To unsubscribe from parboiled users, click here.

Loading...