Recursive garmmar rules cause StackOverflowError

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Recursive garmmar rules cause StackOverflowError

Marcus
Hi,
I am trying to parse wiki text with nested lists in this form:

* Item 1
* Item 2
** Item 2.1
** Item 2.2

** This sentence is not part of the list; it is just bold. **


My problem is that I keep running into StackOverflowErrors at the construction phase and I can't see how to avoid it.
This is an example of what I've been trying:

    public final static int MAX_NEST_LEVEL = 5;

    public Rule List(int level) {
        return Sequence(
                level <= MAX_NEST_LEVEL,
                Test(ListStart(level)),
                push(new BulletList()),
                OneOrMore(ListItem(level), addChild()));
    }

    public Rule ListStart(int level) {
        return Sequence(
                Optional(HorizontalWhitespaces()),
                ListMarker(level),
                TestNot('*'));
    }

    public Rule ListItem(int level) {
        return Sequence(
                Optional(HorizontalWhitespaces()),
                ListMarker(level),
                push(new ListItem()),
                FirstOf(
                        SubList(level),
                        RowOfInlineElements()));
    }

    public Rule ListMarker(int level) {
        return String(StringUtils.repeat('*', level));
    }

    public Rule SubList(int level) {
        return Sequence(List(level + 1), addChild());
    }


Please could you explain how to handle or rewrite recursive grammar rules like this?

Many thanks,
Marcus
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Recursive garmmar rules cause StackOverflowError

sulfinu
This is the same problem as described here:
https://github.com/sirthias/parboiled/wiki/Indentation-Based-Grammars
with the difference that in your case it's the count of '*' that matters, not tabs. While you could apply that solution to resolve a non-context free grammar with a PEG parser, please read on.

Specifically in your case, Parboiled instrumentation goes into an infinite loop (thus throwing a StackOverflowError) because actions are NOT executed when during this stage. They get evaluated only when parsing input, found that out on my own.
So your safeguard measure "ACTION(level <= MAX_NEST_LEVEL)" is useless. And needless, see below.

What you can do to get rid of infinite looping caused by this line
Sequence(List(level + 1), addChild())
is to get rid firstly of the level parameter in your rules! Parboiled rule caching will take care of recursion then. The trick (more like a cheat, from a theoretical point of view) is to wrap the starting markers "   ***...*" in a single rule that saves via an ACTION() the nesting level in a parser class member field. You'll also need a member field for the current nesting level, initialized with 0 or 1 or whatever your convention is.
Inside a list you always check in an ACTION() that the parsed starting marker has the same level as the current level. At the begining of a sublist, increment the current level with an ACTION(). At the end of a list, decrement the current level with an ACTION().
This is a tested approach :)
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Recursive garmmar rules cause StackOverflowError

Marcus
Thanks sulfinu! your trick works and my infinite loops are now solved, but it introduces problems when parsing edge cases, like this tricky Creole text (Creole uses double-asterisk markup for bold text and list nesting can only increase by one level at a time):

*** Not a list item **
* ** Item 1 **
**** Item 1.1 **

According to the creole spec I think the rendered result of the above wiki text should look like this:

<p><strong>* Not a list item</strong></p>
<ul>
<li><strong>Item 1</strong>
<ul>
<li><strong>Item 1.1</strong></li>
</ul>
</li>
</ul>

My problem now is that the whole sequence of '****..' characters at the start of a line is eaten by my matching rules, which are like this:

    OneOrMore('*'),
    saveNestingLevel(matchLength()),

Is there a way to avoid eating all of the asterisks at the start of the line? Can I match "Up to N" characters, where N is the current nesting level?

Or would the indentation based grammars be more approptiate here?

Thanks for your help!
Marcus
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Recursive garmmar rules cause StackOverflowError

sulfinu
There's no way you can count characters in the grammar itself.
But you can deal with these cases in the saveNestingLevel() method, by treating differences in nesting level of 2 as bold text markers - just put that information in another class member and remember to test that the bold mark is closed in the same line.

If bold markers can get nested themselves, well, it's even more complicated, but the same approach applies ;)
Be sure to accept this bold markers only as long as you can distinguish between legal extra * and plain syntactic mistakes...
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Recursive garmmar rules cause StackOverflowError

Marcus
This might be a hack; I have been trying to use this technique to count characters in the grammar:


   public Rule ListMarker(final char marker) {
        CountedCharacterMatcher markerMatcher = new CountedCharacterMatcher() {
            @Override
            public char getCharacter() {
                return marker;
            }

            @Override
            public int getCount() {
                return currentListLevel;
            }
        };

        return Sequence(
                Optional(HorizontalWhitespaces()),
                markerMatcher);
    }


The CountedCharacterMatcher class is derived from CustomMatcher and I think this approach can replace the saveNestingLevel action, because the ListMarker rule will only match if there are a correct number of list markers for the current nesting level.

The benefit is I don't need special cases to handle bold text markers.  Are there any downsides to using a CustomMatcher like this?

Thanks, Marcus
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Recursive garmmar rules cause StackOverflowError

sulfinu
If it works then you're fine. I, for one, haven't gone that deep into Parboiled (yet).
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Recursive garmmar rules cause StackOverflowError

mathias
Administrator
In reply to this post by Marcus
> Is there a way to avoid eating all of the asterisks at the start of the line? Can I match "Up to N" characters, where N is the current nesting level?

Yes, you can.
You will have to use parser (predicate) actions to do this if the matching rule is dynamic and not static, i.e. you don't know what N is at the time of parser creation but only during run time.

Check out this rule:

    public Rule Heading() {
        Var<Int> count = new Var<Int>();
        return Sequence(
            OneOrMore("*"),
            count.set(context.getMatchLength()),
            OneOrMore(TestNot("*"), ANY),
            OneOrMore("*"),
            context.getMatchLength() == count.get()
        );
    }

The idea is to first match a number of stars and save the count of stars in an action variable, so you can later verify that the second bunch of stars has the same count.

Note that in this particular case you could also use this much more elegant solution not requiring parser actions (since the N is not arbitrary but must match a previous number of matches):

    public Rule Heading() {
        return Sequence('*', InnerHeading(), '*');
    }

    public Rule InnerHeading() {
        return FirstOf(
            Heading(),
            OneOrMore(TestNot("*"), ANY),            
        );
    }

Cheers,
Mathias

---
[hidden email]
http://www.parboiled.org

On 05.08.2011, at 03:25, Marcus [via parboiled users] wrote:

> Thanks sulfinu! your trick works and my infinite loops are now solved, but it introduces problems when parsing edge cases, like this tricky Creole text (Creole uses double-asterisk markup for bold text and list nesting can only increase by one level at a time):
>
> *** Not a list item **
> * ** Item 1 **
> **** Item 1.1 **
>
> According to the creole spec I think the rendered result of the above wiki text should look like this:
>
> <p><strong>* Not a list item</strong></p>
> <ul>
> <li><strong>Item 1</strong>
> <ul>
> <li><strong>Item 1.1</strong></li>
> </ul>
> </li>
> </ul>
>
> My problem now is that the whole sequence of '****..' characters at the start of a line is eaten by my matching rules, which are like this:
>
>     OneOrMore('*'),
>     saveNestingLevel(matchLength()),
>
> Is there a way to avoid eating all of the asterisks at the start of the line? Can I match "Up to N" characters, where N is the current nesting level?
>
> Or would the indentation based grammars be more approptiate here?
>
> Thanks for your help!
> Marcus
>
> If you reply to this email, your message will be added to the discussion below:
> http://users.parboiled.org/Recursive-garmmar-rules-cause-StackOverflowError-tp3213553p3227273.html
> To start a new topic under parboiled users, email [hidden email]
> To unsubscribe from parboiled users, click here.

Loading...