Eagerness of the tokenizer/lexer

ftomassetti
Hi,
I am writing a sort of "loose" tokenizer, trying to make it work with different languages. Therefore it has some rules from Java, Ruby etc.

Now, it fails on this line:

"# In this example the &title anchor contains the "Your vehicle's title", not the map."

telling me this:

 - Invalid input '\n', expected Escape or '"' (line 18, pos 86):
# In this example the &title anchor contains the "Your vehicle's title", not the map.

I suspect it is using this rule:

    Rule CharLiteral() {
        return Sequence(
                '\'',
                OneOrMore(FirstOf(Escape(), Sequence(TestNot(AnyOf("'\\")), ANY)).suppressSubnodes()),
                '\''
        );
    }

And I do not understand why, because to get past the # character it should have used this rule:

    @SuppressNode
    Rule Spacing() {
        return OneOrMore(FirstOf(

                // traditional comment
                Sequence("/*", ZeroOrMore(TestNot("*/"), ANY), "*/"),

                // end of line comment
                Sequence(
                        "//",
                        ZeroOrMore(TestNot(AnyOf("\r\n")), ANY),
                        FirstOf("\r\n", '\r', '\n', EOI)
                ),

                // hash comment (e.g. Ruby)
                Sequence(
                        "#",
                        ZeroOrMore(TestNot(AnyOf("\r\n")), ANY),
                        FirstOf("\r\n", '\r', '\n', EOI)
                ),

                // whitespace
                OneOrMore(AnyOf(" \t\r\n\f").label("Whitespace"))
        ));
    }

That rule should eat everything up to the newline.
Do you have any suggestions? Could I enable some debugging facilities to find out what it is trying to do? Should I configure it to be "eager" and eat more characters, maybe?

Re: Eagerness of the tokenizer/lexer

mathias
Administrator
I’d suspect that your `Spacing` rule is not the only place where the `#` character is matched.

You can try using the `TracingParseRunner` to better understand how the parser matches the input.
It usually makes sense to minimise the input as much as possible first, since the trace log can generate quite a bit of output.
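
To make the idea concrete without pulling in the library, here is a rough, library-free sketch of what a trace over a minimised input can reveal. Nothing in it is parboiled API: `TraceSketch`, its `Alt` interface and especially the `Symbol` alternative are hypothetical, and the rule order is an assumption chosen to show how an earlier alternative can claim `#` before a comment rule is ever tried.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Library-free illustration of an ordered-choice trace; none of this is parboiled API. */
final class TraceSketch {

    /** One alternative: returns the index just past the match, or -1 if it does not match. */
    interface Alt {
        int match(String in, int pos);
    }

    /** Hypothetical operator rule that also accepts '#'; the assumed culprit. */
    static int symbol(String in, int pos) {
        return "#&*".indexOf(in.charAt(pos)) >= 0 ? pos + 1 : -1;
    }

    /** '#' followed by everything up to, but not including, the line break. */
    static int hashComment(String in, int pos) {
        if (in.charAt(pos) != '#') return -1;
        int i = pos + 1;
        while (i < in.length() && in.charAt(i) != '\n' && in.charAt(i) != '\r') i++;
        return i;
    }

    static int whitespace(String in, int pos) {
        int i = pos;
        while (i < in.length() && " \t\r\n\f".indexOf(in.charAt(i)) >= 0) i++;
        return i > pos ? i : -1;
    }

    static int word(String in, int pos) {
        int i = pos;
        while (i < in.length() && Character.isLetter(in.charAt(i))) i++;
        return i > pos ? i : -1;
    }

    /** Strict literal: must close on the same line (escapes omitted for brevity). */
    static int stringLiteral(String in, int pos) {
        if (in.charAt(pos) != '"') return -1;
        for (int i = pos + 1; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '\n' || c == '\r') return -1;
            if (c == '"') return i + 1;
        }
        return -1;
    }

    /** Tries the alternatives in order at each position and records which one wins. */
    static List<String> trace(String in) {
        Map<String, Alt> alts = new LinkedHashMap<>();
        alts.put("Symbol", TraceSketch::symbol); // tried first, so it wins on '#'
        alts.put("HashComment", TraceSketch::hashComment);
        alts.put("Whitespace", TraceSketch::whitespace);
        alts.put("Word", TraceSketch::word);
        alts.put("StringLiteral", TraceSketch::stringLiteral);

        List<String> log = new ArrayList<>();
        int pos = 0;
        while (pos < in.length()) {
            String winner = null;
            int end = -1;
            for (Map.Entry<String, Alt> e : alts.entrySet()) {
                end = e.getValue().match(in, pos);
                if (end >= 0) { winner = e.getKey(); break; }
            }
            if (winner == null) {
                log.add(pos + ": no alternative matched");
                break;
            }
            log.add(pos + ": " + winner);
            pos = end;
        }
        return log;
    }

    public static void main(String[] args) {
        // Minimised input: a hash comment containing an unclosed double quote.
        trace("# a \"b").forEach(System.out::println);
    }
}
```

With this rule order the first log entry names `Symbol` rather than `HashComment`, so the rest of the comment is tokenized as ordinary input and the run stops at the unclosed quote. A real trace is far more verbose, which is why a short input helps.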

HTH and cheers,
Mathias

---
http://www.parboiled.org

(In reply to ftomassetti's post of 5 Mar 2014, 11:47.)

Re: Eagerness of the tokenizer/lexer

ftomassetti
The problem was an unclosed string literal that ate up a bunch of
characters and left the parser a bit confused about what to do with
the rest.

Given that I am trying to build an unconventional, loose tokenizer, I
added a rule that either consumes a string literal opened and closed
on the same line or, if the closing quote is not on the same line,
consumes the opening character alone.
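
The actual rule is not shown in the thread. As a minimal sketch of the behaviour described, written in plain Java rather than as parboiled rules so that it stands alone, it could look roughly like this. The class `LooseString` and its `match` method are hypothetical names, and the escape handling is an assumption.

```java
/** Sketch of a "loose" string rule; plain Java, not parboiled rules. */
final class LooseString {

    static boolean isLineBreak(char c) {
        return c == '\n' || c == '\r';
    }

    /**
     * Expects a double quote at pos. Returns the index just past the token:
     * the whole literal if it closes on the same line, otherwise only the
     * opening quote, so the rest of the line is tokenized normally.
     */
    static int match(String in, int pos) {
        if (pos >= in.length() || in.charAt(pos) != '"') {
            throw new IllegalArgumentException("expected '\"' at index " + pos);
        }
        for (int i = pos + 1; i < in.length(); i++) {
            char c = in.charAt(i);
            if (isLineBreak(c)) {
                break; // line ended before a closing quote was found
            }
            if (c == '\\' && i + 1 < in.length() && !isLineBreak(in.charAt(i + 1))) {
                i++; // skip the escaped character, e.g. \" inside the literal
            } else if (c == '"') {
                return i + 1; // closed on the same line: consume the whole literal
            }
        }
        return pos + 1; // fallback: consume the lone opening quote
    }

    public static void main(String[] args) {
        System.out.println(match("\"abc\" x", 0));  // literal closed on the same line
        System.out.println(match("\"abc\nd\"", 0)); // closing quote is on the next line
    }
}
```

In a PEG grammar the same effect would presumably come from an ordered choice: try the closed literal first, then fall back to the bare quote character.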

Thank you again for your help!

(In reply to mathias's post of Wed, Mar 5, 2014, 11:55 AM.)

--
Website at http://www.federico-tomassetti.it

Re: Eagerness of the tokenizer/lexer

mathias
Administrator
Glad you sorted things out!

Cheers,
Mathias

(In reply to ftomassetti's post of 7 Mar 2014, 09:55.)