how to segment user tags with an underscore, like @jsimao_71 ?


how to segment user tags with an underscore, like @jsimao_71 ?

jsimao71
Hi!

I've been using parboiled/pegdown to parse my HTML, but user tags are not segmented properly when the user name contains an underscore, such as @jsimao_71.
I'm using the following configuration:

Parser parser = Parboiled.createParser(Parser.class, Extensions.AUTOLINKS | Extensions.ABBREVIATIONS | Extensions.FENCED_CODE_BLOCKS);

String text = "@jsimao_71";

RootNode node = parser.parse(text.toCharArray());

Printing the nodes recursively gives:

RootNode [0-10]
  SuperNode [0-10]
    TextNode [0-7] '@jsimao'
    SpecialTextNode [7-8] '_'
    TextNode [8-10] '71'

Notice that the '_' was not considered as part of the word.

What is the best way to make this work as intended? Can additional "Extensions.*" flags be set in the parser configuration, or does this feature require a change in the source code? If so, what is the simplest way to do that?

Thanks in advance,
Jorge.


Re: how to segment user tags with an underscore, like @jsimao_71 ?

mathias
Administrator
Jorge,

you are not saying what you actually want to do.
pegdown parses underscores as `SpecialTextNode` instances, yes.
Why is this a problem?

Cheers,
Mathias

---
[hidden email]
http://www.parboiled.org

(Replying to jsimao71's post of 03.06.2013, 14:33.)


Re: how to segment user tags with an underscore, like @jsimao_71 ?

jsimao71
Thanks for your fast reply, Mathias!

Sorry that I was not clear enough in the first post.

Basically, I would like the underscore to be considered a "word character" rather than a separate character (i.e. to be made part of parsed words).

So the string @jsimao_71 would produce a single TextNode with the full text content (rather than two TextNodes with a SpecialTextNode in between).

I would welcome any help you could give in this regard.
If this can be done by pure configuration when the parser is created, that would be great. Otherwise, some easy fix or extension of the source code would also work fine.

Thanks in advance for your help.

Cheers,
Jorge.

Re: how to segment user tags with an underscore, like @jsimao_71 ?

mathias
Administrator
Jorge,

> Basically, I would like to have the underscore to be considered a "word
> character", rather than a separated character (i.e. be made part of parsed
> words).

Ok, since the underscore has special status in Markdown, the parser treats it specially.
Why can't your logic reading the AST pegdown produces deal with the current output?

It seems that changing the parser to produce an AST that's easier to read for your application is much harder than changing your application to be able to deal with the current AST, no?
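
As a rough illustration of handling this on the application side, here is a minimal sketch that joins neighbouring text leaves back into one word. It does not use pegdown: the `Leaf` class, its `kind` label, and the `merge` helper are hypothetical stand-ins for whatever the application builds while walking the tree, and the merge rule (join any leaves whose index ranges touch) is an assumption, not pegdown behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeTextRuns {
    // Hypothetical stand-in for a text leaf of the parsed tree (not a pegdown class).
    static final class Leaf {
        final String kind;
        final int start;
        final int end;
        final String text;

        Leaf(String kind, int start, int end, String text) {
            this.kind = kind;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // Joins each leaf to the previous one when their index ranges touch.
    // The merged leaf is always labelled "TextNode".
    static List<Leaf> merge(List<Leaf> leaves) {
        List<Leaf> out = new ArrayList<>();
        for (Leaf leaf : leaves) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).end == leaf.start) {
                Leaf prev = out.get(last);
                out.set(last, new Leaf("TextNode", prev.start, leaf.end, prev.text + leaf.text));
            } else {
                out.add(leaf);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The three leaves reported in the original post for "@jsimao_71".
        List<Leaf> leaves = Arrays.asList(
                new Leaf("TextNode", 0, 7, "@jsimao"),
                new Leaf("SpecialTextNode", 7, 8, "_"),
                new Leaf("TextNode", 8, 10, "71"));
        for (Leaf leaf : merge(leaves)) {
            // prints: TextNode [0-10] '@jsimao_71'
            System.out.println(leaf.kind + " [" + leaf.start + "-" + leaf.end + "] '" + leaf.text + "'");
        }
    }
}
```

Because `merge` works on a plain list and keeps no shared state, each thread can run it on its own list of leaves.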

Cheers,
Mathias

(Replying to jsimao71's post of 06.06.2013, 12:52.)


Re: how to segment user tags with an underscore, like @jsimao_71 ?

jsimao71
Not so, Mathias!

I'm using the Visitor in a multi-threaded environment, so that would not be so practical.

I created a CustomParser that extends Parser and overrides Parser#SpecialChar() so that the underscore '_' is no longer considered a special char. This fixes the problem.

Unfortunately, due to the implementation of Parser and/or BaseParser, I was getting a java.lang.IllegalAccessError (probably because setAccessible is not called on some members during the reflection performed while the parser is built).
So I had to copy and paste Parser as a source file into the CustomParser package, and with that the problem is fixed.
I'm still using pegdown-1.1.0.jar, so I'm not sure whether this issue is still there in a newer version.
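
To make the idea concrete without depending on pegdown, here is a toy sketch of the same technique: a tokenizer whose set of special characters comes from an overridable method, and a subclass that removes the underscore from that set. Everything in it (`ToyTokenizer`, `specialChars()`, the character set `*_` plus backtick) is a hypothetical model, not pegdown's Parser or its real SpecialChar() rule.

```java
import java.util.ArrayList;
import java.util.List;

class UnderscoreDemo {
    // Toy tokenizer (NOT pegdown's Parser): special characters become tokens of their own.
    static class ToyTokenizer {
        // Hypothetical special-character set; subclasses may narrow it.
        protected String specialChars() {
            return "*_`";
        }

        List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder word = new StringBuilder();
            String special = specialChars();
            for (char c : text.toCharArray()) {
                if (special.indexOf(c) >= 0) {
                    // Flush the pending word, then emit the special character alone.
                    if (word.length() > 0) {
                        tokens.add(word.toString());
                        word.setLength(0);
                    }
                    tokens.add(String.valueOf(c));
                } else {
                    word.append(c);
                }
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
            }
            return tokens;
        }
    }

    // Drops '_' from the special set, so it is treated as an ordinary word character.
    static class UnderscoreWordTokenizer extends ToyTokenizer {
        @Override
        protected String specialChars() {
            return super.specialChars().replace("_", "");
        }
    }

    public static void main(String[] args) {
        System.out.println(new ToyTokenizer().tokenize("@jsimao_71"));            // [@jsimao, _, 71]
        System.out.println(new UnderscoreWordTokenizer().tokenize("@jsimao_71")); // [@jsimao_71]
    }
}
```

With a real Markdown parser, it is worth checking whether taking '_' out of the special set also changes how underscore-based emphasis is recognised.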

If you have suggestions, please share them.

Thanks for helping.

Jorge.