huge list of Words to pick from

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

huge list of Words to pick from

dudewithkeyboard
Hi,

First of all, thank you so much for maintaining grappa. It's been a real life-saver for me.

Now to my question...

I have a semi-complicated pattern that i need to find in "normal" text. Part of that pattern is a list of specific identifications. My pattern either starts or ends with one of those identifiers. Unfortunately those identifiers can look pretty much like normal words which makes it hard to use a generic pattern to describe them (It Works but delivers too many "false-positives").

I'll prob. have to go with a massive trie() to avoid false positives. We're talking about 20k identifiers here so massive really doesn't overstate things.

Would this even be possible or would the parser complain at some point about too many "options" (the list grows as well) ?

Does anyone have a better idea ?
fge
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: huge list of Words to pick from

fge
dudewithkeyboard wrote
Hi,

First of all, thank you so much for maintaining grappa. It's been a real life-saver for me.

Now to my question...

I have a semi-complicated pattern that i need to find in "normal" text. Part of that pattern is a list of specific identifications. My pattern either starts or ends with one of those identifiers. Unfortunately those identifiers can look pretty much like normal words which makes it hard to use a generic pattern to describe them (It Works but delivers too many "false-positives").

I'll prob. have to go with a massive trie() to avoid false positives. We're talking about 20k identifiers here so massive really doesn't overstate things.

Would this even be possible or would the parser complain at some point about too many "options" (the list grows as well) ?

Does anyone have a better idea ?
Hello,

There is no reason why it shouldn't be possible, and in fact a trie is probably the best solution you can go for here.

However, what do you mean by "the list grows"? Do you mean that it grows at runtime?

Also, note two things:

* you can build that list in the constructor without a problem;
* remind that a trie will always find the LONGEST match; that is, if you have both "foo" and "foobar" in the trie and your input text at this index is "foobarbaz" then "foobar" will be matched; regardless of whether "foo" appears before or after "foobar" in the argument list, since a trie is insensitive to ordering.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: huge list of Words to pick from

dudewithkeyboard
Hi,

Thank you for your fast response !

No the list won't grow at runtime. It'll just grow over time as more identifiers are added but they are not added at runtime.

Okay, I guess I'll implement it that way then :)
fge
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: huge list of Words to pick from

fge
dudewithkeyboard wrote
Hi,

Thank you for your fast response !

No the list won't grow at runtime. It'll just grow over time as more identifiers are added but they are not added at runtime.

Okay, I guess I'll implement it that way then :)
You're welcome, however consider that there is _another_ option; if this list is so huge as to render the initialization of the parser impractical timewise, you have the option to use an action instead in your parser, which would check the correctness of the match for you:

Rule possibleIdentifier()
{
    return // whatever; a sequence of characters etc; a generic match
}

Rule identifier()
{
    return sequence(possibleIdentifier(), actualIdentifier(match()));
}

boolean actualIdentifier(final String match)
{
    // return true or false depending on whether the identifier is recognized
}


More generally, any method returning a boolean within a grammar can be used as an action in a Rule.
Loading...