Quantcast

Matching a long list of strings

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Matching a long list of strings

mtruyens
Hello there!

I'm trying to create a Parboiled Scala parser that needs to recognize words out of several Lists, each containing about 100 items:

def myRule = rule { listA | listB | listC | listD | listE}

Each list is composed on the basis of a predefined dictionary of words, essentially by simply concatenating options using the | operator. For example, listA is composed as follows:

def listA = rule {
val sortedList = wordsFromA.sortWith(_.length > _.length) // sort by descending length
var result = str(sortedList(0))
for (i <- 1 until sortedList.length) result = result | str(tagsSorted(i))    
result
}

While the lists contain similar words, they do not contain overlapping words between them. Note that some lists contain very short words (e.g., one single letter) while other lists may contain longer words of up to ten letters.  

While all of this seems very simple, the parser fails to function, except if I limit to just a few lists (e.g., listA and listB) -- as soon as I add more lists, it starts failing. Also, I found that the order in which I list them (e.g., "listB | listA | listC") seems to be giving different results (of course I start with the lists with the longest words).

What am I doing wrong? Is there a maximum number of words that can be tested for? (On a side note: why is there no "FirstOf" function in the Scala version? It seems to achieve what I am trying above.)

Many thanks for your wonderful tool!
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matching a long list of strings

mathias
Administrator
The reason that you are seeing different parsing behavior when changing the order you concatenate your lists (e.g., "listB | listA | listC") is that some lists contain prefixes of items in other lists.
There are at least two solutions for this:
1. Dont just match the word itself but also match (or require with the `&` operator) a following whitespace character. This prevents a match on a mere prefix of a word.
2. Merge _all_ your lists first, then sort by descending word length, then convert into rules.

The second option will result in a more performant parser while the first one might be architecturally cleaner...

Cheers,
Mathias

---
[hidden email]
http://www.parboiled.org

On 21.02.2012, at 07:43, mtruyens [via parboiled users] wrote:

> Hello there!
>
> I'm trying to create a Parboiled Scala parser that needs to recognize words out of several Lists, each containing about 100 items:
>
> def myRule = rule { listA | listB | listC | listD | listE}
>
> Each list is composed on the basis of a predefined dictionary of words, essentially by simply concatenating options using the | operator. For example, listA is composed as follows:
>
> def listA = rule {
> val sortedList = wordsFromA.sortWith(_.length > _.length) // sort by descending length
> var result = str(sortedList(0))
> for (i <- 1 until sortedList.length) result = result | str(tagsSorted(i))    
> result
> }
>
> While the lists contain similar words, they do not contain overlapping words between them. Note that some lists contain very short words (e.g., one single letter) while other lists may contain longer words of up to ten letters.  
>
> While all of this seems very simple, the parser fails to function, except if I limit to just a few lists (e.g., listA and listB) -- as soon as I add more lists, it starts failing. Also, I found that the order in which I list them (e.g., "listB | listA | listC") seems to be giving different results (of course I start with the lists with the longest words).
>
> What am I doing wrong? Is there a maximum number of words that can be tested for? (On a side note: why is there no "FirstOf" function in the Scala version? It seems to achieve what I am trying above.)
>
> Many thanks for your wonderful tool!
>
> If you reply to this email, your message will be added to the discussion below:
> http://users.parboiled.org/Matching-a-long-list-of-strings-tp3762782p3762782.html
> To start a new topic under parboiled users, email [hidden email]
> To unsubscribe from parboiled users, click here.
> NAML

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matching a long list of strings

mathias
Administrator
In reply to this post by mtruyens
One more thing:

> On a side note: why is there no "FirstOf" function in the Scala version? It seems to achieve what I am trying above

There is! The '|' operator in parboiled for Scala is exactly what the `FirstOf` helper is on the Java side...

Cheers,
Mathias

---
[hidden email]
http://www.decodified.com

On 21.02.2012, at 07:43, mtruyens [via parboiled users] wrote:

> Hello there!
>
> I'm trying to create a Parboiled Scala parser that needs to recognize words out of several Lists, each containing about 100 items:
>
> def myRule = rule { listA | listB | listC | listD | listE}
>
> Each list is composed on the basis of a predefined dictionary of words, essentially by simply concatenating options using the | operator. For example, listA is composed as follows:
>
> def listA = rule {
> val sortedList = wordsFromA.sortWith(_.length > _.length) // sort by descending length
> var result = str(sortedList(0))
> for (i <- 1 until sortedList.length) result = result | str(tagsSorted(i))    
> result
> }
>
> While the lists contain similar words, they do not contain overlapping words between them. Note that some lists contain very short words (e.g., one single letter) while other lists may contain longer words of up to ten letters.  
>
> While all of this seems very simple, the parser fails to function, except if I limit to just a few lists (e.g., listA and listB) -- as soon as I add more lists, it starts failing. Also, I found that the order in which I list them (e.g., "listB | listA | listC") seems to be giving different results (of course I start with the lists with the longest words).
>
> What am I doing wrong? Is there a maximum number of words that can be tested for? (On a side note: why is there no "FirstOf" function in the Scala version? It seems to achieve what I am trying above.)
>
> Many thanks for your wonderful tool!
>
> If you reply to this email, your message will be added to the discussion below:
> http://users.parboiled.org/Matching-a-long-list-of-strings-tp3762782p3762782.html
> To start a new topic under parboiled users, email [hidden email]
> To unsubscribe from parboiled users, click here.
> NAML

Loading...