lucene-java-user mailing list archives

From Shai Erera
Subject Re: Searching doubt
Date Tue, 04 Aug 2009 10:56:10 GMT
If you don't know which tokens you'll face, then it's really a much harder
problem. If you know where the token is, e.g. it's always in
<here will be the token to break>/index.html, then it eases the task a bit.
Otherwise you'll need to search every single token produced. I can think of
several ways to break "aboutus" into "about us", or any other sequence for
that matter:

1) Break it into "a boutus", "ab outus" ... "about us", "aboutu s", and index
all of them at the same position. This is expensive, though, so I'd recommend
it only if you know where this token is located (otherwise it will explode your term

2) Use a dictionary (real dictionary), and search it for every substring,
e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there.
This needs some fine tuning, like checking if the rest is also a word and if
the full string is also a word, so that you don't break up meaningful words.
You'll need to get a dictionary for that.
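
To make the two options concrete, here is a minimal sketch in plain Java. It is not Lucene code and not anything from the thread: the class name TokenSplitter, the method names allSplits and dictionarySplits, and the tiny in-memory word set are all hypothetical, and a real setup would load a proper dictionary and wire the result into the analysis chain. Option 2 is simplified here to accept a split only when both halves are dictionary words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenSplitter {

    // Option 1: every two-way split of the token ("a boutus" ... "aboutu s").
    // A token of length n yields n - 1 candidate pairs.
    static List<String[]> allSplits(String token) {
        List<String[]> splits = new ArrayList<>();
        for (int i = 1; i < token.length(); i++) {
            splits.add(new String[] { token.substring(0, i), token.substring(i) });
        }
        return splits;
    }

    // Option 2 (simplified): keep only the splits where both halves are
    // dictionary words, and leave the token alone if the full string is
    // itself a word.
    static List<String[]> dictionarySplits(String token, Set<String> dictionary) {
        List<String[]> splits = new ArrayList<>();
        if (dictionary.contains(token)) {
            return splits; // don't break up a meaningful word
        }
        for (String[] candidate : allSplits(token)) {
            if (dictionary.contains(candidate[0]) && dictionary.contains(candidate[1])) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Stand-in for a real dictionary (assumption for illustration only).
        Set<String> dictionary = new HashSet<>(Arrays.asList("a", "about", "bout", "us"));
        for (String[] parts : dictionarySplits("aboutus", dictionary)) {
            System.out.println(parts[0] + " " + parts[1]); // prints "about us"
        }
    }
}
```

Note the cost difference: allSplits always produces length - 1 pairs per token, which is why option 1 gets expensive when applied to every token, while dictionarySplits usually produces zero or one.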

The key though - do you know exactly where this token is? Otherwise, every
solution will be a killer to performance.


On Tue, Aug 4, 2009 at 12:59 PM, m.harig wrote:

> Thanks,
> I've noticed that, but the code is for known tokens. How do I do it for
> dynamic tokens? I mean, I don't know the URLs; someone else picks the URLs
> and I index them. Is there any technique to use while indexing? I am using
> Lucene 2.4.0. Please advise.