lucene-solr-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Sub-Sequence token filter
Date Fri, 16 May 2014 19:38:23 GMT
Hi,

I don't have a system that searches on URLs, so I don't fully follow.
But I remember people using URLClassifyProcessorFactory.



On Friday, May 16, 2014 8:33 PM, Nitzan Shaked <nitzan.shaked@gmail.com> wrote:
Doesn't look like it. If I understand it correctly,
PathHierarchyTokenizerFactory will only output prefixes. I support suffixes
as well, plus the ever-so-useful "unanchored" sub-sequences. Using domains
again as an example, I can use my suggestion to query "market.ebay" and find
"www.market.ebay.com" (domains completely made up for the sake of this
example).



On Fri, May 16, 2014 at 7:53 PM, Ahmet Arslan <iorixxx@yahoo.com> wrote:

> Hi Nitzan,
>
> Can't you do what you described with PathHierarchyTokenizerFactory?
>
>
> http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html
>
> Ahmet
>
>
>
>
>
> On Friday, May 16, 2014 5:13 PM, Nitzan Shaked <nitzan.shaked@gmail.com>
> wrote:
> Hi list
>
> I created a small token filter which I'd gladly "contribute", but want to
> know if there's any interest in it before I go and make it pretty, add
> documentation, etc... ;)
>
> I originally created it to index domain names: I wanted to be able to
> search for "google.com" and find "www.google.com", "ads.google.com",
> "mail.google.com", etc.
>
> What it does is split a token (in my case -- according to "."), and then
> output all sub-sequences. So "a.b.c.d" will output "a", "b", "c", "d",
> "a.b", "b.c", "c.d", "a.b.c", "b.c.d", and "a.b.c.d". I use it only in the
> "index" analyzer, and so am able to specify any of the generated tokens to
> find the original token.
>
> It has the following arguments:
>
> sepRegexp: regular expression that the original token will be split
> according to. (I use "[.]" for domains)
> glue: string that will be used to join sub-sequences back together (I use
> "." for domains)
> minLen: minimum generated sub-sequence length
> maxLen: maximum generated sub-sequence length (0 for unlimited, negative
> numbers for token length minus specified amount)
> anchor: "start" to only output prefixes, "end" to only output suffixes, or
> "none" to output any sub-sequence
>
> So... is this useful to anyone?
>
>
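
The sub-sequence generation described in the original message can be
sketched roughly as follows. This is not the filter from the thread (its
source was not posted), only a plain-Python illustration of the idea. The
function name sub_sequences and the snake_case parameter names are made up
for this sketch, and it assumes that minLen and maxLen count split parts
rather than characters, which the message does not spell out.

```python
import re


def sub_sequences(token, sep_regexp=r"[.]", glue=".",
                  min_len=1, max_len=0, anchor="none"):
    """Return contiguous sub-sequences of the parts of `token`.

    Hypothetical helper mirroring the arguments listed in the thread.
    Assumption: lengths are measured in parts, not characters.
    """
    # Split the token and drop empty parts (e.g. from a trailing separator).
    parts = [p for p in re.split(sep_regexp, token) if p]
    n = len(parts)

    # max_len > 0: explicit cap; 0: unlimited; negative: n minus that amount.
    upper = min(max_len, n) if max_len > 0 else n + max_len

    out = []
    for length in range(max(min_len, 1), upper + 1):
        for start in range(n - length + 1):
            if anchor == "start" and start != 0:
                continue  # keep prefixes only
            if anchor == "end" and start + length != n:
                continue  # keep suffixes only
            out.append(glue.join(parts[start:start + length]))
    return out


if __name__ == "__main__":
    print(sub_sequences("a.b.c.d"))
    print(sub_sequences("a.b.c.d", anchor="end"))
    print("market.ebay" in sub_sequences("www.market.ebay.com"))
```

With anchor="none", a query term such as "market.ebay" matches because it
is one of the tokens generated at index time for "www.market.ebay.com";
with anchor="start" or anchor="end" only the prefixes or suffixes are
generated.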

