lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
Date Tue, 01 Jul 2008 19:52:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12609694#action_12609694 ]

Steven Rowe commented on LUCENE-1320:
-------------------------------------

Hi Karl,

The classes you introduce here look interesting, but the documentation is very sparse.
  
Things I think should be addressed in the documentation:

* Where do you see these classes being used: on the query side, the indexing side, or
both?
* Where would the matrix come from in a real-world scenario?  There appear to be (at least)
three mechanisms for constructing it - which one(s) make sense where?
* What do payloads have to do with the whole thing?  (Looks like weight?  ShingleMatrixFilter.calculateShingleWeight()
should be explained at the class level - since it's public, I assume you mean for it to be
overridable?)
* The various ShingleMatrixFilter constructors should have javadoc explaining their use.
* This class's use of the new flags feature looks interesting - a discussion in the documentation
would be useful for future implementations.

A couple of random notes:

* Missing Apache license declarations: PrefixAndSuffixAwareTokenFilter.java and TestPrefixAndSuffixAwareTokenFilter.java
* Since you only use SingleTokenTokenStream in your tests, and since it likely will only ever
be used in tests, I think it should be moved from src/java/ to src/test/.
* TestShingleMatrixFilter.TokenListStream looks generally useful for testing filters - maybe
this could be pulled out as a separate class, maybe into the o.a.l.analysis.miscellaneous
package?
* On line #83 of TestShingleMatrixFilter, the first assignment to ts looks redundant and
could be removed:

{code:java}
83:   ts = tls;
84:   ts = new ShingleMatrixFilter(ts, 2, 2, null);
{code}


> ShingleMatrixFilter, a three dimensional permutating shingle filter
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1320
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column-focused matrix that creates all permutations of shingle tokens in
> three dimensions, i.e. it handles multi-token synonyms.
> Could, for instance, in some cases be used to replace 0-slop phrase queries with something
> speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains tests of varying complexity that demonstrate offsets, posincr, payload boost
> calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more than maximumShingleSize
> columns forward in the stream and clearing up unused resources (columns and unique token sets).
> Can still be optimized quite a bit though.
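
For readers without the patch applied, the permutation order in the quoted test can be reproduced with a small stand-alone sketch. It is not the ShingleMatrixFilter implementation: it works on plain strings rather than Lucene Token objects, ignores offsets, position increments, payloads and weights, and the names ShingleMatrixSketch and shingles are made up for illustration. It assumes that the rows of the first column vary fastest and that a shingle already produced is not emitted again, which is what the listed assertions suggest.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of the LUCENE-1320 patch.
class ShingleMatrixSketch {

    /**
     * matrix[column][row][token]: each column holds alternative rows, and
     * each row holds one or more tokens (for example a multi-token synonym).
     */
    static List<String> shingles(String[][][] matrix, int minSize, int maxSize) {
        // LinkedHashSet keeps first-seen order and drops duplicates (an assumption
        // about the filter's behaviour, inferred from the quoted assertions).
        Set<String> seen = new LinkedHashSet<>();
        int[] rowIndex = new int[matrix.length]; // currently selected row per column

        while (true) {
            // Flatten the selected row of every column into one token sequence.
            List<String> tokens = new ArrayList<>();
            for (int col = 0; col < matrix.length; col++) {
                for (String token : matrix[col][rowIndex[col]]) {
                    tokens.add(token);
                }
            }

            // Emit n-grams by start position, then by increasing size.
            for (int start = 0; start < tokens.size(); start++) {
                for (int size = minSize;
                     size <= maxSize && start + size <= tokens.size();
                     size++) {
                    seen.add(String.join("_", tokens.subList(start, start + size)));
                }
            }

            // Move to the next row combination; the first column varies fastest.
            int col = 0;
            while (col < matrix.length && ++rowIndex[col] == matrix[col].length) {
                rowIndex[col++] = 0;
            }
            if (col == matrix.length) {
                break; // every combination has been visited
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String[][][] matrix = {
            {{"hello"}, {"greetings", "and", "salutations"}},
            {{"world"}, {"earth"}, {"tellus"}}
        };
        for (String shingle : shingles(matrix, 2, 3)) {
            System.out.println(shingle);
        }
    }
}
```

Calling shingles(matrix, 2, 3) on the matrix from the issue description yields the twelve shingles in the same order as the assertNext calls above, from hello_world through salutations_tellus.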

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

