lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Practical usages of arbitrary Shingles when using a query parser?
Date Tue, 31 Jul 2018 15:51:43 GMT

: The query parser is confused by these overlapping positions indeed, which
: it interprets as synonyms. I was going to write that you should set the

Sure -- i'm not blaming the QueryParser, what it does with the
Shingles output makes sense (and actually works! .. just not as efficiently
as possible).  I'm trying to figure out how to make the ShingleFilter
output more useful in the query time analyzer use case.

: it interprets as synonyms. I was going to write that you should set the
: same min and max shingle sizes at query time, but while writing that I
: realized that you probably wanted to keep outputting shorter shingles so
: that a phrase query on 2 terms with a max shingle size of 3 would still use

Yes exactly ... if at index time you output both unigrams and shingles of
sizes 2-5, and at query time you have a "phrase" of only 2 words, ideally
the filter should output a simple Token so you can make a single TermQuery
-- likewise if you have a phrase of 3 words, or 4 words, or 5 words,
those should ideally all produce single tokens.
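
A minimal sketch of why that would work, in plain Java with no Lucene dependency. The class and method names are hypothetical and this is not how ShingleFilter is implemented; it only models "unigrams plus shingles of sizes min..max" on a list of words, to show that a short query phrase already exists in the index as one token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not ShingleFilter and uses no Lucene classes.
class IndexShingleSketch {

    // Emit every unigram plus every shingle of minSize..maxSize consecutive words.
    static List<String> indexTokens(List<String> words, int minSize, int maxSize) {
        List<String> tokens = new ArrayList<>(words); // the unigrams
        for (int size = minSize; size <= maxSize; size++) {
            for (int start = 0; start + size <= words.size(); start++) {
                tokens.add(String.join(" ", words.subList(start, start + size)));
            }
        }
        return tokens;
    }

    // True when the whole phrase exists as one indexed token, i.e. a single
    // exact-term lookup would be enough to find it.
    static boolean matchesAsSingleTerm(List<String> docWords, List<String> phrase,
                                       int minSize, int maxSize) {
        String joined = String.join(" ", phrase);
        return indexTokens(docWords, minSize, maxSize).contains(joined);
    }
}
```

With sizes 2-5, any in-order phrase of 2 to 5 words from the document is found this way, which is the case a single TermQuery could serve.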

Your suggestion of "same min & max at query time" where min=max=X is
something i briefly considered, but that means you're only optimizing the
"phrases" of length "X", all shorter phrases just use unigrams, and in
fact there is no point in building shingles of any size other than X at
index time.
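
To make the limitation concrete, here is a small sketch of a fixed-size shingler in plain Java. The names are hypothetical and it assumes a positive size x; it is not the real filter, just a model of "emit shingles of exactly X words".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; models min == max == x on a plain word list.
class FixedSizeShingleSketch {

    // Emit only shingles of exactly x consecutive words (x is assumed to be >= 1).
    // Inputs shorter than x produce no shingle at all.
    static List<String> fixedShingles(List<String> words, int x) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + x <= words.size(); start++) {
            out.add(String.join(" ", words.subList(start, start + x)));
        }
        return out;
    }
}
```

With x = 3, a 2-word phrase yields an empty list, so the query has nothing but unigrams to work with, and any 2-, 4- or 5-word shingles written at index time would never be looked up.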

: shingles? Maybe 'outputUnigramsIfNoShingles' should really be something
: like 'outputShinglesOfTheMaximumSizeOnly'?

That's what i was thinking -- but i haven't dug into the code enough to 
understand how complex that would be. (i was starting with "Am i missing 
something about how/why this shouldn't/doesn't already exist?")

: For the record, in addition to the problems that you mentioned,
: ShingleFilter proved very hard to be fixed in order to work correctly on
: top of synonyms when X != Y[1], which encouraged Alan to work on a new
: FixedShingleFilter[2] that deals with index-time synonyms (ie. ignores

Yeah ... i can't even imagine the complexity of dealing with "graph" based
synonyms and shingles (didn't read your link for fear of my own sanity)

: position length) just fine but only allows X == Y. Also instead of feeding
: an analyzer with shingles to the query parser, we found it more
: user-friendly to add an option to text fields in order to index 2-shingles
: into a separate field and redirect phrase queries to it.[3] We did

Right ... i'm actually looking at a system now that puts uni-shingles,
bi-shingles, and tri-shingles in 3 different fields, and then pre-parses the
input to figure out how long it is to decide which field to query ... i'm
trying to simplify that.
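
For illustration, the kind of pre-parsing described above might look roughly like the following. The field names are placeholders, not taken from the actual system, and sending anything other than 2 or 3 words to the unigram field is an assumption made for this sketch.

```java
// Hypothetical illustration only; field names and the fallback rule are assumptions.
class ShingleFieldRouterSketch {

    static final String UNIGRAM_FIELD = "text_uni";
    static final String BIGRAM_FIELD = "text_bi";
    static final String TRIGRAM_FIELD = "text_tri";

    // Count whitespace-separated words; blank input counts as zero words.
    static int wordCount(String input) {
        String trimmed = input.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Pick the field whose shingle size equals the phrase length.
    static String fieldFor(String input) {
        switch (wordCount(input)) {
            case 2:
                return BIGRAM_FIELD;
            case 3:
                return TRIGRAM_FIELD;
            default:
                // Assumption: everything else is queried against the unigram field.
                return UNIGRAM_FIELD;
        }
    }
}
```

The routing itself is simple, but it lives outside the analyzer, and every field and size has to be kept in sync by hand.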

Ideally what I'd like to be able to say is "give me a phrase": if the
field is configured w/o any shingles at all it will work fine (via
PhraseQuery), but if the analyzer is configured with shingles it will be
even faster (via TermQuery) if/when the query phrase is "shorter" than
the max shingle length.
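
A sketch of that decision in plain Java. It returns a text description of the query rather than building real Lucene objects. The names are hypothetical, a non-empty phrase is assumed, and "shorter" is read here as "no longer than the max shingle size", following the sizes 2-5 example earlier in this message.

```java
import java.util.List;

// Hypothetical illustration only; returns a description, not a Lucene Query.
class PhrasePlanSketch {

    // maxShingleSize <= 1 stands for "no shingles configured on this field".
    // Assumes phrase holds at least one word.
    static String plan(List<String> phrase, int maxShingleSize) {
        String joined = String.join(" ", phrase);
        boolean fitsInOneToken = phrase.size() == 1 || phrase.size() <= maxShingleSize;
        if (fitsInOneToken) {
            // The whole phrase is available as a single indexed token.
            return "TermQuery(" + joined + ")";
        }
        // Assumption: no shingles, or a phrase longer than the largest shingle,
        // falls back to a positional phrase match on unigrams.
        return "PhraseQuery(" + joined + ")";
    }
}
```

The caller passes the same phrase either way; only the field's configuration changes which query comes back.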


-Hoss
http://www.lucidworks.com/
