lucene-java-user mailing list archives

From Donna L Gresh <>
Subject RE: Is there a way for me to handle a multiword synonym correctly?
Date Fri, 07 Aug 2009 16:15:18 GMT
I have to think about this a bit, but that may work. I just have to make 
sure no "undesirable" side effects occur. I certainly want to be able to 
search for a phrase and not have it match all the individual bits, but 
that should already work using the mechanism I already have in place.
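
To check my understanding of Carl's suggestion (quoted below), here is a minimal plain-Java sketch of letter-run tokenization. It is not Lucene code: the class name `LetterTokens` and the method `tokenize` are made up for illustration, and it only assumes that SimpleAnalyzer treats runs of letters as tokens and lowercases them, which is my reading of it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: mimics "runs of letters are tokens".
final class LetterTokens {

    // Splits text into lowercased runs of letters; every other character
    // (full stop, whitespace, digit, ...) acts as a separator and is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both spellings produce the same token sequence.
        System.out.println(tokenize("SAP.EM.FIN.AM")); // [sap, em, fin, am]
        System.out.println(tokenize("SAP EM FIN AM")); // [sap, em, fin, am]
    }
}
```

One side effect to watch for with this kind of splitting: digits are separators too, so in this sketch a skill such as "J2EE" would come out as the tokens "j" and "ee".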


"Carl Austin" <> wrote on 08/07/2009 10:50:08 AM:

> I may be oversimplifying here, but in this case don't you just need to
> use an analyzer that breaks the word "SAP.EM.FIN.AM" on full stops and
> throws them out, so that it is indexed as the terms "SAP" "EM" "FIN" "AM"?
> That is the same way it will index "SAP EM FIN AM", as long as you break
> on whitespace too, i.e. SimpleAnalyzer (runs of letter characters are
> tokens).
> Then the query for "SAP EM FIN AM" will match both.
> Carl
> -----Original Message-----
> From: Donna L Gresh [] 
> Sent: 07 August 2009 15:35
> To:
> Subject: Is there a way for me to handle a multiword synonym correctly?
> I saw some discussion on the board but I'm not sure I've got quite the
> same problem. As an example, I have a query that might be a technical
> skill:
> I would like that to match a document that has *either* SAP.EM.FIN.AM or
> "SAP EM FIN AM" (in that order and all together, not spread out through
> the document).
> The approach I had tried was at index time if I saw SAP.EM.FIN.AM I
> would consider "SAP EM FIN AM" a synonym for it, using the Lucene in
> Action example. Luke shows me that I have two terms in the index for
> this document: SAP.EM.FIN.AM and "SAP EM FIN AM" (one term). Thus it appears
> differently in the index than if it had been organically found as just
> the string of tokens, in which case there would be separate terms for
> SAP, EM, and so on. 
> At query time if I look for "SAP EM FIN AM" it is formed as a phrase
> query with a slop of 0 which does *not* match the one term version "SAP
> EM FIN AM". (For that matter, a simple boolean query doesn't find it
> either.) Luke confirms that the phrase query does not find my
> synonym term. The query "SAP EM FIN AM" finds *only* documents that
> originally had those separated tokens in them.
> Is there a way to handle this situation such that at index time I can
> turn SAP.EM.FIN.AM into something that will be found with a query for
> Thanks for any pointers
> Donna 
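
The mismatch described in the original message above (one multiword term in the index versus a zero-slop phrase query of four terms) can be illustrated with a toy model. This is a plain-Java sketch, not Lucene's PhraseQuery: the class `PhraseMatchSketch` and its method are hypothetical, and the "index" is just the list of terms for one document, ignoring real position increments.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of exact (slop 0) phrase matching over whole terms.
final class PhraseMatchSketch {

    // True if `phrase` occurs in `indexedTerms` as consecutive, whole terms.
    static boolean matchesExactPhrase(List<String> indexedTerms, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= indexedTerms.size(); start++) {
            if (indexedTerms.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("SAP", "EM", "FIN", "AM");

        // Synonym injected as ONE term containing spaces, as Luke showed.
        List<String> synonymDoc = Arrays.asList("SAP.EM.FIN.AM", "SAP EM FIN AM");
        // Document where the words were found as separate tokens.
        List<String> tokenDoc = Arrays.asList("knows", "SAP", "EM", "FIN", "AM");

        System.out.println(matchesExactPhrase(synonymDoc, query)); // false
        System.out.println(matchesExactPhrase(tokenDoc, query));   // true
        // A single-term lookup fails as well: "SAP" is not a term in synonymDoc.
        System.out.println(synonymDoc.contains("SAP"));            // false
    }
}
```

Terms are compared as whole strings, so the single term "SAP EM FIN AM" never equals any of the four query terms, which is consistent with neither the phrase query nor the boolean query finding it.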
