ctakes-dev mailing list archives

From "Chen, Pei" <Pei.C...@childrens.harvard.edu>
Subject CTAKES-63 - Lucene search breaks with a dash(-) and a special tokens such as brackets ]
Date Mon, 01 Oct 2012 21:33:46 GMT
Hi folks,
I was looking into the bug https://issues.apache.org/jira/browse/CTAKES-63,
where the Lucene dictionary lookup breaks on a search string such as: "mailto:abcoman@t-nec.org<mailto:abcoman@t-nec.org>]"
After some debugging, it turns out this happens when the token contains a dash (-) and also
contains a special character such as the right bracket (]).
I believe all of the characters in the str token passed to the QueryParser should be escaped
to avoid issues such as a token ending with ']'.
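
To make the escaping idea concrete, here is a minimal stand-alone sketch that does not need Lucene on the classpath. The escape() method below is a hypothetical stand-in for QueryParser.escape(), and the list of special characters is my assumption of the Lucene 3.x query syntax characters, not something taken from the cTAKES code. The token is a simplified version of the one in the bug report.

```java
public class EscapeSketch {
    // Assumed set of Lucene 3.x query-syntax characters (an assumption, not from the cTAKES source).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    /** Hypothetical stand-in for QueryParser.escape(): prefixes each special character with a backslash. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simplified token: contains a dash and ends with a right bracket.
        String token = "abcoman@t-nec.org]";
        // Same order as the proposed fix: replace dashes first, then escape what is left.
        String replaced = token.replace('-', ' ');
        System.out.println(escape(replaced)); // prints: abcoman@t nec.org\]
    }
}
```

Note that the space left behind by the dash replacement is not escaped, so whether the parsed query still matches depends on how the index was built, which is the question raised in the next paragraph.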

Before we add and test the proposed fix (adding an escape() call, as in the snippet below), I also
noticed another potential issue: we search and replace all dashes with spaces.  Just wanted
to confirm that this was done intentionally and works because the dashes have already
been removed in the index.  Otherwise, we'll need to replace the dash with a '?'
instead of a space, or use a PhraseQuery instead of a TermQuery.  It would be great if someone
familiar with this bit of code could confirm...

LuceneDictionaryImpl.java (dictionary-lookup) [~Line 106]

              if (str.indexOf('-') == -1) {
                     q = new TermQuery(new Term(iv_lookupFieldName, str));
                     topDoc = iv_searcher.search(q, iv_maxHits);
              } else {  // needed the KeywordAnalyzer for situations where the hyphen was included in the f-word
                     QueryParser query = new QueryParser(Version.LUCENE_30, iv_lookupFieldName, new KeywordAnalyzer());
                     try {
                            //topDoc = iv_searcher.search(query.parse(str.replace('-', ' ')),
                            //proposed fix
                            String escaped = QueryParser.escape(str.replace('-', ' '));
                            topDoc = iv_searcher.search(query.parse(escaped), iv_maxHits);
                     } catch (ParseException e) {
                            // TODO Auto-generated catch block
