lucene-solr-user mailing list archives

From trhodesg <trhodes...@gmail.com>
Subject Re: Have anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory) ?
Date Fri, 20 Mar 2015 16:10:44 GMT

  
    
  
  
    Sorry, I can see my post is munged.
      This link seems to display it legibly:
         
http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-td4173808.html

      
      I'm new to all this, so I hesitate to say the indexing isn't
      correct. But my understanding is that the phrase query "republic
      of china" will only match tokens indexed at consecutive positions:
      republic(n) of(n+1) china(n+2). Since the original APTF indexes
      this as republic(n) of(n+3) china(n+7), that query will fail.
      Wouldn't it be more logical to leave the original token numbering
      unchanged and just add the phrase token at the same position as
      the last word in the matched series?
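
      To make the position argument concrete, here is a minimal sketch in
      plain Python. It is a toy model of exact phrase matching, not Lucene
      or APTF code: the helper names build_index and phrase_matches are
      made up for illustration, and the gapped stream simply copies the
      n, n+3, n+7 pattern described above with n = 3.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each term to the set of positions where it occurs."""
    index = defaultdict(set)
    for term, pos in tokens:
        index[term].add(pos)
    return index


def phrase_matches(index, phrase):
    """Toy exact phrase match: term k of the phrase must sit at start + k."""
    terms = phrase.split()
    return any(
        all(start + k in index.get(term, ()) for k, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


# Consecutive positions, as a plain tokenizer would assign them.
standard = [("the", 1), ("peoples", 2), ("republic", 3),
            ("of", 4), ("china", 5), ("is", 6)]

# Gapped positions following the n, n+3, n+7 pattern (assumed n = 3).
gapped = [("republic", 3), ("of", 6), ("china", 10)]

# Proposal: original positions unchanged, phrase token stacked on the
# position of the last word in the matched series.
proposed = standard + [("peoplesrepublicofchina", 5)]

for name, tokens in [("standard", standard), ("gapped", gapped),
                     ("proposed", proposed)]:
    print(name, phrase_matches(build_index(tokens), "republic of china"))
```

      In this model the gapped stream is the only one where the phrase
      query fails, while the proposed stream still matches both the plain
      phrase and the single phrase token.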
      
      BTW, I looked at your code re this. It is quite informative to a
      newbie. Thanks!
      
      
      On 3/19/2015 11:38 AM, James Strassburg [via Lucene] wrote: 
    
      Sorry, I've been a bit unfocused from this list for a bit. When I
      was working with the APTF code I rewrote a big chunk of it and left
      out the inclusion of the original tokens, as I didn't need it at the
      time. That feature could easily be added back in. I will see if I
      can find a bit of time for that.

      As for the other part of your message, are you suggesting that the
      token indexes are not correct? There is a bit of a formatting issue
      with the text and I'm not sure what you're getting at. Can you
      explain further please?
      
      
      On Sun, Feb 8, 2015 at 3:04 PM, trhodesg <[hidden email]> wrote:
      
      
        > Thanks to everyone for the thought, time and effort put into
        > AutoPhrasingTokenFilter (APTF)! It's a real lifesaver.
        > While trying to add APTF to my indexing, I discovered that the
        > original (TS) version throws an exception while indexing a 100MB
        > PDF. The error is "Exception writing document to the index;
        > possible analysis error". The modified (JS) version runs without
        > error, but it removes the tokens used to create the phrase. They
        > are needed.
        
        > Before looking into this I have a question. Solr would normally
        > tokenize the phrase "the peoples republic of china is" as
        > the(1) peoples(2) republic(3) of(4) china(5) is(6)
        > Defining the APTF phrase file as [...] the Solr admin analysis
        > page reports that the APTF indexer tokenizes the phrase as [...]
        > Would it be possible for someone to explain the reasoning behind
        > the discontinuous token numbering? As it is now, phrase queries
        > such as "republic of china" will fail. And I can't get proximity
        > queries like "republic of"~10 to work either (though it seems
        > they should). Wouldn't it be more flexible to return the
        > following tokenization? [...]
        > This allows spurious matches such as "peoples peoplesrepublic",
        > but it seems like this type of event would be very rare. It has
        > the advantage of allowing phrase queries to continue working the
        > way most users expect.
        
        > Thank you for supporting more than one entity definition per
        > phrase (i.e. peoplesrepublic and peoplesrepublicofchina). This
        > type of contraction is common in longer documents, especially
        > when the first used phrase ends with a preposition. It helps
        > support robust matching.
        
        > --
        > View this message in context:
        > http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4184888.html
        > Sent from the Solr - User mailing list archive at Nabble.com.
      
      
      
      
      
    
    
  





--
View this message in context: http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4194205.html
Sent from the Solr - User mailing list archive at Nabble.com.