lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation
Date Fri, 04 Mar 2011 08:45:36 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002491#comment-13002491 ]

Uwe Schindler edited comment on SOLR-2400 at 3/4/11 8:44 AM:
-------------------------------------------------------------

Stefan, this is a general issue with TokenStreams that add Tokens. TokenStreams that remove Tokens
*should* automatically preserve position, but not even all of those do that correctly (we
were fixing some of them lately). The way Lucene analysis works makes it impossible
to guarantee any correspondence of the position numbers. For the indexer, only what comes out
at the end is important; the steps in between are not interesting. AnalysisReqHandler,
on the other hand, does some bad "hacks" to look "inside" the analysis (by using temporary
TokenStreams that buffer tokens), which is not the general use case of TokenStreams.
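
To illustrate the position problem, here is a minimal sketch in plain Java. It does not use
the Lucene API: the Tok class and the helper methods are made up for this example, and
splitOnHyphen is only a rough stand-in for a filter that adds tokens, not a model of
WordDelimiterFilter. A removing filter can keep positions stable by carrying the skipped
increment over to the next token; a filter that adds tokens shifts everything after it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class PositionSketch {
    /** Hypothetical token: a term plus the position gap to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Whitespace tokenizer: every token advances the position by one. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            out.add(new Tok(t, 1));
        }
        return out;
    }

    /** Drops stop words and adds their increments to the next kept token. */
    static List<Tok> removeStopWords(List<Tok> in, Set<String> stop) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            if (stop.contains(t.term)) {
                skipped += t.posInc;
            } else {
                out.add(new Tok(t.term, t.posInc + skipped));
                skipped = 0;
            }
        }
        return out;
    }

    /** Splits terms on '-' and emits every part as its own token at a new position. */
    static List<Tok> splitOnHyphen(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            String[] parts = t.term.split("-");
            for (int i = 0; i < parts.length; i++) {
                out.add(new Tok(parts[i], i == 0 ? t.posInc : 1));
            }
        }
        return out;
    }

    /** Absolute positions are the running sum of the increments. */
    static List<Integer> positions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        for (Tok t : toks) {
            pos += t.posInc;
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> base = tokenize("the wi-fi router");
        System.out.println(positions(base));                                             // [1, 2, 3]
        System.out.println(positions(removeStopWords(base, Collections.singleton("the")))); // [2, 3]
        System.out.println(positions(splitOnHyphen(base)));                              // [1, 2, 3, 4]
    }
}
```

After the stop word is removed, "router" is still at position 3, so it can be matched to the
previous stage. After the split it sits at position 4, and the position number alone no longer
says which input token it came from.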

I wonder a little bit about your XML file: it only contains text and position, but it should
also contain rawTerm, startOffset, endOffset. When I call analysis I get all of those attributes,
not only two of them. Is this a hand-made file, or what is the problem? Which Solr version?

One possibility to handle this might be the char offset in the original text: the req handler
could use the character offsets of the begin and end of the token in the original
stream instead of the token position, but this is likely to break for lots of TokenFilters
(WordDelimiterFilter would work as long as you don't do stemming before...). The problem is
incorrect handling of offset calculation (also leading to bugs in highlighting) when the inserted
terms are longer than their originals.
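
A rough sketch of that offset idea, again in plain Java without the Lucene API (the Tok class
and the relate method are hypothetical names for this example). It relates each token of a
later stage to the earlier-stage tokens whose character spans it overlaps, and it only works
as long as every filter reports correct offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetRelationSketch {
    /** Hypothetical token carrying its half-open character span [start, end) in the original text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;
        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** True when the two spans share at least one character. */
    static boolean overlaps(Tok a, Tok b) {
        return a.start < b.end && b.start < a.end;
    }

    /** For each token in 'after', lists the indexes of the tokens in 'before' that it overlaps. */
    static List<List<Integer>> relate(List<Tok> before, List<Tok> after) {
        List<List<Integer>> out = new ArrayList<>();
        for (Tok a : after) {
            List<Integer> parents = new ArrayList<>();
            for (int i = 0; i < before.size(); i++) {
                if (overlaps(before.get(i), a)) {
                    parents.add(i);
                }
            }
            out.add(parents);
        }
        return out;
    }

    public static void main(String[] args) {
        // Offsets refer to the text "the wi-fi router".
        List<Tok> before = Arrays.asList(
                new Tok("the", 0, 3), new Tok("wi-fi", 4, 9), new Tok("router", 10, 16));
        // Assumed output of a later stage: "the" dropped, "wi-fi" split in two.
        List<Tok> after = Arrays.asList(
                new Tok("wi", 4, 6), new Tok("fi", 7, 9), new Tok("router", 10, 16));
        System.out.println(relate(before, after)); // [[1], [1], [2]]
    }
}
```

Index 0 ("the") never shows up in the result, which marks it as dropped, while both "wi" and
"fi" point back to index 1. If a filter hands out wrong offsets, the overlap test silently
produces wrong relations.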

Altogether: it's unlikely that you can implement that in a way that works for all combinations
of TokenStream components.

      was (Author: thetaphi):
    Stefan, this is an egeneral issue of TokenStreams adding Tokens. TokenStreams that remove
Tokens *should* automatically preserve position, but not even all of those do that correctly
(we were fixing some of them lately). The way of how the Lucene analysis works makes it impossible
to guarantee any corresponence of the position numbers. Because for the indexer its only important
what comes out at the end, the steps inbetween are impossible. AnalysisReqHandler on the other
hand does some bad "hacks" to look "inside" the analysis (by using temporary TokenStreams
that buffer tokens), which are not the general use-case of TokenStreams.

I wonder a little bit about your xml file, it only contains text and position, but it should
also contain rawTerm, startOffset, endOffset. When I call analysis i get all of those attributes
not only two of them. Is this a hand-made file or what is the problem? Which Solr version?

One possibility to handle the thing might be the char offset in the original text, because
that one should point to the character offset of begin and end of the token in the original
stream instead of the token position, but this is likely to break for lots of TokenFilters
(WordDelimiterFilter would work as long as you don't do stemming before...). The problem is
incorrect handling of offset calculation (also leading to bugs in highlighting) when the inserted
terms are longer than their originals.

Alltogether: Its unlikely that you can implement that and it will work for all combinations
of TokenStream components.
  
> FieldAnalysisRequestHandler; add information about token-relation
> -----------------------------------------------------------------
>
>                 Key: SOLR-2400
>                 URL: https://issues.apache.org/jira/browse/SOLR-2400
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Stefan Matheis (steffkes)
>            Priority: Minor
>         Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 110303_FieldAnalysisRequestHandler_view.png
>
>
> The XML-Output (simplified example attached) is missing one small piece of information .. which
could be very useful to build a nice Analysis-Output, and that's "Token-Relation" (if there
is a special/correct word for this, please correct me).
> Meaning that it is actually not possible to "follow" the Analysis-Process (completely) when
the Tokenizers/Filters drop Tokens (e.g. StopWord) or split them into multiple Tokens
(e.g. WordDelimiter).
> Would it be possible to include this information? If so, it would be possible to create
an improved Analysis-Page for the new Solr Admin (SOLR-2399) - short scribble attached
-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

