nutch-dev mailing list archives

From "byron miller (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets
Date Wed, 07 Dec 2005 21:49:08 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359649 ] 

byron miller commented on NUTCH-134:
------------------------------------

I would trade more CPU for better summaries any day :) CPU power is cheaper than manual intervention!

If any testing is needed, don't hesitate to send me a patch. I've been working on a 500-million-page
index using the mapred branch on a 10-node cluster, so I have plenty of numbers to test against.

> Summarizer doesn't select the best snippets
> -------------------------------------------
>
>          Key: NUTCH-134
>          URL: http://issues.apache.org/jira/browse/NUTCH-134
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Andrzej Bialecki
>
> Summarizer.java tries to select the best fragments from the input text, where the frequency
> of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add()
> operation will add new excerpts only if they are not already present - the test is performed
> using the Comparator that compares only the numUniqueTokens. This means that if there are
> two or more excerpts, which score equally high, only the first of them will be retained, and
> the rest of equally-scoring excerpts will be discarded, in favor of other excerpts (possibly
> lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To keep the relative
> position of excerpts in the original order the Excerpt class should be extended with an "int
> order" field, and the collected excerpts should be sorted in that order prior to adding them
> to the summary.
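
For anyone testing a patch, the behaviour described above can be reproduced in isolation with
a rough sketch like the one below. It is not the Summarizer.java code: the Excerpt class here
is a hypothetical stand-in with just a text, a numUniqueTokens score and the proposed "order"
field, and it assumes excerptSet is a sorted set (such as a TreeSet) built with a
score-only Comparator. The names ExcerptSelection, selectWithSet and selectWithList are made
up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class ExcerptSelection {

    // Hypothetical stand-in for the Excerpt class; only the fields needed here.
    static final class Excerpt {
        final String text;
        final int numUniqueTokens; // score: number of distinct query terms matched
        final int order;           // proposed field: position in the original text

        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
            this.order = order;
        }
    }

    // Compares on score only, highest score first.
    static final Comparator<Excerpt> BY_SCORE =
        (a, b) -> Integer.compare(b.numUniqueTokens, a.numUniqueTokens);

    // Compares on position in the original text.
    static final Comparator<Excerpt> BY_ORDER =
        (a, b) -> Integer.compare(a.order, b.order);

    // Reported behaviour: a sorted set treats compare() == 0 as "already present",
    // so every excerpt after the first with a given score is silently dropped.
    static List<Excerpt> selectWithSet(List<Excerpt> candidates, int max) {
        SortedSet<Excerpt> excerptSet = new TreeSet<>(BY_SCORE);
        excerptSet.addAll(candidates);
        List<Excerpt> best = new ArrayList<>();
        for (Excerpt e : excerptSet) {
            if (best.size() == max) break;
            best.add(e);
        }
        return best;
    }

    // Proposed fix: keep all candidates in a List, sort by score, take the top
    // ones, then put them back into original text order for the summary.
    static List<Excerpt> selectWithList(List<Excerpt> candidates, int max) {
        List<Excerpt> sorted = new ArrayList<>(candidates);
        sorted.sort(BY_SCORE); // List.sort is stable, so ties keep their input order
        int n = Math.min(max, sorted.size());
        List<Excerpt> best = new ArrayList<>(sorted.subList(0, n));
        best.sort(BY_ORDER);
        return best;
    }

    public static void main(String[] args) {
        List<Excerpt> candidates = Arrays.asList(
            new Excerpt("first high-scoring fragment", 3, 0),
            new Excerpt("second high-scoring fragment", 3, 1),
            new Excerpt("low-scoring fragment", 1, 2));

        for (Excerpt e : selectWithSet(candidates, 2)) {
            System.out.println("set:  " + e.text);
        }
        for (Excerpt e : selectWithList(candidates, 2)) {
            System.out.println("list: " + e.text);
        }
    }
}
```

With two excerpts scoring 3 and one scoring 1, the set-based version keeps the first
high-scoring excerpt plus the low-scoring one, while the list-based version keeps both
high-scoring excerpts in their original order.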

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

