nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: vote results.
Date Thu, 15 Dec 2005 16:50:33 GMT
Stefan Groschupf wrote:

> Hi,
> I counted the votes manually; I hope I didn't overlook anything. I
> didn't filter out issues that are 0.8 related, since it is good to  
> know community wishes anyway. :-)


Shouldn't the period for voting be a bit longer? I didn't have time to 
vote yet... Anyway, my take on this:


> NUTCH-140    Add alias capability in parse-plugins.xml file that 
> allows  mimeType->extensionId mapping
> 1
> NUTCH-139    Standard metadata property names in the ParseData metadata
> 2

+1

> NUTCH-138    non-Latin-1 characters cannot be submitted for search
> 1
> NUTCH-3    multi values of header discarded   
> 1


+1

>
> NUTCH-134    Summarizer doesn't select the best snippets   
> 1


+1
I have some patches that use the Lucene Highlighter package instead.

> NUTCH-98    RobotRulesParser interprets robots.txt incorrectly
> 1
> NUTCH-120    one "bad" link on a page kills parsing   
> 3
> NUTCH-127    uncorrect values using -du, or ls does not return items
> 2


+1

> NUTCH-126    Fetching via https does not work with a proxy (patch)
> 1
> NUTCH-125    OpenOffice Parser plugin   
> 2


+1. Ready to commit, I'll do it tomorrow.

> NUTCH-110    OpenSearchServlet outputs illegal xml characters
> 1
> NUTCH-36    Chinese in Nutch   
> 1
> NUTCH-123    Cache.jsp some times generate NullPointerException
> 1
> NUTCH-121    SegmentReader for mapred   
> 2


Nearly ready to commit; I can probably do it by the end of the week.
However, this is valid only for the mapred branch, so it doesn't affect 
the release.

> NUTCH-119    Regexp to extract outlinks incorrect   
> 1
> NUTCH-115    jobtracker.jsp shows too much information   
> 1
> NUTCH-108    tasktracker crashs when reconnecting to a new jobtracker.
> 1
> NUTCH-113    Disable permanent DNS-to-IP caching for JVM 1.4
> 1
> NUTCH-111    ndfs.replication is not documented within the
> nutch-default.xml configuration file.
> 1
> NUTCH-100    New plugin urlfilter-db   
> 1
> NUTCH-106    Datanode corruption   
> 1
> NUTCH-95    DeleteDuplicates depends on the order of input segments
> 1


+1

> NUTCH-92    DistributedSearch incorrectly scores results   
> 2


+1. However, solving this correctly is _hard_ ... it's a very similar 
problem to the MultiSearcher in Lucene, and it took that group quite 
some time to reach an acceptable solution...

> NUTCH-91    empty encoding causes exception   
> 1
> NUTCH-52    Parser plugin for MS Excel files   
> 1
> NUTCH-74    French Analyzer Plugin   
> 1
> NUTCH-64    no results after a restart of a search--server (without  
> tomcat restart)
> 1
> NUTCH-68    A tool to generate arbitrary fetchlists   
> 1
> NUTCH-62    Add html META tag information into metaData in index-more  
> plugin
> 1
> NUTCH-61    Adaptive re-fetch interval. Detecting umodified content
> 1

+1. I think this is an important feature. I have some patches, which 
need to be updated. However, I wouldn't be so bold as to commit them 
just before a release. There are quite a few subtle issues with the 
segment handling if you use this.

> NUTCH-13    If dns points to 127.0.0.1, the url is also crawled
> 1
> NUTCH-48    "Did you mean" query enhancement/refignment feature request
> 1
> NUTCH-45    Log corrupt segments in SegmentMergeTool   
> 1
> NUTCH-24    Cannot handle incorrectly cased Content-Type   
> 1


Isn't this solved already?

> NUTCH-16    boost documents matching a url pattern   
> 1
>
>
>


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


