nutch-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable
Date Tue, 28 Nov 2006 14:44:22 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453934 ] 
            
Chris A. Mattmann commented on NUTCH-407:
-----------------------------------------

I'm not entirely sure what the right answer to this is. One thing I do know is that a
colleague of mine ran into this exact same issue when he first tried to use Nutch
for his enterprise search application. It confused the heck out of him, and he ended up adding
to urlfilter-regex what Andrzej mentions above, i.e., only crawl from the top level down.
He told me that he thought this was a "kludge", and I can't say that I disagreed with
him. My +1 for figuring out a better way to solve this problem...
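
For readers unfamiliar with the workaround, here is a rough Python sketch of the
"only crawl from the top level down" idea. It is not Nutch code: it only imitates
"+regex" / "-regex" rules where the first matching rule decides and a URL that matches
no rule is rejected. The crawl root file:///data/docs/ and the names RULES and accept
are assumptions made up for this illustration.

```python
import re

# Hypothetical rule list in a "+regex / -regex" style.
# The crawl root below is an assumed example path, not taken from the issue.
RULES = [
    ("+", re.compile(r"^file:///data/docs/")),  # accept anything under the root
    ("-", re.compile(r".")),                    # reject everything else
]


def accept(url, rules=RULES):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


if __name__ == "__main__":
    for candidate in (
        "file:///data/docs/reports/q3.txt",  # below the root
        "file:///data/",                     # parent of the root
    ):
        print(candidate, accept(candidate))
```

With rules like these, a parent directory such as file:///data/ is dropped before it
is fetched. The sketch assumes URLs are already normalized; a path containing ".."
would need to be resolved before the prefix test means anything.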

> Make Nutch crawling parent directories for file protocol configurable
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-407
>                 URL: http://issues.apache.org/jira/browse/NUTCH-407
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Thorsten Scherler
>         Assigned To: Andrzej Bialecki 
>         Attachments: 407.fix.diff
>
>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
> I am looking into fixing some very weird behavior of the file protocol.
> I am using 0.8.
> Researching this topic I found 
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
> and
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> I am on Ubuntu but I have the same problem that nutch is going down the
> tree (including parents) and not up (including children from the root
> url).
> Further, I would vote to make fetching parents optional and controlled by
> a property, so that I can decide whether I want this not very intuitive "feature".


        
