nutch-dev mailing list archives

From "Alan Tanaman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable
Date Tue, 28 Nov 2006 14:37:23 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ] 
            
Alan Tanaman commented on NUTCH-407:
------------------------------------

In our team we feel that this patch would have been beneficial in practical terms.  In the
context of the enterprise intelligence solution which we are gradually porting over to Nutch,
the emphasis is on ease of configuration.  We try to avoid exposing features such as the regex
filter which, although very powerful for a more experienced user, are perhaps confusing
to the novice.  This is because we are primarily focused on the enterprise and less on the
WWW.

This is why we preconfigure the db.ignore.external.links property to "true", and then only
the urls file is used to seed the crawl.

Our ideal is to have a collection of predefined configuration settings for specific scenarios
-- e.g. Enterprise-XML, Enterprise-Documents, Enterprise-Database, Internet-News etc.  We
have a script that generates multiple crawlers, each one with different sources to be crawled,
and although it is possible, it isn't very practical to change the filters for each one manually
based on the individual user requirements.

I realise this patch is closed, but how about another approach, in which FileResponse.java
looks at db.ignore.external.links and decides based on this whether to go up the tree?

Obviously, this would also prevent you from crawling outlinks to the WWW embedded in documents,
but when crawling an enterprise file system, you usually don't want to go all over the place
anyway.  As I see it, file systems are different to the web in that they are inherently hierarchical,
whereas the web is, as its name implies, non-hierarchical.  Therefore, when crawling a file
system, "going up" the tree is just as much an external URI (so to speak) as a link to a web
site.
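
To make the suggestion concrete, here is a minimal sketch of the decision I have in mind.
It is not taken from FileResponse.java or any other Nutch class: the class name
ParentLinkPolicy, the method shouldFollowParent and the idea of passing in the seed URL are
all hypothetical, and the boolean simply stands in for the value of db.ignore.external.links.
It also assumes plain hierarchical file: URLs.

```java
import java.net.URI;

// Hypothetical helper, NOT part of Nutch: illustrates treating the parent
// directory as "external" when it lies outside the seed directory.
class ParentLinkPolicy {

    /**
     * @param ignoreExternalLinks stand-in for the db.ignore.external.links value
     * @param seedUrl             directory URL the crawl was seeded with (assumed known)
     * @param dirUrl              directory currently being listed
     * @return true if the "../" link of dirUrl should be followed
     */
    public static boolean shouldFollowParent(boolean ignoreExternalLinks,
                                             String seedUrl, String dirUrl) {
        if (!ignoreExternalLinks) {
            return true; // no restriction requested: always go up the tree
        }
        String seedPath = directoryPath(seedUrl);
        String dirPath = directoryPath(dirUrl);
        if (dirPath.equals("/")) {
            return false; // the file system root has no parent
        }
        // Drop the last path segment to obtain the parent directory.
        int cut = dirPath.lastIndexOf('/', dirPath.length() - 2);
        String parent = dirPath.substring(0, cut + 1);
        // The parent is "internal" only if it is still at or below the seed.
        return parent.startsWith(seedPath);
    }

    // Normalised path of a file: URL, always ending in "/".
    static String directoryPath(String url) {
        String path = URI.create(url).normalize().getPath();
        return path.endsWith("/") ? path : path + "/";
    }
}
```

With the flag set, a crawl seeded at file:/data/docs/ would still be able to go from
file:/data/docs/reports/2006 up to file:/data/docs/reports/, but would not go from the seed
itself up to file:/data/.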

*Ducks for cover*

Alan

> Make Nutch crawling parent directories for file protocol configurable
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-407
>                 URL: http://issues.apache.org/jira/browse/NUTCH-407
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Thorsten Scherler
>         Assigned To: Andrzej Bialecki 
>         Attachments: 407.fix.diff
>
>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06698.html
> I am looking into fixing some very weird behavior of the file protocol.
> I am using 0.8.
> Researching this topic I found 
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06536.html
> and
> http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
> I am on Ubuntu but I have the same problem that nutch is going down the
> tree (including parents) and not up (including children from the root
> url).
> Further I would vote to make the fetch-parents optional and defined per
> a property whether I would like this not very intuitive "feature".

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
