nutch-dev mailing list archives

From "Moreno Feltscher (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script
Date Mon, 15 Jan 2018 23:51:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326640#comment-16326640 ]

Moreno Feltscher commented on NUTCH-2496:
-----------------------------------------

[~markus17]: Thanks for the hint. This is something I still don't fully understand: at which
steps exactly are those filters/normalizers applied?

In my case I only have a {{regex-urlfilter.txt}} file as well as the following plugin configuration:
{code:xml}
    <property>
        <name>plugin.includes</name>
        <value>
            protocol-httpclient|protocol-http|urlfilter-regex|index-(basic|anchor|metadata)|headings|language-identifier|query-(basic|site|url|lang)|indexer-elastic-rest|parse-(text|html|tika|metatags)|urlnormalizer-(pass|regex|basic)
        </value>
    </property>
{code}

Would it make sense to disable filtering/normalization in LinkDB?
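
For reference, a minimal sketch of what disabling both would look like for the inversion step. It relies on the {{-noNormalize}} and {{-noFilter}} options of {{bin/nutch invertlinks}}; the crawl path and segment name are hypothetical placeholders, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: prints the invertlinks command instead of executing it.
# The crawl path and segment name passed in below are hypothetical.

# Build the link inversion command for a single segment with URL
# normalization and URL filtering switched off.
# $1 = crawl directory, $2 = segment name
invertlinks_cmd() {
  printf '%s\n' "bin/nutch invertlinks $1/linkdb $1/segments/$2 -noNormalize -noFilter"
}

invertlinks_cmd "crawl" "20180115235100"
```

Whether skipping these steps is safe depends on whether the URLs in the segments were already filtered and normalized earlier in the crawl cycle, which is exactly the open question here.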

> Speed up link inversion step in crawling script
> -----------------------------------------------
>
>                 Key: NUTCH-2496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2496
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
>            Priority: Major
>
> While working on a project where I have to index a huge number of URLs I encountered
> an issue with the link inversion step of the crawling script. A while ago Ian Lopata stumbled
> upon the same issue as described here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
> {quote}
> I am running the invertlinks step in my Nutch 1.6 based crawl process on a 
> single node.  I run invertlinks only because I need the Inlinks in the 
> indexer step so as to store them with the document.  I do not need the 
> anchor text and I am not scoring.  I am finding that invertlinks (and more 
> specifically the merge of the linkdb) takes a long time - about 30 minutes 
> for a crawl of around 150K documents.  I am looking for ways that I might 
> shorten this processing time.  Any suggestions? 
> {quote}
> Back then [~wastl-nagel] suggested turning off the normalizers and filters during the
> inversion step, which speeds up the process considerably.
> In my case, however, I depend on those, so this is not a real solution.
> I opened this issue to get feedback on how we could improve the crawl script
> and speed up the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
