nutch-dev mailing list archives

From "Moreno Feltscher (JIRA)"
Subject [jira] [Created] (NUTCH-2496) Speed up link inversion step in crawling script
Date Fri, 12 Jan 2018 23:33:00 GMT
Moreno Feltscher created NUTCH-2496:

             Summary: Speed up link inversion step in crawling script
                 Key: NUTCH-2496
             Project: Nutch
          Issue Type: Improvement
            Reporter: Moreno Feltscher

While working on a project where I have to index a huge number of URLs, I ran into a performance
problem with the link inversion step of the crawling script. A while ago, Ian Lopata ran into
the same issue and described it as follows:
I am running the invertlinks step in my Nutch 1.6 based crawl process on a 
single node.  I run invertlinks only because I need the Inlinks in the 
indexer step so as to store them with the document.  I do not need the 
anchor text and I am not scoring.  I am finding that invertlinks (and more 
specifically the merge of the linkdb) takes a long time - about 30 minutes 
for a crawl of around 150K documents.  I am looking for ways that I might 
shorten this processing time.  Any suggestions? 

Back then, [~wastl-nagel] suggested turning off the normalizers and filters during the inversion
step, which speeds up the process considerably.
In my case, however, I depend on the normalizers and filters, so this is not a real solution.
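
For reference, the suggestion above roughly corresponds to passing the -noNormalize and -noFilter
options to the invertlinks command. The following is only a sketch: the function name
invertlinks_cmd, the NUTCH_BIN variable, and the crawl/linkdb and crawl/segments paths are
illustrative assumptions, and the function prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: builds and prints an invertlinks command line, does not run Nutch.
# invertlinks_cmd, NUTCH_BIN and the example paths are hypothetical names.

invertlinks_cmd() {
  # $1 = linkdb path, $2 = segments directory, $3 = optional "fast"
  cmd="${NUTCH_BIN:-bin/nutch} invertlinks $1 -dir $2"
  if [ "${3:-}" = "fast" ]; then
    # Skip URL normalization and filtering during link inversion.
    cmd="$cmd -noNormalize -noFilter"
  fi
  printf '%s\n' "$cmd"
}

invertlinks_cmd crawl/linkdb crawl/segments fast
# prints: bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter
```

Leaving out the third argument keeps normalization and filtering enabled, which is the slow case
this issue is about.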

I opened this issue to get feedback on how we could improve the crawl script and speed up the
link inversion step.
