nutch-dev mailing list archives

From "Ferdy Galema (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-882) Design a Host table in GORA
Date Fri, 20 Apr 2012 09:58:41 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-882:
-------------------------------

    Attachment: NUTCH-882-v3.txt
                NUTCH-882-v3.txt

New version of the patch. (I am finishing this issue on behalf of Mathijs. Nevertheless, he has
done much of the hard work!)

Building hostdb links (inlinks and outlinks at the host level) works now too. Use:
org.apache.nutch.host.HostDbUpdateJob -linkDb
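
To illustrate what "inlinks and outlinks at the host level" means, here is a minimal sketch in Python. It is not the code from the patch (which is a Java job running against Gora); the function names and the choice to skip links within a single host are assumptions made only for this example. It collapses page-to-page links into per-host link counts:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the host name of a URL (lowercased by urlsplit), or None."""
    return urlsplit(url).hostname


def build_host_links(page_links):
    """Aggregate (source_url, target_url) pairs into host-level link counts.

    Returns (outlinks, inlinks), where outlinks[h][t] is the number of page
    links going from host h to host t, and inlinks[t][h] is the same count
    indexed by the target host. Links that stay within one host are skipped;
    that is an assumption of this sketch, not something the patch states.
    """
    outlinks = defaultdict(lambda: defaultdict(int))
    inlinks = defaultdict(lambda: defaultdict(int))
    for source_url, target_url in page_links:
        src, dst = host_of(source_url), host_of(target_url)
        if src is None or dst is None or src == dst:
            continue  # unparsable URL or host-internal link
        outlinks[src][dst] += 1
        inlinks[dst][src] += 1
    return outlinks, inlinks
```

Given two pages on a.example that each link to b.example, the sketch records a single host-level outlink entry a.example -> b.example with a count of 2, and the matching inlink entry for b.example.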

This patch adds Host store definitions to the Gora mapping for HBase only. (Other stores can
easily be added later on.) It needs GORA-105, so you can only use the added functionality
with a trunk version of Gora, or once Nutchgora updates to Gora 0.2 (which should be
soon).

No tests are included yet. For now this is okay, because by default this patch does not change
existing functionality. (Also, it's a bit of a pain to add tests: the current tests depend
on a valid SQLStore, but updating Gora results in a dropped SQLStore, so there is an issue that
needs to be solved first, and that belongs in a separate issue.)

Will commit this in a few days.
                
> Design a Host table in GORA
> ---------------------------
>
>                 Key: NUTCH-882
>                 URL: https://issues.apache.org/jira/browse/NUTCH-882
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: nutchgora
>            Reporter: Julien Nioche
>             Fix For: nutchgora
>
>         Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and domains?) would be very useful for:
> * customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages
> * keeping a copy of the robots.txt and possibly use that later to filter the webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome

