nutch-dev mailing list archives

From "Ferdy Galema (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x
Date Tue, 20 Nov 2012 10:34:58 GMT


Ferdy Galema commented on NUTCH-1495:

Fair enough.

I understand the reasoning behind deleting the normalized rows. I don't think we need to implement
normalizeKeep. We just need to be aware of the deletion behaviour and document it thoroughly.

About the inlinks/outlinks issue: My bad, I mistakenly thought that the inlinks are not
cleared prior to determining the new inlinks. As I've just checked, it seems that they *are*
cleared (see inlinkedScoreData.clear() in DbUpdateReducer). How the clearing of an entire
map is implemented depends on the DataStore. In any case it's not relevant to this issue.
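
To illustrate the clear-then-rebuild pattern described above, here is a minimal sketch. It is not the actual DbUpdateReducer code: the class InlinkUpdateSketch, the method updateInlinks and the plain java.util maps are hypothetical stand-ins, assuming only that stale inlinks are dropped before the new ones are written.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the update step; not Nutch's real reducer.
public class InlinkUpdateSketch {

    // Replace the stored inlinks of a page with the freshly determined ones.
    // Clearing first ensures that inlinks which no longer exist do not survive.
    static void updateInlinks(Map<String, String> storedInlinks,
                              Map<String, String> newInlinks) {
        storedInlinks.clear();           // drop everything from the previous update
        storedInlinks.putAll(newInlinks); // then write the current set
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("http://old.example/", "old anchor");

        Map<String, String> fresh = new HashMap<>();
        fresh.put("http://new.example/", "new anchor");

        updateInlinks(stored, fresh);
        System.out.println(stored.keySet());
    }
}
```

Without the clear() call the old entry would remain next to the new one, which is the behaviour I wrongly assumed.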
> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>                 Key: NUTCH-1495
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2
>            Reporter: Nathan Gass
>         Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, patch-updatedb-normalize-filter-2012-11-13.txt
> AFAIS in nutch 1.x you could change your url filters and normalizers during the crawl,
> and update the db using crawldb -normalize -filter. There does not seem to be a way to achieve
> the same in nutch 2.x?
> Anyway, I went ahead and tried to implement -normalize and -filter for the nutch 2.x
> updatedb command. I have no experience with any of the technologies used, including Java, so
> please check the attached code carefully before using it. I'm very interested to hear whether this
> is the right approach, and in any other comments.

