nutch-dev mailing list archives

From "Nathan Gass (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x
Date Tue, 20 Nov 2012 09:40:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500943#comment-13500943 ]

Nathan Gass commented on NUTCH-1495:
------------------------------------

I remember running into an exception when adding the newly normalized links directly to the outlinks,
that's why I used a newNormalizations map. Removing has not thrown any exception so far; perhaps
the removeFromOutlinks method or the getOutlinks method is helping here (as I said, I'm no Java
programmer)?
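
A minimal sketch of the buffering pattern described above, assuming the exception was a
ConcurrentModificationException (the comment does not name it). The class name, the
normalize helper and the plain Map standing in for the outlinks are hypothetical and are
not the patch's actual code; only the newNormalizations name comes from the comment.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class OutlinkNormalizeSketch {

    // Hypothetical stand-in for a URL normalizer: strips a ";jsessionid=..." suffix.
    public static String normalize(String url) {
        int i = url.indexOf(";jsessionid=");
        return i < 0 ? url : url.substring(0, i);
    }

    // Rewrites the keys of an outlinks map (url -> anchor text) to their normalized form.
    public static void normalizeOutlinks(Map<String, String> outlinks) {
        // Buffer for new entries: putting a new key into the map while iterating
        // over it can make the iterator throw ConcurrentModificationException.
        Map<String, String> newNormalizations = new HashMap<>();

        Iterator<Map.Entry<String, String>> it = outlinks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            String normalized = normalize(entry.getKey());
            if (!normalized.equals(entry.getKey())) {
                newNormalizations.put(normalized, entry.getValue());
                // Removing through the iterator is the safe way to delete mid-iteration.
                it.remove();
            }
        }

        // Iteration is finished, so adding the buffered entries is safe now.
        outlinks.putAll(newNormalizations);
    }

    public static void main(String[] args) {
        Map<String, String> outlinks = new LinkedHashMap<>();
        outlinks.put("http://example.com/a;jsessionid=123", "A");
        outlinks.put("http://example.com/b", "B");
        normalizeOutlinks(outlinks);
        System.out.println(outlinks);
    }
}
```

Whether removal during iteration is safe in the real code depends on how removeFromOutlinks
and getOutlinks are implemented; the sketch avoids the question by removing through the
iterator.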

About testing: my current setup is actually not distributed (so all my tests were in local
mode), and I have not yet looked into the nutch 1.x tests. If they have anything about crawldb -normalize
-filter, I could reuse that. I assume these two are the minimum to get the patch in. If anything
else is missing, please let me know.

I'm currently of the opinion that just removing keys which were normalized is the best default
approach. The newly normalized outlinks will add a new entry if necessary, and we avoid any
possible inconsistencies at the cost of some refetches. Moreover, this avoids the additional
cost of having to read and write all webpage fields when -normalize is enabled.
My own use case is to remove dupes caused by previously unknown session ids, so I'll have
most normalized urls already in the db anyway.
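
The default described above can be sketched as follows. This is only an illustration under
stated assumptions: a plain Map keyed by URL stands in for the web page store, and the
class, method and stripSessionId normalizer are hypothetical names, not part of the patch
or of Nutch.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

public class DropNormalizedKeysSketch {

    // Hypothetical normalizer for the session-id use case: strips ";jsessionid=...".
    public static String stripSessionId(String url) {
        int i = url.indexOf(";jsessionid=");
        return i < 0 ? url : url.substring(0, i);
    }

    // Removes every row whose key changes under normalization and returns the count.
    // Nothing is copied to the normalized key, so no page fields are read or written.
    public static int dropNormalizedKeys(Map<String, String> store,
                                         UnaryOperator<String> normalizer) {
        int removed = 0;
        Iterator<String> it = store.keySet().iterator();
        while (it.hasNext()) {
            String url = it.next();
            if (!normalizer.apply(url).equals(url)) {
                it.remove(); // the normalized URL is (re)created later from outlinks
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, String> store = new TreeMap<>();
        store.put("http://example.com/a", "page a");
        store.put("http://example.com/a;jsessionid=1", "duplicate of a");
        store.put("http://example.com/b;jsessionid=2", "only copy of b");
        int removed = dropNormalizedKeys(store, DropNormalizedKeysSketch::stripSessionId);
        System.out.println(removed + " rows removed, remaining: " + store.keySet());
    }
}
```

In this example the only copy of page b is dropped as well; it would come back only through
a refetch once a normalized outlink points to it, which is the cost mentioned above.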

We could add an additional option, -normalizeKeep or similar, for the dangerous and costly variant
which tries to avoid the refetches. But given that we avoid a lot of the complexity of the
second patch if we just do not support this, I'm inclined to leave this feature out.

I don't understand why inlinks and outlinks could get out of sync. I will have to think more
about it when I have time.

> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>
>                 Key: NUTCH-1495
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1495
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2
>            Reporter: Nathan Gass
>         Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, patch-updatedb-normalize-filter-2012-11-13.txt
>
>
> AFAIS in nutch 1.x you could change your url filters and normalizers during the crawl,
> and update the db using crawldb -normalize -filter. There does not seem to be a way to achieve
> the same in nutch 2.x?
> Anyway, I went ahead and tried to implement -normalize and -filter for the nutch 2.x
> updatedb command. I have no experience with any of the technologies used, including Java, so
> please check the attached code carefully before using it. I'm very interested to hear whether this
> is the right approach, and in any other comments.

