nutch-dev mailing list archives

From "lufeng (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1529) Port nutch-mongdb-parser to trunk
Date Fri, 01 Mar 2013 04:35:12 GMT


lufeng updated NUTCH-1529:

    Attachment: NUTCH-1529-trunk-v3.patch

@Lewis: I added the MongoDB dependency to ivy.xml.
@Tejas: It writes the URLs and other fields, such as fetchInterval, to standard output, as
DmozParser does.

Example command:
mkdir mongodb
bin/nutch mongodb:// -collection urls -fields url,score,fetchInterval -outputFieldNames ,nutch.score,nutch.fetchInterval -query url:apache -queryRegex -sortBy score > mongodb/urls

This connects to the crawldb database and reads the urls collection. The retrieved fields
are url, score, and fetchInterval; their output keys are "", nutch.score, and nutch.fetchInterval,
respectively. The query matches the url field against the regex pattern "apache", and all records
are sorted by score.

The output may look like this (one record per line, fields separated by tabs):
	nutch.score=2.0	nutch.fetchInterval=3000
	nutch.score=1.0	nutch.fetchInterval=10000
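
To make the field-to-key mapping described above concrete, here is a minimal Python sketch. It is not the code from the patch: the function names, the in-memory records, and the example.org URLs are all hypothetical, and the descending sort order is an assumption taken from the sample output (score 2.0 before 1.0). It only illustrates how an empty output key emits the bare value while a named key emits key=value.

```python
import re


def format_seed_line(record, fields, output_names):
    """Build one tab-separated line for a record (hypothetical helper).

    An empty output name emits the bare value (used here for the URL);
    any other name emits "name=value".
    """
    parts = []
    for field, name in zip(fields, output_names):
        value = record[field]
        parts.append(f"{name}={value}" if name else str(value))
    return "\t".join(parts)


def select_records(records, query_field, pattern, sort_by):
    """Keep records whose query_field matches the regex, sorted by sort_by.

    Descending order is an assumption based on the sample output.
    """
    matching = [r for r in records if re.search(pattern, str(r[query_field]))]
    return sorted(matching, key=lambda r: r[sort_by], reverse=True)


# Made-up records standing in for documents in the urls collection.
records = [
    {"url": "http://apache.example.org/a", "score": 1.0, "fetchInterval": 10000},
    {"url": "http://other.example.org/", "score": 5.0, "fetchInterval": 500},
    {"url": "http://apache.example.org/b", "score": 2.0, "fetchInterval": 3000},
]

fields = ["url", "score", "fetchInterval"]
output_names = ["", "nutch.score", "nutch.fetchInterval"]

lines = [
    format_seed_line(r, fields, output_names)
    for r in select_records(records, "url", "apache", "score")
]

for line in lines:
    print(line)
```

Each printed line starts with the bare URL followed by the tab-separated key=value pairs; the record whose URL does not match "apache" is dropped.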

Thanks Lewis and Tejas
> Port nutch-mongdb-parser to trunk
> ---------------------------------
>                 Key: NUTCH-1529
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 1.6
>            Reporter: Lewis John McGibbney
>            Assignee: lufeng
>            Priority: Minor
>             Fix For: 1.7
>         Attachments: NUTCH-1529-trunk.patch, NUTCH-1529-trunk-v2.patch, NUTCH-1529-trunk-v3.patch
> The initial repos is here [0]
> [0]
