nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (NUTCH-1252) SegmentReader -get shows wrong data
Date Mon, 08 Oct 2012 22:10:03 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-1252:
--------------------------------------

    Assignee: Sebastian Nagel
    
> SegmentReader -get shows wrong data
> -----------------------------------
>
>                 Key: NUTCH-1252
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1252
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 1.6
>
>         Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch
>
>
> The -get option of SegmentReader may show data that is not associated with the given URL.
> To reproduce:
> {code}
> % mkdir -p test_readseg/urls
> % echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds
> % nutch inject test_readseg/crawldb test_readseg/urls
> Injector: starting at 2012-01-18 09:32:25
> Injector: crawlDb: test_readseg/crawldb
> Injector: urlDir: test_readseg/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
> % nutch generate test_readseg/crawldb test_readseg/segments/
> Generator: starting at 2012-01-18 09:32:30
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: test_readseg/segments/20120118093232
> Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
> % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' -nocontent -noparse -nofetch -noparsedata -noparsetext
> SegmentReader: get 'http://nutch.apache.org/'
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jan 18 09:32:26 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 10.0
> Signature: null
> Metadata: _ngt_: 1326875550401test: AbcTest
> {code}
> The metadata and the score indicate that the CrawlDatum shown is the wrong one: it is the one associated with http://abc.test/, not the one for http://nutch.apache.org/.
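
The reasoning in the last sentence can be checked mechanically. The following is a minimal Python sketch, not part of Nutch: it parses the two seed lines from the reproduction and reports which seed URL carries the values seen in the readseg output. The helper names (parse_seeds, owner_of) are hypothetical, and treating nutch.score as an ordinary key=value field is a simplifying assumption made only for this comparison.

```python
# Seed lines as written by the `echo -e` command in the report.
SEED_TEXT = (
    "http://nutch.apache.org/\ttest=ApacheNutch\n"
    "http://abc.test/\ttest=AbcTest\tnutch.score=10.0\n"
)


def parse_seeds(text):
    """Map each seed URL to its tab-separated key=value fields (hypothetical helper)."""
    seeds = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        url, *fields = line.split("\t")
        seeds[url] = dict(field.split("=", 1) for field in fields)
    return seeds


def owner_of(seeds, observed):
    """Return the seed URLs whose fields contain every observed key/value pair."""
    return [
        url
        for url, meta in seeds.items()
        if all(meta.get(key) == value for key, value in observed.items())
    ]


seeds = parse_seeds(SEED_TEXT)

# Values copied from the readseg output in the report:
# "Score: 10.0" and "Metadata: ... test: AbcTest".
observed = {"test": "AbcTest", "nutch.score": "10.0"}

matches = owner_of(seeds, observed)
print(matches)
```

Only http://abc.test/ matches the observed values, although the query asked for http://nutch.apache.org/, whose seed line sets test=ApacheNutch and no score.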

