nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Assigned: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
Date Thu, 21 Oct 2010 11:47:17 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned NUTCH-907:
---------------------------------------

    Assignee: Andrzej Bialecki 

> DataStore API doesn't support multiple storage areas for multiple disjoint crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 2.0
>
>         Attachments: NUTCH-907.patch, NUTCH-907.v2.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data,
> linkdb, etc.) by specifying a path where the data was stored. This enabled users to run
> several disjoint crawls with different configs, but still using the same storage medium,
> just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific DataStore
> instance and a set of crawl data.
> In order to support this functionality, the Gora API should be extended so that it can
> create stores (and data tables in the underlying storage) that use arbitrary prefixes to
> identify the particular crawl dataset. Then the Nutch API should be extended to allow
> passing this "crawlId" value to select one of possibly many existing crawl datasets.
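
For illustration only, the prefix idea described above could be sketched roughly as follows. This is not the Gora or Nutch API and not the attached patch: the class CrawlStoreRegistry, its methods, the "_" separator and the "webpage" base name are all hypothetical, and plain in-memory maps stand in for real data tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one "table" per crawl, told apart by a crawlId prefix. */
final class CrawlStoreRegistry {
    // Physical table name -> table contents (an in-memory stand-in for real storage).
    private final Map<String, Map<String, String>> tables = new LinkedHashMap<>();

    /** Builds the physical table name; a null or empty crawlId keeps the base name. */
    static String tableName(String crawlId, String baseName) {
        if (crawlId == null || crawlId.isEmpty()) {
            return baseName;
        }
        // The "_" separator is an assumption made for this sketch.
        return crawlId + "_" + baseName;
    }

    /** Returns the table for the given crawl, creating it on first use. */
    Map<String, String> storeFor(String crawlId, String baseName) {
        return tables.computeIfAbsent(tableName(crawlId, baseName),
                                      name -> new LinkedHashMap<>());
    }
}
```

With this shape, storeFor("news", "webpage") and storeFor("blogs", "webpage") would return two separate tables in the same storage medium, which is the role the storage path played in Nutch 1.x.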

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.