nutch-dev mailing list archives

From "Daniel Kugel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1622) Create Outlinks with metadata
Date Tue, 06 May 2014 10:38:15 GMT


Daniel Kugel commented on NUTCH-1622:

I don't have any strong feelings about where this code should live, so feel free to move it around.

To my understanding, the content should be parsed only in the parsing phase, so if any metadata
is extracted, it should be extracted at that stage.
Are you suggesting that the DbUpdate code parse the content again?
Metadata extraction seems like a parser feature because the parser is the only component that should
read ("parse") the content, and it seems reasonable to have both metadata-aware and metadata-ignorant parsers.
When adding a metadata element, the parser is the only component that knows what type of data it is
currently parsing.
Perhaps we can add some form of hook methods or plugins so that the parsers themselves control
what to do with each element they encounter: to decide whether it is metadata and, if so, what
to do with it? I agree it seems complicated, but on the other hand, who is eligible to
parse content other than the parser?
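
As a rough illustration of the hook idea above, here is a minimal Java sketch. Every name in it (SketchOutlink, ElementHook, SketchParser, OutlinkHookSketch) is hypothetical and is not part of the Nutch API or of the attached patches; it only assumes that a parser visits elements one at a time and can let a hook decide whether an element carries metadata for the outlink.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an outlink that carries metadata (not a Nutch class).
final class SketchOutlink {
    final String toUrl;
    final Map<String, String> metadata = new LinkedHashMap<>();

    SketchOutlink(String toUrl) {
        this.toUrl = toUrl;
    }
}

// Hypothetical hook: the parser calls it for each element it encounters.
@FunctionalInterface
interface ElementHook {
    // Inspect one element and, if it is metadata, add entries to outlinkMetadata.
    void onElement(String name, Map<String, String> attributes,
                   Map<String, String> outlinkMetadata);
}

// Hypothetical parser: the only component that reads the content.
final class SketchParser {
    private final List<ElementHook> hooks;

    SketchParser(List<ElementHook> hooks) {
        this.hooks = hooks;
    }

    // Builds an outlink from a link element and lets every hook annotate it.
    SketchOutlink parseLink(String elementName, Map<String, String> attributes) {
        SketchOutlink outlink = new SketchOutlink(attributes.get("href"));
        for (ElementHook hook : hooks) {
            hook.onElement(elementName, attributes, outlink.metadata);
        }
        return outlink;
    }
}

final class OutlinkHookSketch {
    static SketchOutlink demo() {
        // A hook that treats the "rel" attribute of <a> elements as metadata.
        ElementHook relHook = (name, attrs, meta) -> {
            if ("a".equals(name) && attrs.containsKey("rel")) {
                meta.put("rel", attrs.get("rel"));
            }
        };
        SketchParser parser = new SketchParser(Collections.singletonList(relHook));

        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("href", "http://example.com/next");
        attrs.put("rel", "nofollow");
        return parser.parseLink("a", attrs);
    }

    public static void main(String[] args) {
        SketchOutlink outlink = demo();
        System.out.println(outlink.toUrl + " " + outlink.metadata);
    }
}
```

In this sketch a metadata-ignorant parser is simply one configured with an empty hook list, so the decision about what counts as metadata stays in the parsing phase.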

> Create Outlinks with metadata
> -----------------------------
>                 Key: NUTCH-1622
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.7, 2.2.1
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.8, 2.4
>         Attachments: NUTCH-1622-2.x.patch, NUTCH-1622.patch
> Having the possibility to specify metadata when creating an outlink is extremely useful
> as it allows information to be passed from a source page to the pages it links to. We use that
> routinely within our custom parsers in combination with the url-meta plugin.

This message was sent by Atlassian JIRA
