nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-1414) Date extraction parse filter
Date Mon, 18 Jul 2016 20:07:20 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382957#comment-15382957 ]

Markus Jelsma edited comment on NUTCH-1414 at 7/18/16 8:06 PM:
---------------------------------------------------------------

It operates on the parsed text, or on the extracted text if you use an extractor, so on most
pages it will either miss the date or find the wrong one. This plugin only translates 'free text'
dates to Date objects and picks the first one it finds. It does not locate the correct date of the
article, which is much more difficult and needs far more context than plain extracted text.
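
To illustrate why picking the first date goes wrong, here is a minimal sketch in Python. It is
not the plugin's code: the regex, the first_date helper and the sample text are all hypothetical,
and the pattern only recognises dates written like "18 July 2016".

```python
import re
from datetime import date

# Hypothetical month table; a real free-text date parser handles far more formats.
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Matches "<day> <month name> <year>", case-insensitively.
DATE_RE = re.compile(
    r"\b(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def first_date(text):
    """Return the first valid free-text date in text, or None if there is none."""
    for match in DATE_RE.finditer(text):
        day = int(match.group(1))
        month = MONTHS[match.group(2).lower()]
        year = int(match.group(3))
        try:
            return date(year, month, day)
        except ValueError:
            continue  # impossible date such as "31 February 2016"; try the next match
    return None


# Hypothetical page text: a related-article teaser precedes the real byline.
page_text = (
    "Related: conference recap from 3 March 2014. "
    "Published 18 July 2016 by the editor."
)

print(first_date(page_text))
```

With this sample text the helper returns the date of the related link rather than the
publication date, because nothing in the plain text says which date belongs to the article.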

Adding it to plugin.includes and index.parse.md makes it work. Use the bin/nutch indexchecker
command to check what output goes to the search engine.

If this doesn't work for you and you still need it, we can provide a custom solution that
does better text and date extraction, provides language and cookie detection and more.




> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an arbitrary page
> date (article date) from the parse text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
