nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-1414) Date extraction parse filter
Date Mon, 18 Jul 2016 20:24:20 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382987#comment-15382987 ]

Markus Jelsma edited comment on NUTCH-1414 at 7/18/16 8:23 PM:
---------------------------------------------------------------

The regex parse filter (NUTCH-2227) can extract values from HTML, but you still need to
convert them to a proper Solr date format. You could use the NUTCH-2227 code as an example
of how to find dates, and then convert them to the proper format.

edit: well, actually it is a bad example, as it only sets a flag to true. You could try
modifying this patch to look only for the HTML tags you described; that should work better.
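
As a rough illustration of the conversion step, here is a minimal Java sketch. It is not
taken from either patch. The class name, method name and the <time datetime="..."> pattern
are hypothetical, since the tags in question are not named in this comment. It assumes the
page carries an ISO-style date and that the target is the UTC form Solr date fields accept
(yyyy-MM-dd'T'HH:mm:ss'Z'). Any time of day in the attribute is dropped and midnight UTC is
used instead.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or of the NUTCH-1414 / NUTCH-2227 patches.
final class SolrDateSketch {

    // Assumed markup: <time datetime="YYYY-MM-DD..."> somewhere in the raw HTML.
    // Only the date part is captured; anything after it in the attribute is ignored.
    private static final Pattern TIME_TAG = Pattern.compile(
            "<time[^>]*\\bdatetime=\"(\\d{4}-\\d{2}-\\d{2})[^\"]*\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the first matching date as a Solr-style UTC timestamp,
    // or an empty Optional when no tag matches or the date is invalid.
    static Optional<String> extractSolrDate(String html) {
        Matcher m = TIME_TAG.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            LocalDate date = LocalDate.parse(m.group(1));
            // Midnight UTC on that day, rendered in ISO-8601 instant form.
            return Optional.of(DateTimeFormatter.ISO_INSTANT.format(
                    date.atStartOfDay(ZoneOffset.UTC).toInstant()));
        } catch (DateTimeParseException e) {
            // Matches the digit pattern but is not a real calendar date.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String html = "<p><time datetime=\"2016-07-18\">18 July 2016</time></p>";
        System.out.println(extractSolrDate(html).orElse("no date found"));
    }
}
```

In a real parse filter the resulting string would presumably be stored in the parse metadata
so that an indexing filter can pass it on to Solr; that wiring is left out here.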


was (Author: markus17):
The regex parse filter NUTCH-2227 can grab stuff from HTML. But you still need to translate
them to proper Solr date formats. You could take the code of NUTCH-2227 as an example to find
dates and then change them into the proper format.

> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an arbitrary page
> date (article date) from the parse text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
