nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1414) Date extraction parse filter
Date Mon, 18 Jul 2016 20:19:20 GMT


Markus Jelsma commented on NUTCH-1414:

The regex parse filter from NUTCH-2227 can extract values from HTML, but you still need to
convert them to a date format that Solr accepts. You could use the NUTCH-2227 code as an
example for finding dates, and then convert the matches into the proper format.
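
As a rough illustration of the "find a date, then convert it" step, here is a minimal sketch in
plain Java. It does not use the NUTCH-2227 code or any Nutch API; the class name
SolrDateSketch, the method toSolrDate, and the "18 July 2016" input pattern are all
hypothetical and chosen only for the example. It assumes the target is the ISO 8601 UTC form
that Solr date fields expect (e.g. 2016-07-18T00:00:00Z) and that a date without a time of day
can be pinned to midnight UTC.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SolrDateSketch {

    // Hypothetical pattern: matches dates written like "18 July 2016".
    private static final Pattern DATE_PATTERN =
        Pattern.compile("\\b(\\d{1,2} [A-Z][a-z]+ \\d{4})\\b");

    // Parser for the text captured by DATE_PATTERN (English month names).
    private static final DateTimeFormatter INPUT_FORMAT =
        DateTimeFormatter.ofPattern("d MMMM uuuu", Locale.ENGLISH);

    /**
     * Finds the first date-like string in the text and returns it in
     * ISO 8601 UTC form, or an empty Optional if nothing usable is found.
     */
    public static Optional<String> toSolrDate(String text) {
        Matcher matcher = DATE_PATTERN.matcher(text);
        if (!matcher.find()) {
            return Optional.empty();
        }
        try {
            LocalDate date = LocalDate.parse(matcher.group(1), INPUT_FORMAT);
            // Assumption: no time of day is available, so use midnight UTC.
            return Optional.of(
                date.atStartOfDay(ZoneOffset.UTC).toInstant().toString());
        } catch (DateTimeParseException e) {
            // The regex matched, but the text is not a real date (e.g. "18 Foo 2016").
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate("Published on 18 July 2016 by staff"));
    }
}
```

In a real parse filter the resulting string would presumably be stored in the parse metadata
so that an indexing filter can pass it on to Solr; that wiring is left out here.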

> Date extraction parse filter
> ----------------------------
>                 Key: NUTCH-1414
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
> Date extraction parse filter for Nutch to provide means to extract an arbitrary page
> date (article date) from the parse text.

This message was sent by Atlassian JIRA
