nutch-dev mailing list archives

From "Cody Amen (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1414) Date extraction parse filter
Date Mon, 18 Jul 2016 20:13:20 GMT


Cody Amen commented on NUTCH-1414:

I do have a finite set of domains that I am indexing. The problem is that most of them don't
include a "date created" meta tag. Many of them, however, have <script type="application/ld+json">
tags that include the creation date, and others have a <time> tag. My Java
programming skills are pretty novice, although I have a lot of experience with Objective-C.
Any advice on how I can extract the date from the application/ld+json block?
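
For illustration, here is a minimal standalone sketch of the extraction being asked about. It is
not the NUTCH-1414 patch and not a Nutch plugin. It assumes the pages carry schema.org-style
"dateCreated" or "datePublished" properties inside a JSON-LD script block (the registered media
type is application/ld+json, but the sketch also accepts the json+ld spelling) and falls back to
the datetime attribute of the first <time> tag. The class and method names are hypothetical, and
regular expressions stand in for a real HTML and JSON parser, so treat it as a starting point only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls a date string out of raw HTML.
public class JsonLdDateSketch {

    // Captures the body of a JSON-LD <script> block. Accepts the registered
    // media type (application/ld+json) and the json+ld misspelling.
    private static final Pattern JSON_LD_SCRIPT = Pattern.compile(
        "<script[^>]*type\\s*=\\s*[\"']application/(?:ld\\+json|json\\+ld)[\"'][^>]*>(.*?)</script>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumption: the date is a plain string under dateCreated or datePublished.
    private static final Pattern DATE_PROPERTY = Pattern.compile(
        "\"(?:dateCreated|datePublished)\"\\s*:\\s*\"([^\"]+)\"");

    // Fallback: the datetime attribute of a <time> element.
    private static final Pattern TIME_TAG = Pattern.compile(
        "<time[^>]*\\bdatetime\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the first date string found, checking JSON-LD blocks before <time> tags.
    public static Optional<String> extractDate(String html) {
        Matcher script = JSON_LD_SCRIPT.matcher(html);
        while (script.find()) {
            Matcher date = DATE_PROPERTY.matcher(script.group(1));
            if (date.find()) {
                return Optional.of(date.group(1));
            }
        }
        Matcher time = TIME_TAG.matcher(html);
        if (time.find()) {
            return Optional.of(time.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><script type=\"application/ld+json\">"
            + "{\"@type\":\"Article\",\"dateCreated\":\"2016-07-18T20:13:20Z\"}"
            + "</script></head><body></body></html>";
        System.out.println(extractDate(html).orElse("no date found"));
    }
}
```

The returned value is the raw string from the page. Turning it into a date (for example with
java.time) and wiring the logic into a parse filter plugin are left out of this sketch.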

> Date extraction parse filter
> ----------------------------
>                 Key: NUTCH-1414
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
> Date extraction parse filter for Nutch to provide means to extract an arbitrary page
> date (article date) from the parse text.

This message was sent by Atlassian JIRA
