nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction
Date Tue, 19 Jan 2016 11:17:39 GMT


Markus Jelsma updated NUTCH-1233:
    Attachment: pre-1233.txt

Two lists of extracted URLs, before and after the patch. One hyperlink is missing, but with
1233 we get lots of anchors that were previously missing.

> Rely on Tika for outlink extraction
> -----------------------------------
>                 Key: NUTCH-1233
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, NUTCH-1233-1.6-2.patch,
>                      NUTCH-1233.patch, post-1233.txt, pre-1233.txt
> Tika provides outlink extraction features that are not used in Nutch. To use them in
> Nutch, we need Tika to return the rel attribute value of each link, which it currently
> doesn't. There's a patch for Tika 1.1. Once that patch is included in Tika and we have
> upgraded to that new version, this issue can be worked on. Here's preliminary code that
> performs both Tika-based and the current outlink extraction. This also includes parts of
> the Boilerpipe code.
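
For readers unfamiliar with why the rel value matters, here is a minimal sketch of outlink extraction that keeps the rel value and anchor text of each link. It is not the Tika patch or the Nutch code attached to this issue: it uses only Python's standard html.parser, and the names OutlinkCollector, extract_outlinks and skip_nofollow are hypothetical. It assumes the rel value is wanted so that links marked rel="nofollow" can be left out of the outlink list.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkCollector(HTMLParser):
    """Collects url, anchor text and rel tokens for every <a href=...> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []     # finished links, one dict per <a> element
        self._current = None   # link whose anchor text is still being read

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href:
            self._current = {
                # Resolve relative hrefs against the page URL.
                "url": urljoin(self.base_url, href),
                "anchor": "",
                # rel may hold several space-separated tokens, or be absent.
                "rel": (attrs.get("rel") or "").lower().split(),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["anchor"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse runs of whitespace inside the anchor text.
            self._current["anchor"] = " ".join(self._current["anchor"].split())
            self.outlinks.append(self._current)
            self._current = None


def extract_outlinks(html, base_url, skip_nofollow=True):
    """Return the outlinks of a page, optionally dropping rel="nofollow" links."""
    parser = OutlinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return [
        link for link in parser.outlinks
        if not (skip_nofollow and "nofollow" in link["rel"])
    ]
```

Calling extract_outlinks(page_html, page_url) returns a list of dicts with the keys url, anchor and rel. Passing skip_nofollow=False keeps every link, so the caller can decide what to do with the rel values.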

This message was sent by Atlassian JIRA
