nutch-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction
Date Mon, 13 Aug 2012 16:46:38 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433278#comment-13433278 ]

Ken Krugler commented on NUTCH-1233:
------------------------------------

Hi Markus - two questions. First, is the current Tika (1.1) outlink extraction support sufficient?
Second, do you think whitespace trimming should happen in Tika or externally? I'm not sure,
as I guess there might be a case where somebody wants to extract exactly the same anchor text
as was in the HTML, but that seems odd.
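
To make the "externally" option concrete, here is a minimal sketch of what trimming on the
caller's side could look like. It assumes the extractor hands back the raw anchor text as a
plain String; the class and method names are hypothetical and are not part of Tika or Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an existing Nutch or Tika class.
final class AnchorTextNormalizer {

    // Matches any run of one or more whitespace characters (Unicode-aware).
    private static final Pattern WHITESPACE_RUN =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    private AnchorTextNormalizer() {
    }

    // Collapses each whitespace run to a single space, then strips the ends.
    // A null input is treated as an empty anchor.
    static String normalize(String rawAnchor) {
        if (rawAnchor == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(rawAnchor).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Anchor text as it might appear in HTML, with a line break and indentation.
        String raw = "  Read\n\t more  ";
        System.out.println("[" + normalize(raw) + "]");
    }
}
```

Doing this outside the parser would leave the raw anchor text available to anyone who does
want it exactly as written in the HTML.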
                
> Rely on Tika for outlink extraction
> -----------------------------------
>
>                 Key: NUTCH-1233
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1233
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, NUTCH-1233-1.6-2.patch
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be able to use
> them in Nutch we need Tika to return the rel attr value of each link, which it currently doesn't.
> There's a patch for Tika 1.1. If that patch is included in Tika and we upgrade to that new
> version, this issue can be worked on. Here's preliminary code that does both Tika and current
> outlink extraction. This also includes parts of the Boilerpipe code.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
