nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-817) parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1
Date Fri, 08 Mar 2013 19:42:14 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-817:
---------------------------------------

    Fix Version/s: 1.7
    
> parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-817
>                 URL: https://issues.apache.org/jira/browse/NUTCH-817
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1
>         Environment: Suse linux 11.1, java version "1.6.0_13"
>            Reporter: matthew a. grisius
>            Assignee: Julien Nioche
>             Fix For: 1.7
>
>         Attachments: sample-javadoc.html
>
>
> Submitted per Julien Nioche. I did not see where to attach a file, so I pasted it here. BTW: the Tika command line returns an empty HTML body for this file.
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
> <!--NewPage-->
> <HTML>
> <HEAD>
> <!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
> <TITLE>
> Matrix Application Development Kit
> </TITLE>
> <SCRIPT type="text/javascript">
>     targetPage = "" + window.location.search;
>     if (targetPage != "" && targetPage != "undefined")
>        targetPage = targetPage.substring(1);
>     function loadFrames() {
>         if (targetPage != "" && targetPage != "undefined")
>              top.classFrame.location = top.targetPage;
>     }
> </SCRIPT>
> <NOSCRIPT>
> </NOSCRIPT>
> </HEAD>
> <FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
> <FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
> <FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
> <FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and interfaces (except non-static nested types)">
> </FRAMESET>
> <FRAME src="overview-summary.html" name="classFrame" title="Package, class and interface descriptions" scrolling="yes">
> <NOFRAMES>
> <H2>
> Frame Alert</H2>
> <P>
> This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client.
> <BR>
> Link to<A HREF="overview-summary.html">Non-frame version.</A>
> </NOFRAMES>
> </FRAMESET>
> </HTML>
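
For reference, the outlinks a parser could be expected to find in the sample above can be sketched as follows. This is an illustrative sketch using Python's standard `html.parser`, not the code of parse-html or parse-tika; `SAMPLE`, `OutlinkCollector` and `extract_outlinks` are hypothetical names, and the sample is a reduced copy of the pasted page.

```python
from html.parser import HTMLParser

# Reduced copy of the pasted javadoc frameset page (assumption: titles and
# script removed, link-bearing tags kept as they appear in the report).
SAMPLE = """<HTML>
<FRAMESET cols="20%,80%">
<FRAMESET rows="30%,70%">
<FRAME src="overview-frame.html" name="packageListFrame">
<FRAME src="allclasses-frame.html" name="packageFrame">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame">
<NOFRAMES>
Link to<A HREF="overview-summary.html">Non-frame version.</A>
</NOFRAMES>
</FRAMESET>
</HTML>"""


class OutlinkCollector(HTMLParser):
    """Hypothetical helper: collects targets of <a href> and <frame src>."""

    # Which attribute carries the link for each tag of interest.
    LINK_ATTRS = {"a": "href", "frame": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        wanted = self.LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Skip empty values and targets that were already recorded.
            if name == wanted and value and value not in self.outlinks:
                self.outlinks.append(value)


def extract_outlinks(html):
    """Return the distinct link targets in document order."""
    collector = OutlinkCollector()
    collector.feed(html)
    collector.close()
    return collector.outlinks


if __name__ == "__main__":
    for link in extract_outlinks(SAMPLE):
        print(link)
```

On this sample the sketch yields the three frame targets (`overview-frame.html`, `allclasses-frame.html`, `overview-summary.html`). The anchor inside `NOFRAMES` points at a target that a `FRAME` already supplies, so it adds nothing new. A parser that ignores `FRAME src` and returns an empty body would find no outlinks at all, which matches the "stops at level 1" behaviour described in the issue title.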

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
