nutch-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2589) HTML redirections are not followed when using parse-tika
Date Sat, 02 Jun 2018 11:54:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499027#comment-16499027 ]

Hudson commented on NUTCH-2589:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3528 (See [https://builds.apache.org/job/Nutch-trunk/3528/])
NUTCH-2589 HTML redirections are not followed when using parse-tika - (snagel: [https://github.com/apache/nutch/commit/107b364b7f99d89cec76af8c1acc6623fe19a810])
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java
* (edit) src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRobotsMetaProcessor.java


> HTML redirections are not followed when using parse-tika
> --------------------------------------------------------
>
>                 Key: NUTCH-2589
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2589
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
>
> HTML redirections using meta tags are supported in Nutch. They work well when using parse-html to parse files. However, when using parse-tika, they are not detected.
> This is because of https://issues.apache.org/jira/browse/TIKA-2652
> Tika emits redirection meta tags as:
> {code:xml}
> <meta name="refresh" content="0; url=http://example.com"/>
> {code}
> whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in the following format:
> {code:xml}
> <meta http-equiv="refresh" content="0; url=http://example.com">
> {code}
> The bug can be reproduced with the following nutch-site.xml:
> {code:xml}
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>     <property>
>         <name>plugin.includes</name>
>         <value>protocol-http|parse-tika</value>
>     </property>
>     <property>
>         <name>http.agent.name</name>
>         <value>blah</value>
>     </property>
> </configuration>
> {code}
> and fetching this URL: http://www.google.com/policies/technologies/ads/
> The resulting status is {code}success(1,0){code}, whereas with parse-html the resulting status is {code:html}success(1,100), args[0]=https://policies.google.com/technologies/ads, args[1]=0{code}
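
The attribute mismatch described in the issue can be illustrated with a minimal Java sketch. This is not the code from the commit referenced above: the class MetaRefreshSketch and its helper methods are hypothetical, and the sketch only assumes that a refresh directive may arrive either as http-equiv="refresh" or, as Tika emits it according to TIKA-2652, as name="refresh". It uses only the JDK's DOM parser, so the sample markup has to be well-formed XML.

```java
import java.io.StringReader;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch or Tika.
class MetaRefreshSketch {

  // Matches the "url=..." part of a content value such as "0; url=http://example.com".
  private static final Pattern URL_PART =
      Pattern.compile("url\\s*=\\s*(\\S+)", Pattern.CASE_INSENSITIVE);

  /** Returns the target of the first refresh meta tag that carries a URL, if any. */
  public static Optional<String> findRefreshUrl(Document doc) {
    NodeList metas = doc.getElementsByTagName("meta");
    for (int i = 0; i < metas.getLength(); i++) {
      Element meta = (Element) metas.item(i);
      // getAttribute returns "" when the attribute is absent, so both checks are null-safe.
      boolean isRefresh =
          "refresh".equalsIgnoreCase(meta.getAttribute("http-equiv"))
              || "refresh".equalsIgnoreCase(meta.getAttribute("name"));
      if (!isRefresh) {
        continue;
      }
      Matcher m = URL_PART.matcher(meta.getAttribute("content"));
      if (m.find()) {
        return Optional.of(m.group(1));
      }
    }
    return Optional.empty();
  }

  /** Parses a well-formed XML string into a DOM document. */
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse sample markup", e);
    }
  }

  public static void main(String[] args) {
    String tikaStyle =
        "<html><head><meta name=\"refresh\" content=\"0; url=http://example.com\"/></head></html>";
    String htmlStyle =
        "<html><head><meta http-equiv=\"refresh\" content=\"0; url=http://example.com\"/></head></html>";
    System.out.println(findRefreshUrl(parse(tikaStyle)));
    System.out.println(findRefreshUrl(parse(htmlStyle)));
  }
}
```

A check that looks only at http-equiv would miss the first sample, which matches the behaviour reported for parse-tika in this issue.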



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
