nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "dibyendu ghosh (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs
Date Fri, 02 Dec 2011 10:01:41 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161520#comment-13161520
] 

dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Output of my original test with 1.4:
=======================
bash-2.00$ java TestParse direct.pdf
Converting direct.pdf to html.
All parsing attempts failed
bash-2.00$ cat hadoop.log
2011-12-02 15:39:15,356 INFO  plugin.PluginRepository - Plugins: looking in: /sp
ace/dibyendu/nutch/1.4/runtime/local/plugins
2011-12-02 15:39:15,611 INFO  plugin.PluginRepository - Plugin Auto-activation m
ode: [true]
2011-12-02 15:39:15,611 INFO  plugin.PluginRepository - Registered Plugins:
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         the nutch core e
xtension points (nutch-extensionpoints)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Basic URL Normal
izer (urlnormalizer-basic)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Html Parse Plug-
in (parse-html)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Basic Indexing F
ilter (index-basic)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Http / Https Pro
tocol Plug-in (protocol-httpclient)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         HTTP Framework (
lib-http)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Regex URL Filter
 (urlfilter-regex)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Pass-through URL
 Normalizer (urlnormalizer-pass)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Regex URL Normal
izer (urlnormalizer-regex)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Tika Parser Plug
-in (parse-tika)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         OPIC Scoring Plu
g-in (scoring-opic)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         CyberNeko HTML P
arser (lib-nekohtml)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Anchor Indexing
Filter (index-anchor)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Regex URL Filter
 Framework (lib-regex-filter)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Registered Extension-Poi
nts:
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Nutch URL Normal
izer (org.apache.nutch.net.URLNormalizer)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Nutch Protocol (
org.apache.nutch.protocol.Protocol)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Segment Me
rge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch URL Filter
 (org.apache.nutch.net.URLFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Indexing F
ilter (org.apache.nutch.indexer.IndexingFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         HTML Parse Filte
r (org.apache.nutch.parse.HtmlParseFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Content Pa
rser (org.apache.nutch.parse.Parser)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Scoring (o
rg.apache.nutch.scoring.ScoringFilter)
2011-12-02 15:39:16,794 WARN  parse.ParseUtil - Unable to successfully parse con
tent file:direct.pdf of type application/pdf
2011-12-02 15:39:16,885 WARN  parse.ParseResult - file:direct.pdf is not parsed
successfully, filtering
bash-2.00$ echo $CLASSPATH
conf:lib/nutch-1.4.jar:lib/log4j-1.2.15.jar:lib/commons-logging-1.1.1.jar:lib/ha
doop-core-0.20.2.jar:lib/oro-2.0.8.jar:lib/tika-core-0.10.jar:lib/slf4j-api-1.6.
1.jar:lib/slf4j-log4j12-1.6.1.jar:.
=======================
                
> tika parser of nutch 1.3 is failing to prcess pdfs
> --------------------------------------------------
>
>                 Key: NUTCH-1206
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1206
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>         Environment: Solaris/Linux/Windows
>            Reporter: dibyendu ghosh
>            Assignee: Chris A. Mattmann
>         Attachments: direct.pdf
>
>
> Please refer to this message: http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html.
Old parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) though it
is not able to parse acrobat 9.0 version of pdfs. nutch 1.3 does not have parse-pdf plugin
and it is not able to parse even older pdfs.
> my code (TestParse.java):
> ----------------------------
> bash-2.00$ cat TestParse.java
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.PrintStream;
> import java.util.Iterator;
> import java.util.Map;
> import java.util.Map.Entry;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.metadata.Metadata;
> import org.apache.nutch.parse.ParseResult;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseStatus;
> import org.apache.nutch.parse.ParseUtil;
> import org.apache.nutch.parse.ParseData;
> import org.apache.nutch.protocol.Content;
> import org.apache.nutch.util.NutchConfiguration;
> public class TestParse {
>     private static Configuration conf = NutchConfiguration.create();
>     public TestParse() {
>     }
>     public static void main(String[] args) {
>         String filename = args[0];
>         convert(filename);
>     }
>     public static String convert(String fileName) {
>         String newName = "abc.html";
>         try {
>             System.out.println("Converting " + fileName + " to html.");
>             if (convertToHtml(fileName, newName))
>                 return newName;
>         } catch (Exception e) {
>             (new File(newName)).delete();
>             System.out.println("General exception " + e.getMessage());
>         }
>         return null;
>     }
>     private static boolean convertToHtml(String fileName, String newName)
>         throws Exception {
>         // Read the file
>         FileInputStream in = new FileInputStream(fileName);
>         byte[] buf = new byte[in.available()];
>         in.read(buf);
>         in.close();
>         // Parse the file
>         Content content = new Content("file:" + fileName, "file:" +
> fileName,
>                                       buf, "", new Metadata(), conf);
>         ParseResult parseResult = new ParseUtil(conf).parse(content);
>         parseResult.filter();
>         if (parseResult.isEmpty()) {
>             System.out.println("All parsing attempts failed");
>             return false;
>         }
>         Iterator<Map.Entry&lt;Text,Parse>> iterator =
> parseResult.iterator();
>         if (iterator == null) {
>             System.out.println("Cannot iterate over successful parse
> results");
>             return false;
>         }
>         Parse parse = null;
>         ParseData parseData = null;
>         while (iterator.hasNext()) {
>             parse = parseResult.get((Text)iterator.next().getKey());
>             parseData = parse.getData();
>             ParseStatus status = parseData.getStatus();
>             // If Parse failed then bail
>             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
>                 System.out.println("Could not parse " + fileName + ". " +
>                             status.getMessage());
>                 return false;
>             }
>         }
>         // Start writing to newName
>         FileOutputStream fout = new FileOutputStream(newName);
>         PrintStream out = new PrintStream(fout, true, "UTF-8");
>         // Start Document
>         out.println("<html>");
>         // Start Header
>         out.println("<head>");
>         // Write Title
>         String title = parseData.getTitle();
>         if (title != null && title.trim().length() > 0) {
>             out.println("<title>" + parseData.getTitle() + "</title>");
>         }
>         // Write out Meta tags
>         Metadata metaData = parseData.getContentMeta();
>         String[] names = metaData.names();
>         for (String name : names) {
>             String[] subvalues = metaData.getValues(name);
>             String values = null;
>             for (String subvalue : subvalues) {
>                 values += subvalue;
>             }
>             if (values.length() > 0)
>                 out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
>                            name, values);
>         }
>         out.println("<meta http-equiv=\"Content-Type\"
> content=\"text/html;charset=UTF-8\"/>");
>         // End Meta tags
>         out.println("</head>"); // End Header
>         // Start Body
>         out.println("<body>");
>         out.print(parse.getText());
>         out.println("</body>"); // End Body
>         out.println("</html>"); // End Document
>         out.close(); // Close the file
>         return true;
>     }
> }
> ----------------------------
> command:
> ======
> bash-2.00$ java -classpath
> conf:runtime/local/lib/nutch-1.3.jar:runtime/local/lib/hadoop-core-0.20.2.jar:runtime/local/lib/commons-logging-api-1.0.4.jar:runtime/local/lib/tika-core-0.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/oro-2.0.8.jar:.
> TestParse direct.pdf
> ======
> output:
> _____
> Converting direct.pdf to html.
> Oct 19, 2011 5:05:19 PM org.apache.hadoop.conf.Configuration
> getConfResourceAsInputStream
> INFO: found resource tika-mimetypes.xml at
> file:/path/to/nutch/1.3/conf/tika-mimetypes.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginManifestParser
> parsePluginFolder
> INFO: Plugins: looking in: /path/to/nutch/1.3/runtime/local/plugins
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO: Plugin Auto-activation mode: [true]
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO: Registered Plugins:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   the nutch core extension points (nutch-extensionpoints)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Tika Parser Plug-in (parse-tika)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO: Registered Extension-Points:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Nutch Protocol (org.apache.nutch.protocol.Protocol)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Nutch Segment Merge Filter
> (org.apache.nutch.segment.SegmentMergeFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Nutch URL Filter (org.apache.nutch.net.URLFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Nutch Content Parser (org.apache.nutch.parse.Parser)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository
> displayStatusINFO:   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> Oct 19, 2011 5:05:20 PM org.apache.hadoop.conf.Configuration
> getConfResourceAsInputStream
> INFO: found resource parse-plugins.xml at
> file:/path/to/nutch/1.3/conf/parse-plugins.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.parse.ParserFactory matchExtensions
> INFO: The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are
> enabled via the plugin.includes system property, and all claim to support
> the content type application/pdf, but they are not mapped to it  in the
> parse-plugins.xml file
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseUtil parse
> WARNING: Unable to successfully parse content file:direct.pdf of type
> application/pdf
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseResult filter
> WARNING: file:direct.pdf is not parsed successfully, filtering
> All parsing attempts failed
> _____
> my customized nutch-site.xml:
> ~~~~~~~~~~~~~~~~~~~~
> bash-2.00$ cat conf/nutch-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
>   <property>
>     <name>plugin.folders</name>
>     <value>runtime/local/plugins</value>
>     <description>Directories where nutch plugins are located.  Each
>     element may be a relative or absolute path.  If absolute, it is used
>     as is.  If relative, it is searched for on the classpath.</description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>parse-tika</value>
>     <description>Regular expression naming plugin directory names to
>     include. Any plugin not matching this expression is excluded.
>     </description>
>   </property>
> </configuration>
> ~~~~~~~~~~~~~~~~~~~~

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message