nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-157) Problem during parsing msword document . It fetching properly but parsing is not working. Please show me the way how can i parse it
Date Sat, 15 Mar 2008 00:20:24 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578972#action_12578972 ]

Andrzej Bialecki  commented on NUTCH-157:
-----------------------------------------

This branch is in End Of Life status.

> Problem during parsing msword document . It fetching properly but parsing is not working. Please show me the way how can i parse it
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-157
>                 URL: https://issues.apache.org/jira/browse/NUTCH-157
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.7
>         Environment: windows 
>            Reporter: karamjit
>
> MS Word document not parsing.
> Error messages:
> Page from url Path in fetch ====file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
> 060301 173204 fetching  file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
> 060301 173204 Parsing [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [org.apache.nutch.parse.msword.MSWordParser@1e3cd51]
> 060301 173204 fetch of file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc failed with: java.lang.NoSuchMethodError: org.apache.poi.hpsf.SummaryInformation.getEditTime()J
> 060301 173204 Could not clean the content-type [], Reason is [org.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty]. Using its raw version...
> 060301 173204 Parsing [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [org.apache.nutch.parse.text.TextParser@b25b9d]
> 060301 173205 status: segment 20060301173203, 1 pages, 1 errors, 35840 bytes, 1000 ms
> 060301 173205 status: 1.0 pages/s, 280.0 kb/s, 35840.0 bytes/page

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

