nutch-dev mailing list archives

From Aisha <aichaso...@yahoo.com>
Subject Re: Problem parsing some MS Excel & other formats (Office 2003)
Date Fri, 20 Oct 2006 08:11:13 GMT

Hi Andrzej,

Thank you for your reply.

As I am getting a lot of exceptions, could you please have a look at them and
tell me if there is a way to solve them:

  -  Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be
handled as micrsosoft document.
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record
instance, the following exception occured: null

  - Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't
be handled as micrsosoft document. java.util.NoSuchElementException
 
  - Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't
be handled as micrsosoft document. java.io.IOException: Invalid header
signature; read 7015536635646467195, expected -2226271756974174256

  - 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: dsp
	at java.net.URL.<init>(URL.java:574)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at
org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
	at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
	at
org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
	at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84)
	at
org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
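
A note on the third error above (test.rtf): the two numbers in "Invalid header
signature" can be turned back into file bytes. The sketch below assumes they
are the first eight bytes of the file read as a signed little-endian 64-bit
integer; the helper name signature_bytes is only illustrative. Under that
assumption the "expected" value is the usual OLE2 compound document magic, and
the "read" value is plain RTF text, which would suggest that test.rtf is a
genuine RTF file handed to a parser that expects an OLE2 container.

```python
import struct

# The well-known magic bytes at the start of an OLE2 compound document
# (.doc, .xls, .ppt in the Office 97-2003 binary formats).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def signature_bytes(value):
    """Return the 8 bytes that read back as this signed little-endian long.

    Assumption: the numbers in the error message are signed 64-bit values
    decoded in little-endian byte order.
    """
    return struct.pack("<q", value)


read = signature_bytes(7015536635646467195)        # "read" value from the log
expected = signature_bytes(-2226271756974174256)   # "expected" value from the log

print(read.decode("ascii"))      # prints {\rtf1\a
print(expected == OLE2_MAGIC)    # prints True
```

If this reading is right, the file is not corrupt; it is simply not a binary
Word document, so it would need an RTF-capable parser rather than the MS one.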


In the last error, the string after "unknown protocol: " is not always dsp;
it seems to be different in each case, and I don't understand what this
string means.
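
For what it is worth, the stack trace shows the exception coming from
java.net.URL while OutlinkExtractor builds an Outlink from text found in the
Word document. The word after "unknown protocol: " would then be whatever
stood before the first colon in that piece of text, which is why it changes
from one document to another. A minimal sketch of that check follows; the
KNOWN_SCHEMES set, the check_candidate helper and the sample strings are all
hypothetical, and this is not Nutch's or the JDK's actual code.

```python
import re

# RFC 3986 scheme syntax: a letter, then letters, digits, "+", "-" or ".".
SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")

# Assumption: a small illustrative set of schemes that have a handler.
# The real set depends on the JVM and on the enabled protocol plugins.
KNOWN_SCHEMES = {"http", "https", "ftp", "file"}


def check_candidate(candidate):
    """Return None if the scheme is known, else a message in the log's style."""
    match = SCHEME.match(candidate)
    if match is None:
        return "no protocol: " + candidate
    scheme = match.group(1).lower()
    if scheme in KNOWN_SCHEMES:
        return None
    return "unknown protocol: " + scheme


# Hypothetical strings that link extraction might pick up from document text.
for text in ["http://example.com/page", "dsp://server/doc", "ref:12345"]:
    print(text, "->", check_candidate(text))
```

Read this way, the message is about one link candidate inside the document,
not about the document itself, so it is logged while parsing continues or
fails for other reasons.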

Thank you very much.

Best regards,
Aïcha 

Aisha wrote:
> Hi,
>
> I tried with the latest releases, nutch-2006-10-13.tar.gz and
> nutch-2006-10-19.tar.gz,
> but the NPE doesn't seem to be fixed. I always get the same exception
> message for a lot of documents and a lot of formats: Excel, but Word and
> PowerPoint too:
>
> 2006-10-19 16:41:09,265 WARN  parse.ParseUtil - Unable to successfully
> parse
> content file://C:/docs_a_indexer/test.doc of type application/msword
> 2006-10-19 16:41:09,265 WARN  fetcher.Fetcher - Error parsing:
> file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as
> Microsoft
> document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved
> files
> are unsupported at this time
>
> Could you please help me, because the volume of rejected documents is
> large.
>   

This failure reason means that you can't parse these files using the
lib-parsems plugins, because the files were saved in the "fast save" format,
which is not supported.

Your only option is to use some other external parser through the parse-ext
plugin.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





-- 
View this message in context: http://www.nabble.com/Problem-parsing-some-MS-Excel---other-formats-%28Office-2003%29-tf2408217.html#a6911914
Sent from the Nutch - Dev mailing list archive at Nabble.com.

