nutch-dev mailing list archives

From Aisha
Subject Re: Problem parsing some MS Excel & other formats (Office 2003)
Date Fri, 20 Oct 2006 08:11:13 GMT

Hi Andrzej,

Thank you for your reply.

I am getting a lot of exceptions. Could you please have a look at them and
tell me whether there is a way to solve them:

  -  Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be
handled as micrsosoft document.
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record
instance, the following exception occured: null

  - Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't
be handled as micrsosoft document. java.util.NoSuchElementException
  - Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't
be handled as micrsosoft document. Invalid header
signature; read 7015536635646467195, expected -2226271756974174256

  - 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks unknown protocol: dsp
	at org.apache.nutch.parse.Outlink.<init>(
	at org.apache.nutch.parse.ParseUtil.parse(
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(
	at org.apache.nutch.fetcher.Fetcher$
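
A note on the test.rtf error above: the two numbers in "Invalid header signature" look like the first eight bytes of the file read as a little-endian 64-bit integer. The sketch below decodes both values back into bytes. The class and method names are hypothetical and only for illustration; nothing here is Nutch or POI code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns the numbers from the error message back into bytes.
class SignatureDecode {

    // Lay the 64-bit value out in little-endian byte order.
    static byte[] littleEndianBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    // Space-separated hex dump of the eight bytes.
    static String hex(long value) {
        StringBuilder sb = new StringBuilder();
        for (byte b : littleEndianBytes(value)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // The same eight bytes interpreted as ASCII text.
    static String ascii(long value) {
        return new String(littleEndianBytes(value), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        long expected = -2226271756974174256L; // value from the log message
        long read = 7015536635646467195L;      // value from the log message

        System.out.println(hex(expected)); // D0 CF 11 E0 A1 B1 1A E1
        System.out.println(hex(read));     // 7B 5C 72 74 66 31 5C 61
        System.out.println(ascii(read));   // {\rtf1\a
    }
}
```

The expected bytes are the OLE2 compound-document magic number, while the bytes actually read spell the start of an ordinary RTF header. This suggests that test.rtf is a genuine RTF text file that was routed to a parser expecting the binary Office container format, not a corrupt file.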

In the last error, the string after "unknown protocol: " is not always dsp;
it seems to be different in each case, and I don't understand what this means.
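
A note on the "unknown protocol" message: its wording matches what java.net.URL reports when it is given a string whose scheme has no registered handler. The sketch below reproduces that message outside Nutch. It assumes that outlink extraction ends up constructing a java.net.URL, which the Outlink.<init> frame in the trace is consistent with but does not prove. The class name and the sample strings are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo: shows where an "unknown protocol: xxx" message can come from.
class UnknownProtocolDemo {

    // Returns "ok" if java.net.URL accepts the string, otherwise the exception message.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated in recent JDKs
    static String tryParse(String candidate) {
        try {
            new URL(candidate);
            return "ok";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("http://example.org/"));  // ok
        System.out.println(tryParse("dsp://server/share"));   // unknown protocol: dsp
        System.out.println(tryParse("xyz:something"));        // unknown protocol: xyz
    }
}
```

Under that assumption, the word after "unknown protocol: " is simply whatever precedes the first colon in a piece of document text that looked like a link, which would explain why it differs from case to case.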

Thank you very much.

Best regards,

Aisha wrote:
> Hi,
> I tried the latest releases, nutch-2006-10-13.tar.gz and
> nutch-2006-10-19.tar.gz,
> but the NPE doesn't seem to be fixed. I still get the same exception
> message for a lot of documents in a lot of formats: Excel, but also Word
> and PowerPoint:
> 2006-10-19 16:41:09,265 WARN  parse.ParseUtil - Unable to successfully
> parse
> content file://C:/docs_a_indexer/test.doc of type application/msword
> 2006-10-19 16:41:09,265 WARN  fetcher.Fetcher - Error parsing:
> file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as
> Microsoft
> document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved
> files
> are unsupported at this time
> Could you please help me? The volume of rejected documents is
> large.

This failure means that you can't parse these files using the
lib-parsems plugins, because they use a "fast save" format, which is not
supported.

Your only option is to use some other external parser through the parse-ext
plugin.
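
To illustrate the general idea of delegating to an external parser: the document bytes are piped to a command on standard input, and whatever the command prints on standard output is taken as the extracted text. The sketch below is not the parse-ext implementation. The class name is hypothetical, and `cat` is only a placeholder for a real converter that can read fast-saved Word files.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "pipe the document through an external command".
class ExternalTextExtractor {

    static String extract(byte[] content, String... command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Keep the command's error output out of the extracted text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Feed stdin from a separate thread so that a full pipe cannot
        // block us while we are reading stdout.
        Thread feeder = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(content);
            } catch (IOException e) {
                // The command may exit without reading all of its input.
            }
        });
        feeder.start();

        // Collect everything the command writes to stdout.
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = stdout.read(buffer)) != -1) {
                text.write(buffer, 0, n);
            }
        }
        feeder.join();

        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("external parser exited with status " + exit);
        }
        // Assumes the command emits UTF-8 text.
        return new String(text.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "plain text standing in for a document"
                .getBytes(StandardCharsets.UTF_8);
        // "cat" just echoes its input; substitute a real converter here.
        System.out.println(extract(sample, "cat"));
    }
}
```

This sketch has no timeout, so a command that hangs would block the caller; a real setup would need one. On Windows, `cat` would also have to be replaced by a command that exists there.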

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
