nutch-dev mailing list archives

From "m.harig" <m.ha...@gmail.com>
Subject Re: nutch file content limit
Date Fri, 06 Jun 2008 07:56:30 GMT

Is there any way to index partial content of doc/xls/rtf files? If it's not
possible, let me know.


ogjunk-nutch wrote:
> 
> I *think* you have to fetch the *full* content of MS Word docs (and PDFs
> and RTFs and ...) if you want parsers that handle those documents to be
> able to parse them.  A partial MS Word/PDF/RTF/... document is considered
> invalid/broken.  Try opening it with MS Word, for example -- it will not
> work.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
>> From: m.harig <m.harig@gmail.com>
>> To: nutch-dev@lucene.apache.org
>> Sent: Thursday, June 5, 2008 3:27:18 AM
>> Subject: Re: nutch file content limit
>> 
>> 
>> thanks
>> 
>> My situation is this: I have 100 MS Word files, each about 15 MB in size.
>> 
>> If I set file.content.limit to 5 MB, Nutch fetches the files but can't
>> parse the content. It says "Can't be handled as Microsoft document" and
>> the parse fails. How do I index partial content of those documents? Can
>> anyone help me out with this?
>> 
>> 
>> this is my error
>> 
>> Can't be handled as Microsoft document. java.io.IOException: Cannot remove
>> block[ 20839 ]; out of range
>> -- 
>> View this message in context: 
>> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> 
> 
> 
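
A rough way to see why the truncated file fails, sketched in Python. This assumes the Word files use the OLE2 compound layout with a 512-byte header followed by 512-byte blocks; both sizes are assumptions about the file format, not something stated in this thread, and the helper names are made up for illustration. Under those assumptions, block 20839 from the error above would start well past the 5 MB cut-off, so the parser is asked for data that was never fetched.

```python
# Sketch only: SECTOR_SIZE and HEADER_SIZE are assumed values for the
# OLE2 compound file layout, not figures taken from this thread.
SECTOR_SIZE = 512   # assumed size of one block, in bytes
HEADER_SIZE = 512   # assumed size of the file header, in bytes


def block_offset(block_index, sector_size=SECTOR_SIZE, header_size=HEADER_SIZE):
    """Return the byte offset at which the given block starts."""
    return header_size + block_index * sector_size


def block_is_available(block_index, bytes_fetched,
                       sector_size=SECTOR_SIZE, header_size=HEADER_SIZE):
    """Return True if the whole block lies inside the fetched bytes."""
    block_end = block_offset(block_index, sector_size, header_size) + sector_size
    return block_end <= bytes_fetched


if __name__ == "__main__":
    limit = 5 * 1024 * 1024        # file.content.limit of 5 MB
    full_size = 15 * 1024 * 1024   # approximate size of each document
    block = 20839                  # block index from the error message

    print("block starts at byte", block_offset(block))
    print("inside 5 MB of content:", block_is_available(block, limit))
    print("inside the full file:", block_is_available(block, full_size))
```

If that reading is right, the practical fix is to fetch the whole file: raise file.content.limit above the largest document size, or set it to a negative value, which as far as I know turns truncation off.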

-- 
View this message in context: http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

