lucene-java-user mailing list archives

From Ariel <isaacr...@gmail.com>
Subject Re: Why is lucene so slow indexing in nfs file system ?
Date Thu, 10 Jan 2008 13:45:29 GMT
Thanks to all of you for your answers. I am going to change a few things in my
application and run some tests.
One thing: I haven't found another good PDF-to-text converter like PDFBox. Do you
know of a faster one?
Greetings
Ariel

On Jan 9, 2008 11:08 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com>
wrote:

> Ariel,
>
> I believe PDFBox is not the fastest thing and was built more to handle all
> possible PDFs than for speed (just my impression - Ben, PDFBox's author
> might still be on this list and might comment).  Pulling data from NFS to
> index seems like a bad idea.  I hope at least the indices are local and not
> on a remote NFS...
>
> We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
> and indexing over NFS was slooooooow.
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Ariel <isaacrc82@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Wednesday, January 9, 2008 2:50:41 PM
> Subject: Why is lucene so slow indexing in nfs file system ?
>
> Hi:
> I have seen the post at
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12700.html
> and I am implementing a similar application in a distributed environment, a
> cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I
> am also using an NFS file system to access the home directory where the
> documents to be indexed reside. I would like to know how long an
> application takes to index a large amount of documents, such as 10 GB.
> I use Lucene version 2.2.0. CPU: dual Xeon 2.4 GHz, 512 MB in every
> node. LAN: 1 Gbit/s.
>
> The problem is that my application takes a long time to index all the
> documents: indexing 10 GB of PDF documents takes about 2 days (I am using
> PDFBox to convert PDF to text), which is of course far too long. Other
> applications based on Lucene, for instance IBM OmniFind, take only 5
> hours to index the same amount of PDF documents. I would like to find out
> why my application is so slow at indexing; any help is welcome.
> Do you know of other distributed applications that use Lucene to
> index large amounts of documents? How long do they take to index?
> I hope you can help me.
> Greetings
>
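
To put the figures in this thread in perspective, here is a rough sketch (not anything posted by the participants) that converts "10 GB in 2 days" and "10 GB in 5 hours" into average throughput, and optionally times a plain read of every file under a directory. The class and method names (IndexingThroughput, mbPerSecond, readAll) are made up for illustration, and the directory is whatever path is passed on the command line. Running it once against the NFS mount and once against a local copy of the same files may help separate raw read time from PDF extraction and indexing time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene or PDFBox.
public class IndexingThroughput {

    /** Average throughput in MB/s, taking 1 MB = 1024 * 1024 bytes. */
    public static double mbPerSecond(long bytes, double seconds) {
        return bytes / (1024.0 * 1024.0) / seconds;
    }

    /** Reads every regular file under dir and returns the total number of bytes read. */
    public static long readAll(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        for (Path file : files) {
            try (InputStream in = Files.newInputStream(file)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Figures quoted in the thread: 10 GB of PDFs in about 2 days vs. 5 hours.
        long tenGb = 10L * 1024 * 1024 * 1024;
        System.out.printf("2 days:  %.3f MB/s%n", mbPerSecond(tenGb, 2 * 24 * 3600));
        System.out.printf("5 hours: %.3f MB/s%n", mbPerSecond(tenGb, 5 * 3600));

        // Optional: time a raw read of the directory given as the first argument.
        if (args.length > 0) {
            long start = System.nanoTime();
            long bytes = readAll(Paths.get(args[0]));
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("raw read of %s: %.3f MB/s%n",
                    args[0], mbPerSecond(bytes, seconds));
        }
    }
}
```

The two fixed figures work out to roughly 0.06 MB/s and 0.57 MB/s, both far below what a 1 Gbit/s link can carry in principle. That suggests raw bandwidth alone is unlikely to explain the 2 days; per-file NFS latency, PDF extraction cost, or the location of the index could matter more. Note that a second run over the same files may be served from the operating system's cache and look faster than a cold read.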
