lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Why is lucene so slow indexing in nfs file system ?
Date Thu, 10 Jan 2008 19:59:11 GMT
Ariel,
 
Comments inline.


----- Original Message ----
From: Ariel <isaacrc82@gmail.com>
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 10:05:28 AM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

In a distributed enviroment the application should make an exhaustive
 use of
the network and there is not another way to access to the documents in
 a
remote repository but accessing in nfs file system.

OG: What about SAN connected over FC for example?

One thing I must clarify: I index the documents in memory, I use
RAMDirectory to do that, then when the RAMDirectory reach the limit(I
 have
put about 10 Mb) then I serialize to disk(nfs) the index to merge it
 with
the central index(the central index is in nfs file system), is that
 correct?

OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will do in-memory thing
for you.  Make good use of your RAM and use 2.3 which gives you more control over RAM use
during indexing.  Parallelizing indexing over multiple machines and merging at the end is
faster, so that's a good approach.  Also, if your boxes have multiple CPUs write your code
so that it has multiple worker threads that do indexing and feed docs to IndexWriter.addDocument(Document)
to keep the CPUs fully utilized.

OG: Oh, something faster than PDFBox?  There is (can't remember the name now... itextstream
or something like that?), though it may not be free like PDFBox.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


On Jan 10, 2008 8:45 AM, Ariel <isaacrc82@gmail.com> wrote:

> Thanks all you for yours answers, I going to change a few things in
 my
> application and make tests.
> One thing I haven't find another good pdfToText converter like pdfBox
 Do
> you know any other faster ?
> Greetings
> Thanks for yours answers
> Ariel
>
>
> On Jan 9, 2008 11:08 PM, Otis Gospodnetic
 <otis_gospodnetic@yahoo.com>
> wrote:
>
> > Ariel,
> >
> > I believe PDFBox is not the fastest thing and was built more to
 handle
> > all possible PDFs than for speed (just my impression - Ben,
 PDFBox's author
> > might still be on this list and might comment).  Pulling data from
 NFS to
> > index seems like a bad idea.  I hope at least the indices are local
 and not
> > on a remote NFS...
> >
> > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
 one)
> > and indexing overNFS was slooooooow.
> >
> > Otis
> >
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > From: Ariel <isaacrc82@gmail.com>
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > Subject: Why is lucene so slow indexing in nfs file system ?
> >
> > Hi:
> > I have seen the post in
> >
 http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12700.html
> >  and
> > I am implementing a similar application in a distributed
 enviroment, a
> > cluster of nodes only 5 nodes. The operating system I use is
> >  Linux(Centos)
> > so I am using nfs file system too to access the home directory
 where
> >  the
> > documents to be indexed reside and I would like to know how much
 time
> >  an
> > application spends to index a big amount of documents like 10 Gb ?
> > I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb
 in
> >  every
> > nodes, LAN: 1Gbits/s.
> >
> > The problem I have is that my application spends a lot of time to
 index
> >  all
> > the documents, the delay to index 10 gb of pdf documents is about 2
> >  days (to
> > convert pdf to text I am using pdfbox) that is of course a lot of
 time,
> > others applications based in lucene, for instance ibm omnifind only
> >  takes 5
> > hours to index the same amount of pdfs documents. I would like to
 find
> >  out
> > why my application has this big delay to index, any help is
 welcome.
> > Dou you know others distributed architecture application that uses
> >  lucene to
> > index big amounts of documents ? How long time it takes to index ?
> > I hope yo can help me
> > Greetings
> >
> >
> >
> >
> >
 ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message