lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Why is lucene so slow indexing in nfs file system ?
Date Fri, 11 Jan 2008 05:30:47 GMT
2.3 is in the process of being released.  Give it another week to 10 days and it will be out.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ariel <isaacrc82@gmail.com>
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 6:26:44 PM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

Thanks for yours suggestions.

I'm sorry I didn't know but I would want to know what Do you mean with
 "SAN"
and "FC"?

Another thing, I have visited the lucene home page and there is not
 released
the 2.3 version, could you tell me where is the download link ?

Thanks in advance.
Ariel

On Jan 10, 2008 2:59 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com>
wrote:

> Ariel,
>
> Comments inline.
>
>
> ----- Original Message ----
> From: Ariel <isaacrc82@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Thursday, January 10, 2008 10:05:28 AM
> Subject: Re: Why is lucene so slow indexing in nfs file system ?
>
> In a distributed enviroment the application should make an exhaustive
>  use of
> the network and there is not another way to access to the documents
 in
>  a
> remote repository but accessing in nfs file system.
>
> OG: What about SAN connected over FC for example?
>
> One thing I must clarify: I index the documents in memory, I use
> RAMDirectory to do that, then when the RAMDirectory reach the limit(I
>  have
> put about 10 Mb) then I serialize to disk(nfs) the index to merge it
>  with
> the central index(the central index is in nfs file system), is that
>  correct?
>
> OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it
 will
> do in-memory thing for you.  Make good use of your RAM and use 2.3
 which
> gives you more control over RAM use during indexing.  Parallelizing
 indexing
> over multiple machines and merging at the end is faster, so that's a
 good
> approach.  Also, if your boxes have multiple CPUs write your code so
 that it
> has multiple worker threads that do indexing and feed docs to
> IndexWriter.addDocument(Document) to keep the CPUs fully utilized.
>
> OG: Oh, something faster than PDFBox?  There is (can't remember the
 name
> now... itextstream or something like that?), though it may not be
 free like
> PDFBox.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> On Jan 10, 2008 8:45 AM, Ariel <isaacrc82@gmail.com> wrote:
>
> > Thanks all you for yours answers, I going to change a few things in
>  my
> > application and make tests.
> > One thing I haven't find another good pdfToText converter like
 pdfBox
>  Do
> > you know any other faster ?
> > Greetings
> > Thanks for yours answers
> > Ariel
> >
> >
> > On Jan 9, 2008 11:08 PM, Otis Gospodnetic
>  <otis_gospodnetic@yahoo.com>
> > wrote:
> >
> > > Ariel,
> > >
> > > I believe PDFBox is not the fastest thing and was built more to
>  handle
> > > all possible PDFs than for speed (just my impression - Ben,
>  PDFBox's author
> > > might still be on this list and might comment).  Pulling data
 from
>  NFS to
> > > index seems like a bad idea.  I hope at least the indices are
 local
>  and not
> > > on a remote NFS...
> > >
> > > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall
 which
>  one)
> > > and indexing overNFS was slooooooow.
> > >
> > > Otis
> > >
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > From: Ariel <isaacrc82@gmail.com>
> > > To: java-user@lucene.apache.org
> > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > Subject: Why is lucene so slow indexing in nfs file system ?
> > >
> > > Hi:
> > > I have seen the post in
> > >
>
  http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12700.html
> > >  and
> > > I am implementing a similar application in a distributed
>  enviroment, a
> > > cluster of nodes only 5 nodes. The operating system I use is
> > >  Linux(Centos)
> > > so I am using nfs file system too to access the home directory
>  where
> > >  the
> > > documents to be indexed reside and I would like to know how much
>  time
> > >  an
> > > application spends to index a big amount of documents like 10 Gb
 ?
> > > I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512
 Mb
>  in
> > >  every
> > > nodes, LAN: 1Gbits/s.
> > >
> > > The problem I have is that my application spends a lot of time to
>  index
> > >  all
> > > the documents, the delay to index 10 gb of pdf documents is about
 2
> > >  days (to
> > > convert pdf to text I am using pdfbox) that is of course a lot of
>  time,
> > > others applications based in lucene, for instance ibm omnifind
 only
> > >  takes 5
> > > hours to index the same amount of pdfs documents. I would like to
>  find
> > >  out
> > > why my application has this big delay to index, any help is
>  welcome.
> > > Dou you know others distributed architecture application that
 uses
> > >  lucene to
> > > index big amounts of documents ? How long time it takes to index
 ?
> > > I hope yo can help me
> > > Greetings
> > >
> > >
> > >
> > >
> > >
>
  ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message