lucene-java-user mailing list archives

From Danil ŢORIN <torin...@gmail.com>
Subject Re: Making lucene indexing multi threaded
Date Tue, 03 Sep 2013 06:35:40 GMT
Don't commit after adding each and every document.
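
A minimal sketch of that advice, assuming the same thread-pool setup as in the quoted code: each task only adds its document, and commit is called once after all tasks have finished. The names `DocSink`, `CountingSink` and `indexAll` are hypothetical stand-ins so that the sketch runs without the Lucene jars. In the real code the sink would be the shared `IndexWriter` (which is thread-safe), and the call would be `iWriter.updateDocument(term, doc)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class CommitOnceSketch {

    /** Hypothetical stand-in for the two IndexWriter calls used in the quoted code. */
    interface DocSink {
        void updateDocument(String path) throws Exception;
        void commit() throws Exception;
    }

    /** Counts calls instead of writing an index, so the sketch has no Lucene dependency. */
    static class CountingSink implements DocSink {
        final AtomicInteger updates = new AtomicInteger();
        final AtomicInteger commits = new AtomicInteger();
        @Override public void updateDocument(String path) { updates.incrementAndGet(); }
        @Override public void commit() { commits.incrementAndGet(); }
    }

    static void indexAll(List<String> paths, DocSink sink, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Object>> pending = new ArrayList<>();
            for (String path : paths) {
                // The lambda returns a value, so it is a Callable and may throw checked exceptions.
                pending.add(pool.submit(() -> {
                    sink.updateDocument(path); // add or replace the document, but do NOT commit here
                    return null;
                }));
            }
            for (Future<Object> f : pending) {
                f.get(); // wait for each task; rethrows if the task failed
            }
        } finally {
            pool.shutdown();
        }
        sink.commit(); // one commit after every document has been added
    }

    public static void main(String[] args) throws Exception {
        CountingSink sink = new CountingSink();
        indexAll(Arrays.asList("a.txt", "b.txt", "c.txt"), sink, 4);
        System.out.println(sink.updates.get() + " updates, " + sink.commits.get() + " commit(s)");
    }
}
```

If the run is long and you want durability along the way, committing every N thousand documents is a reasonable middle ground; the point is only that a commit per document is far too often.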




On Tue, Sep 3, 2013 at 7:20 AM, nischal reddy <nischal.srinivas@gmail.com> wrote:

> Hi,
>
> An update on my progress:
>
> I have made indexing multithreaded in my application using a
> ThreadPoolExecutor with a pool size of 4, but the performance gain was
> negligible. It still takes around 20 minutes to index around 30k files.
>
> Some more info on what I am doing.
>
> Method where indexing is done:
>
> private void indexAllFields(IResource resource) {
>         IFile ifile = (IFile) resource;
>         File file = resource.getLocation().toFile();
>         Document doc = new Document();
>         try {
>             doc.add(new StringField(FIELD_FILE_PATH, getIndexFilePath(resource), Store.YES));
>             doc.add(new StringField(FIELD_FILE_TYPE, ifile.getFileExtension().toLowerCase(), Store.YES));
>             //indexContents(file, doc);
>             /**
>              * Calling updateDocument will make sure that only one indexed document will be added per IFile.
>              * Because this method deletes any existing document with the given Term and adds a new document.
>              * This Fixes Sonic00039677
>              */
>             //iWriter.addDocument(doc);
>             iWriter.updateDocument(new Term(FIELD_FILE_PATH, getIndexFilePath(resource)), doc);
>             iWriter.commit();
>         } catch (FileNotFoundException e) {
>
>         } catch (IOException e) {
>
>         }
>     }
>
>
> //Runnable to schedule an indexing job
> class IndexingJob implements Runnable{
>
>         private IResource resource;
>
>         public IndexingJob(IResource resource) {
>             this.resource = resource;
>         }
>
>         @Override
>         public void run() {
>             indexAllFields(resource);
>         }
>
>     }
>
> //method to queue files to be indexed
>
> void doJob(){
>
>     ThreadPoolExecutor executor = new ThreadPoolExecutor(4, 6, Long.MAX_VALUE, TimeUnit.SECONDS, workQueue);
>     for (IResource iResource : files) {
>         addToIndexQueue(iResource,executor);
>         //updateBasedOnTimeStamp(iResource);
>     }
>     executor.shutdown();
>
>     try {
>         executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
>     } catch (InterruptedException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>     }
>
> }
>
> Even with the multithreaded approach, it is still taking very long.
>
> TIA,
> Nischal Y
>
>
>
>
> On Mon, Sep 2, 2013 at 8:07 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > Stop. Back up. Test. <G>....
> >
> > The very _first_ thing I'd do is just comment out the bit that
> > actually indexes the content. I'm guessing you have some
> > loop like:
> >
> > while (more files) {
> >   read the file
> >   transform the data
> >   create a Lucene document
> >   index the document
> > }
> >
> > Just comment out the "index the document" line and see how
> > long _that_ takes. 9 times out of 10, the bottleneck is in
> > reading and transforming the data, not in the indexing.
> > As a comparison, I can index 3-4K docs/second on my laptop.
> > This is using Solr and is the Wikipedia dump so the docs
> > are several K each.
> >
> > So, if you're going to multi-thread, you'll probably want to
> > multi-thread the acquisition of the data and feed that
> > through a separate thread that actually does the indexing.
> > You don't want multiple IndexWriters active at once.
> >
> > FWIW,
> > Erick
> >
> >
> >
> > On Mon, Sep 2, 2013 at 10:13 AM, nischal reddy <nischal.srinivas@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am thinking of making my Lucene indexing multithreaded. Can someone
> > > shed some light on the best approach for achieving this?
> > >
> > > I will give a short gist of what I am trying to do; please suggest
> > > the best way to tackle it.
> > >
> > > What am I trying to do?
> > >
> > > I am building an index for files (around 30000 files), and will later
> > > use this index to search the contents of the files. The usual sequential
> > > approach works fine but takes a huge amount of time (around 30 minutes;
> > > is this the expected time, or am I doing something wrong somewhere?).
> > >
> > > What am I thinking of doing?
> > >
> > > To improve performance, I am thinking of making my application
> > > multithreaded.
> > >
> > > Need suggestions :)
> > >
> > > Please suggest the best ways to do this. How long does Lucene normally
> > > take to index 30k files?
> > >
> > > Please also point me to some examples (or best practices for
> > > multithreading Lucene) for making my application more robust.
> > >
> > > TIA,
> > > Nischal Y
> > >
> >
>
