lucene-java-user mailing list archives

From nischal reddy <nischal.srini...@gmail.com>
Subject Re: Making lucene indexing multi threaded
Date Tue, 03 Sep 2013 04:20:55 GMT
Hi,

Some more updates on my progress.

I have made indexing multithreaded in my application using a
ThreadPoolExecutor with a pool size of 4, but the performance gain was
negligible: it still takes around 20 minutes to index around 30k files.

Some more info on what I am doing.

Method where the indexing is done:

private void indexAllFields(IResource resource) {
    IFile ifile = (IFile) resource;
    File file = resource.getLocation().toFile();
    Document doc = new Document();
    try {
        doc.add(new StringField(FIELD_FILE_PATH, getIndexFilePath(resource), Store.YES));
        doc.add(new StringField(FIELD_FILE_TYPE, ifile.getFileExtension().toLowerCase(), Store.YES));
        //indexContents(file, doc);
        /*
         * Calling updateDocument will make sure that only one indexed
         * document will be added per IFile, because this method deletes
         * any existing document with the given Term and adds a new document.
         * This fixes Sonic00039677.
         */
        //iWriter.addDocument(doc);
        iWriter.updateDocument(new Term(FIELD_FILE_PATH, getIndexFilePath(resource)), doc);
        iWriter.commit();
    } catch (FileNotFoundException e) {

    } catch (IOException e) {

    }
}


//Runnable to schedule an indexing job
class IndexingJob implements Runnable {

    private IResource resource;

    public IndexingJob(IResource resource) {
        this.resource = resource;
    }

    @Override
    public void run() {
        indexAllFields(resource);
    }

}

//method to queue files to be indexed

void doJob() {

    ThreadPoolExecutor executor = new ThreadPoolExecutor(4, 6, Long.MAX_VALUE, TimeUnit.SECONDS, workQueue);
    for (IResource iResource : files) {
        addToIndexQueue(iResource, executor);
        //updateBasedOnTimeStamp(iResource);
    }
    executor.shutdown();

    try {
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}

Even with the multithreaded approach, indexing still takes very long.
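
One thing that may be worth testing in the code above: indexAllFields calls
iWriter.commit() after every document, and a commit is generally one of the
more expensive IndexWriter operations. The sketch below only illustrates
committing once after all tasks have finished. To stay runnable with the JDK
alone it uses a hypothetical DocSink interface and CountingSink class in place
of IndexWriter; those names, the String paths, and the one-hour timeout are
assumptions and not part of the original code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: update documents from several threads, commit once at the end. */
class BatchCommitSketch {

    /** Hypothetical stand-in for the two IndexWriter calls used above. */
    interface DocSink {
        void update(String path) throws IOException;
        void commit() throws IOException;
    }

    /** Counts calls so the effect of the commit placement is visible. */
    static class CountingSink implements DocSink {
        final AtomicInteger updates = new AtomicInteger();
        final AtomicInteger commits = new AtomicInteger();

        @Override
        public void update(String path) { updates.incrementAndGet(); }

        @Override
        public void commit() { commits.incrementAndGet(); }
    }

    static void indexAll(List<String> paths, DocSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String path : paths) {
            pool.execute(() -> {
                try {
                    sink.update(path);       // no commit inside the task
                } catch (IOException e) {
                    e.printStackTrace();     // report instead of swallowing
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
            sink.commit();                   // single commit after all updates
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With three paths and four threads, CountingSink records three updates and a
single commit. Whether this change alone accounts for the 20 minutes is
something only a measurement on the real index can show.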

TIA,
Nischal Y




On Mon, Sep 2, 2013 at 8:07 PM, Erick Erickson <erickerickson@gmail.com> wrote:

> Stop. Back up. Test. <G>....
>
> The very _first_ thing I'd do is just comment out the bit that
> actually indexes the content. I'm guessing you have some
> loop like:
>
> while (more files) {
>   read the file
>   transform the data
>   create a Lucene document
>   index the document
> }
>
> Just comment out the "index the document" line and see how
> long _that_ takes. 9 times out of 10, the bottleneck is here.
> As a comparison, I can index 3-4K docs/second on my laptop.
> This is using Solr and is the Wikipedia dump so the docs
> are several K each.
>
> So, if you're going to multi-thread, you'll probably want to
> multi-thread the acquisition of the data and feed that
> through a separate thread that actually does the indexing,
> you don't want multiple IndexWriters active at once.
>
> FWIW,
> Erick
>
>
>
> On Mon, Sep 2, 2013 at 10:13 AM, nischal reddy
> <nischal.srinivas@gmail.com> wrote:
>
> > Hi,
> >
> > I am thinking of making my Lucene indexing multithreaded. Can someone
> > shed some light on the best approach for achieving this?
> >
> > I will give a short gist of what I am trying to do; please suggest the
> > best way to tackle this.
> >
> > What am I trying to do?
> >
> > I am building an index for files (around 30000 files), and will later use
> > this index to search the contents of the files. The usual sequential
> > approach works fine but takes a humongous amount of time (around 30
> > minutes; is this the expected time, or am I screwing things up somewhere?).
> >
> > What am I thinking of doing?
> >
> > To improve the performance, I am thinking of making my application
> > multithreaded.
> >
> > Need suggestions :)
> >
> > Please suggest the best ways to do this. How long does Lucene normally
> > take to index 30k files?
> >
> > Please suggest some links to examples (or best practices for
> > multithreading Lucene) for making my application more robust.
> >
> > TIA,
> > Nischal Y
> >
>
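
Erick's suggestion of timing the loop with the "index the document" step
switched off can be sketched as below. StageTimer and its parameters are
hypothetical names used only for illustration: prepare stands for reading and
transforming a file, index for handing the resulting document to Lucene.

```java
import java.util.List;
import java.util.function.Consumer;

/** Sketch: time the per-file loop with or without the indexing step. */
class StageTimer {

    /** Runs prepare on every item, and index only if doIndex is true; returns elapsed milliseconds. */
    static <T> long timeMillis(List<T> items, Consumer<T> prepare,
                               Consumer<T> index, boolean doIndex) {
        long start = System.nanoTime();
        for (T item : items) {
            prepare.accept(item);        // read + transform
            if (doIndex) {
                index.accept(item);      // the step being switched off
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Running it once with doIndex set to false and once with true gives two
numbers to compare; if the two are close, most of the time is going into
reading and transforming the files and not into Lucene.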

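Erick's suggestion of multi-threading the data acquisition and feeding it
through a single indexing thread might look roughly like the following. This
is a sketch under assumptions, using only the JDK: ReadThenIndexPipeline, the
load function, the queue capacity of 1000 and the one-hour timeout are all
hypothetical, and a plain list stands in for the index so the example runs
without Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Sketch: several reader threads feed one indexing thread through a bounded queue. */
class ReadThenIndexPipeline {

    // End-of-input marker; a distinct object compared by identity, not by value.
    private static final String POISON = new String("__END__");

    static List<String> run(List<String> paths, Function<String, String> load, int readers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
        List<String> indexed = new ArrayList<>();   // touched only by the indexer thread

        Thread indexer = new Thread(() -> {
            try {
                for (String doc = queue.take(); doc != POISON; doc = queue.take()) {
                    indexed.add(doc);               // stand-in for adding to the index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        ExecutorService pool = Executors.newFixedThreadPool(readers);
        for (String path : paths) {
            pool.execute(() -> {
                try {
                    queue.put(load.apply(path));    // acquisition runs in parallel
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
            queue.put(POISON);                      // tell the indexer to stop
            indexer.join();                         // makes 'indexed' safe to read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return indexed;
    }
}
```

The bounded queue keeps the readers from running far ahead of the indexing
thread. The order in which documents arrive at the indexer is not
deterministic.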