lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Updated: (LUCENE-2537) FSDirectory.copy() impl is unsafe
Date Thu, 22 Jul 2010 12:20:52 GMT


Shai Erera updated LUCENE-2537:


I wrote a test that compares the FileChannel API to intermediate buffer copies. The test runs
each method 3 times and reports the best time for each. It can be run w/ different file and
chunk sizes.
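
The test itself is not attached to this message, so the following is only a minimal sketch of what such a comparison could look like. The class name CopyBenchmarkSketch, the Copier interface and the method names are hypothetical, and it assumes Java 8+ (java.nio.file and method references), which is newer than the JVMs discussed here. It runs each copy method 3 times and keeps the best time, as described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not the test mentioned in the message.
final class CopyBenchmarkSketch {

  /** A copy strategy that moves src to dst using the given chunk size in bytes. */
  interface Copier {
    void copy(Path src, Path dst, int chunkSize) throws IOException;
  }

  /** Copies via FileChannel.transferFrom, one chunk per call. */
  static void channelCopy(Path src, Path dst, int chunkSize) throws IOException {
    try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
         FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
             StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      long size = in.size();
      long written = 0;
      while (written < size) {
        // transferFrom may copy fewer bytes than requested, so use its return value.
        long n = out.transferFrom(in, written, Math.min(chunkSize, size - written));
        if (n <= 0) {
          throw new IOException("no progress at offset " + written);
        }
        written += n;
      }
    }
  }

  /** Copies through an intermediate byte[] of chunkSize bytes. */
  static void bufferCopy(Path src, Path dst, int chunkSize) throws IOException {
    byte[] buffer = new byte[chunkSize];
    try (InputStream in = Files.newInputStream(src);
         OutputStream out = Files.newOutputStream(dst)) {
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    }
  }

  /** Runs the copier 3 times and returns the best (smallest) elapsed time in nanoseconds. */
  static long bestOfThree(Copier copier, Path src, Path dst, int chunkSize) throws IOException {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    long best = Long.MAX_VALUE;
    for (int run = 0; run < 3; run++) {
      long start = System.nanoTime();
      copier.copy(src, dst, chunkSize);
      best = Math.min(best, System.nanoTime() - start);
    }
    return best;
  }

  /** Usage: java CopyBenchmarkSketch <source file> <chunk size in bytes> */
  public static void main(String[] args) throws IOException {
    Path src = Paths.get(args[0]);
    int chunkSize = Integer.parseInt(args[1]);
    Path dst = Files.createTempFile("copy-bench", ".tmp");
    try {
      long channel = bestOfThree(CopyBenchmarkSketch::channelCopy, src, dst, chunkSize);
      long buffer = bestOfThree(CopyBenchmarkSketch::bufferCopy, src, dst, chunkSize);
      System.out.println("FileChannel:         " + channel / 1_000_000 + " ms");
      System.out.println("Intermediate buffer: " + buffer / 1_000_000 + " ms");
    } finally {
      Files.deleteIfExists(dst);
    }
  }
}
```

Note that a sketch like this does not control for the OS page cache, so the first run of each method may be slower than later ones; taking the best of 3 runs only partly hides that.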

Here are the results of copying a 1GB file using different chunk sizes (the chunk is used
as the intermediate buffer size as well).

Machine spec:
* Linux, 64-bit (IBM) JVM
* 2xQuad (+hyper-threading) - 16 cores overall
* 16GB RAM

||Chunk Size||FileChannel||Intermediate Buffer||Diff||

For small buffer sizes, intermediate byte[] copies are preferable. However, the FileChannel method
performs pretty much consistently, regardless of the buffer size (except for the first run),
while the byte[] approach degrades a lot as the buffer size increases.

I think, given these results, we can use the FileChannel method w/ a chunk size of 4 (or even
2) MB, to be on the safe side and not eat up too much RAM?
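
To make the proposal concrete, here is a rough sketch of a chunked FileChannel copy with a 2 MB default chunk. It is not the patch for this issue: the class name ChunkedChannelCopy, the constant and the method signature are made up for illustration, and it assumes Java 7+ (java.nio.file, try-with-resources). Treating a zero-byte transfer as an error is also an assumption of the sketch, made so the loop cannot spin forever.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not the committed FSDirectory.copy() implementation.
final class ChunkedChannelCopy {

  /** Assumed default of 2 MB, the lower of the two sizes floated above. */
  static final long DEFAULT_CHUNK_SIZE = 1L << 21;

  /** Copies source to target in chunks of at most chunkSize bytes; returns bytes copied. */
  static long copy(Path source, Path target, long chunkSize) throws IOException {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    try (FileChannel input = FileChannel.open(source, StandardOpenOption.READ);
         FileChannel output = FileChannel.open(target, StandardOpenOption.CREATE,
             StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      long numToWrite = input.size();
      long numWritten = 0;
      while (numWritten < numToWrite) {
        // Never request more than what is left, and never more than one chunk.
        long request = Math.min(chunkSize, numToWrite - numWritten);
        // transferFrom may copy fewer bytes than requested, so check the return value.
        long n = output.transferFrom(input, numWritten, request);
        if (n <= 0) {
          // Assumption: no progress means the source shrank or something else went wrong.
          throw new IOException("transferFrom made no progress at offset " + numWritten);
        }
        numWritten += n;
      }
      return numWritten;
    }
  }
}
```

A caller would use it as ChunkedChannelCopy.copy(src, dst, ChunkedChannelCopy.DEFAULT_CHUNK_SIZE). Passing the chunk size as a parameter keeps it configurable, though whether callers such as addIndexes could expose it is a separate question.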

> FSDirectory.copy() impl is unsafe
> ---------------------------------
>                 Key: LUCENE-2537
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>             Fix For: 3.1, 4.0
>         Attachments:
> There are a couple of issues with it:
> # FileChannel.transferFrom documents that it may not copy the number of bytes requested,
however we don't check the return value. So we need to fix the code to call it in a loop until
all bytes are copied.
> # When calling addIndexes() w/ very large segments (a few hundred MB in size), I ran into
the following exception (Java 1.6 -- Java 1.5's exception was cryptic):
> {code}
> Exception in thread "main" Map failed
>     at
>     at
>     at
>     at
>     at org.apache.lucene.index.IndexWriter.addIndexes(
> Caused by: java.lang.OutOfMemoryError: Map failed
>     at Method)
>     at
>     ... 7 more
> {code}
> I changed the impl to something like this:
> {code}
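> long numWritten = 0;
> long numToWrite = input.size();
> long bufSize = 1 << 26; // 64MB chunks
> while (numWritten < numToWrite) {
>   // transferFrom may copy fewer bytes than requested, so accumulate its return value
>   numWritten += output.transferFrom(input, numWritten, bufSize);
> }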
> {code}
> And the code successfully adds the indexes. This code uses chunks of 64MB; however, that
might be too large for some applications, so we definitely need a smaller one. The question
is how small it can be without affecting performance. It'd be great if we could make it
configurable, but since that API is called by other APIs, such as addIndexes, I'm not sure
it's easily controllable.
> Also, I read somewhere (can't remember now where) that on Linux the native impl is better
and does copy in chunks. So perhaps we should make a Linux-specific impl?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
