I finally got around to writing a test case to verify the numbers I presented. The following test case and results are for the lowest-level disk operations. On my machine, reading from the cache is 30%-40% faster than going to disk, even when the data is in the OS cache. Since Lucene makes extensive use of disk I/O and often reads the same data repeatedly (e.g. reading the terms), a localized user-level cache can provide significant performance benefits.
Using a 4 MB file (so I could "guarantee" the disk data would be in the OS cache as well), the test shows the following results.
Most of the CPU time is actually spent on synchronization between multiple threads. I hacked together a version of MemoryLRUCache that used a ConcurrentHashMap from JDK 1.5, and it was another 50% faster! At a minimum, if the ReadWriteLock class were modified to use the 1.5 facilities, some significant additional performance gains should be realized.
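To illustrate the ConcurrentHashMap idea, here is a minimal sketch of a block cache keyed by block number. It is not the MemoryLRUCache used in the test: the names BlockCache and Loader are hypothetical, and the sketch deliberately omits LRU eviction, so it only shows how lookups can avoid a single shared lock while still counting hits and misses.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an unbounded block cache with no eviction.
// A real LRU cache would also need to track usage order and evict old blocks.
class BlockCache {
    /** Supplies a block on a cache miss, e.g. by reading it from disk. */
    interface Loader {
        byte[] load(long blockNumber);
    }

    private final ConcurrentHashMap<Long, byte[]> blocks =
            new ConcurrentHashMap<Long, byte[]>();
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    byte[] get(long blockNumber, Loader loader) {
        Long key = Long.valueOf(blockNumber);
        byte[] cached = blocks.get(key); // readers do not contend on one global lock
        if (cached != null) {
            hits.incrementAndGet();
            return cached;
        }
        misses.incrementAndGet();
        byte[] loaded = loader.load(blockNumber);
        // Two threads may load the same block at once; putIfAbsent keeps
        // whichever copy was stored first so all callers share one array.
        byte[] earlier = blocks.putIfAbsent(key, loaded);
        return earlier != null ? earlier : loaded;
    }

    long hits() { return hits.get(); }
    long misses() { return misses.get(); }
}
```

The trade-off in this sketch is that a block may occasionally be loaded twice under contention, in exchange for never holding a lock while the loader runs.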

filesize is 4194304

non-cached time = 10578, avg = 0.010578

non-cached threaded (3 threads) time = 32094, avg = 0.010698

cached time = 6125, avg = 0.006125

cache hits 996365

cache misses 3635

cached threaded (3 threads) time = 20734, avg = 0.0069113333333333336

cache hits 3989089

cache misses 10911
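As a quick check on the figures above, the fraction of elapsed time saved by the cached runs can be computed directly from the reported totals. The class and method names below are only illustrative; the inputs are the totals from this run, and since only ratios are used the time unit does not matter.

```java
// Illustrative helper for comparing two elapsed-time totals.
class CacheSpeedup {
    /** Fraction of elapsed time saved by the cached run relative to the non-cached run. */
    static double timeSaved(double nonCachedTime, double cachedTime) {
        return 1.0 - cachedTime / nonCachedTime;
    }

    /** Fraction of lookups that were served from the cache. */
    static double hitRate(long hits, long misses) {
        return (double) hits / (hits + misses);
    }
}
```

With the numbers from this run, timeSaved(10578, 6125) is about 0.42 for the single-threaded case and timeSaved(32094, 20734) is about 0.35 for three threads.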

When using the shared test (which is more like the Lucene usage, since a single "file" is shared by multiple threads), the difference is even more dramatic with multiple threads (since the cache size is effectively reduced by the number of threads). This test also shows the value of using multiple file handles, rather than a shared file handle, when multiple threads read a single file.
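One way to give each thread its own file handle is sketched below. This is an assumption about how such a reader could look, not the code used in the test: PerThreadFileReader is a hypothetical name, and it simply opens one RandomAccessFile per thread so that the seek-then-read pair never has to be guarded by a shared lock.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: one read-only handle per thread for the same file.
class PerThreadFileReader {
    private final File file;
    // Each thread lazily opens and keeps its own handle.
    private final ThreadLocal<RandomAccessFile> handle =
            new ThreadLocal<RandomAccessFile>();

    PerThreadFileReader(File file) {
        this.file = file;
    }

    /** Reads up to buffer.length bytes starting at position; returns the byte count or -1 at EOF. */
    int read(long position, byte[] buffer) throws IOException {
        RandomAccessFile raf = handle.get();
        if (raf == null) {
            raf = new RandomAccessFile(file, "r");
            handle.set(raf);
        }
        // seek + read is safe without synchronization because no other
        // thread can move this handle's file pointer.
        raf.seek(position);
        return raf.read(buffer, 0, buffer.length);
    }

    /** Closes the calling thread's handle, if it opened one. */
    void closeForCurrentThread() throws IOException {
        RandomAccessFile raf = handle.get();
        if (raf != null) {
            raf.close();
            handle.remove();
        }
    }
}
```

With a single shared handle, the seek and the read would have to be wrapped in one synchronized block, which serializes all readers. The cost of the per-thread approach is one open file descriptor per thread, and each thread has to close its own handle.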

filesize is 4194304

non-cached time = 10594, avg = 0.010594

non-cached threaded (3 threads) time = 42110, avg = 0.014036666666666666

cached time = 6047, avg = 0.006047

cache hits 996827

cache misses 3173

cached threaded (3 threads) time = 20079, avg = 0.006693

cache hits 3995776

cache misses 4224