lucene-dev mailing list archives

From "Steven Parkes" <>
Subject RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
Date Mon, 02 Apr 2007 21:50:44 GMT
> Yes, indeed.  May not be necessary initially, but we could support
> XPath or something down the road to allow us to specify what things

> I wouldn't worry about generalizing too much to start with.  Once we
> have a couple collections then we can go that route.

My thoughts, too.

I've been looking at the Reuters stuff. It uncompresses the distribution
and then creates per-article files. I can't decide whether that's a
good idea for Wikipedia: it's big (about 10 GB uncompressed) and would mean
about 1.2M per-article files (so I've heard; unverified).

On the one hand, creating separate per-article files is "clean" in that
when you then ingest, disk I/O is the only thing affecting ingest
performance (as opposed to, say, uncompressing/parsing). On the other
hand, that's a lot of disk I/O (the dump compresses by about 5X) and a
lot of directory lookups.
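For comparison, the streaming alternative could look roughly like the
sketch below: pull articles straight out of the compressed dump with a
StAX pull parser and hand each one to a callback, never touching the
filesystem. This is only an illustration of the idea, not benchmarker
code; the class and interface names (`StreamingDumpReader`,
`ArticleSink`) are made up, and the XML shape is a simplified stand-in
for the real Wikipedia dump format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: stream articles out of a compressed XML dump,
// handing each one to a callback instead of first writing ~1.2M
// per-article files to disk.
public class StreamingDumpReader {

    interface ArticleSink {
        void accept(String title, String text);
    }

    /** Parses the gzipped XML stream, invoking sink per article; returns the article count. */
    static int stream(InputStream compressed, ArticleSink sink) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance()
                .createXMLStreamReader(new GZIPInputStream(compressed));
        int count = 0;
        String title = null;
        StringBuilder buf = new StringBuilder();
        while (xml.hasNext()) {
            switch (xml.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    buf.setLength(0);            // collect text of the current element
                    break;
                case XMLStreamConstants.CHARACTERS:
                    buf.append(xml.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("title".equals(xml.getLocalName())) {
                        title = buf.toString();
                    } else if ("text".equals(xml.getLocalName())) {
                        sink.accept(title, buf.toString());
                        count++;
                    }
                    break;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for the real 10 GB dump.
        String dump = "<mediawiki><page><title>A</title><text>alpha</text></page>"
                + "<page><title>B</title><text>beta</text></page></mediawiki>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(dump.getBytes(StandardCharsets.UTF_8));
        }
        int n = stream(new ByteArrayInputStream(bytes.toByteArray()),
                (t, text) -> System.out.println(t + ": " + text));
        System.out.println("articles=" + n);
    }
}
```

The ingest loop then competes only with decompression/parsing CPU
rather than with per-file open/read/close overhead, which is the
tradeoff in question.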

Anybody have any opinions/relevant past experience?
