lucene-dev mailing list archives

From Andrzej Bialecki
Subject Re: looking for a large test corpus for a lucene presentation
Date Wed, 07 Apr 2004 10:58:44 GMT
Matt Quail wrote:

> Hi all,
> I'm doing a presentation to my local JUG on Lucene, and I'm looking for 
> a "good" set of documents to use as a demonstration.
> Ideally it would be:
> 1) large (10,000 plus?).
> 2) contain some metadata besides "body" (like author, date, primarykey, 
> etc).
> 3) freely available.
> I was going to use the data from the previous Google programming 
> contest, but it doesn't seem to be available.
> If I can't find anything satisfactory, I'll probably:
> - generate a fake whitepages phonebook
> - grab documents from project Gutenberg
> My preference is for some "real" data, but I'm happy to generate fake 
> data if no-one has any better ideas.

how about, and specifically content.rdf.u8.gz? You 
can find a parser/converter in Nutch for this format, but it's trivial 
to do it yourself - so long as you use SAX... (unless, of course, you 
run it on Cray or something.. :-) )
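For illustration, here is a minimal sketch of the do-it-yourself SAX approach mentioned above. It is not the Nutch converter. It assumes (an assumption to check against the real file) that each record in content.rdf.u8.gz is an ExternalPage element with an "about" attribute holding the URL and d:Title / d:Description child elements. The class name RdfDumpReader, the Page type, and the read method are hypothetical names made up for this example. Records are handed to a callback one at a time, so the whole dump never has to sit in memory.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class RdfDumpReader {

    /** One record from the dump: URL plus title and description metadata. */
    public static final class Page {
        public final String url;
        public final String title;
        public final String description;

        Page(String url, String title, String description) {
            this.url = url;
            this.title = title;
            this.description = description;
        }
    }

    /**
     * Streams a gzipped RDF dump and passes each ExternalPage record to sink.
     * Element and attribute names are assumptions about the dump format.
     */
    public static void read(InputStream gzipped, Consumer<Page> sink) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private String url;                 // non-null while inside an ExternalPage
            private String title = "";
            private String description = "";
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0);              // start collecting text for this element
                if (qName.equals("ExternalPage")) {
                    url = atts.getValue("about");
                    title = "";
                    description = "";
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so append rather than assign
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (url == null) {
                    return;                     // not inside a page record
                }
                if (qName.equals("d:Title")) {
                    title = text.toString().trim();
                } else if (qName.equals("d:Description")) {
                    description = text.toString().trim();
                } else if (qName.equals("ExternalPage")) {
                    sink.accept(new Page(url, title, description));
                    url = null;
                }
            }
        };

        // The default factory is not namespace-aware, so qName keeps the "d:" prefix.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try (InputStream in = new GZIPInputStream(gzipped)) {
            factory.newSAXParser().parse(in, handler);
        }
    }
}
```

A caller would pass a FileInputStream for the downloaded file and a callback that turns each Page into a Lucene Document, with the URL, title and description as separate fields.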

Best regards,
Andrzej Bialecki

Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer
