lucene-java-user mailing list archives

From Michael McCandless <>
Subject Re: Indexing Wikipedia dumps
Date Wed, 12 Dec 2007 10:28:57 GMT

I haven't actually tried it, but I think the current code in
contrib/benchmark can very likely extract a non-English Wikipedia
dump as well.

Have a look at contrib/benchmark/conf/extractWikipedia.alg: I think
if you just change docs.file to reference your downloaded XML
file, it could just work.
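
As an untested sketch of that change: the only property named here is
docs.file, and the path below is a made-up placeholder for wherever
you saved your dump. I am assuming the rest of the .alg file can stay
as shipped.

```
# contrib/benchmark/conf/extractWikipedia.alg
# Hypothetical path: point this at your own downloaded dump.
docs.file=/path/to/your-wikipedia-dump.xml
```

If I remember the benchmark package docs correctly, you would then run
it from contrib/benchmark with something like
"ant run-task -Dtask.alg=conf/extractWikipedia.alg", but please
double-check that against the build file in your checkout.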


Otis Gospodnetic wrote:

> Hi,
> I need to index a Wikipedia dump.  I know there is code in
> contrib/benchmark for indexing *English* Wikipedia for benchmarking  
> purposes.  However, I'd like to index a non-English dump, and I  
> actually don't need it for benchmarking, I just want to end up with  
> a Lucene index.
> Any suggestions where I should start?  That is, can anything in  
> contrib/benchmark already do this, or is there anything there that  
> I should use as a starting point?  As opposed to writing my own  
> Wikipedia XML dump parser+indexer.
> Thanks,
> Otis
