lucene-java-user mailing list archives

From "Chris Lu" <>
Subject Re: Indexing Wikipedia dumps
Date Wed, 12 Dec 2007 06:55:02 GMT
For a quick Java approach, give yourself 3 minutes and try using
DBSight to access the MediaWiki database. You can simply use "select * from
mw_searchindex" as a starting point, and it will build the index for you.
However, you may need to plug in a custom analyzer for MediaWiki's
format (or maybe not).
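
If you would rather work from the XML dump directly than go through the
database, the parsing half is small. Here is a rough sketch, not tested
against a real dump, that uses the JDK's StAX reader to pull the title and
wikitext out of each page element. The class and method names (DumpPages,
Page, parse) are made up for illustration, and it assumes the usual
page/title/revision/text layout of a MediaWiki export. Nothing in it is
specific to English.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Lucene or contrib/benchmark.
public class DumpPages {

    /** One article: its <title> and the wikitext of its <text> element. */
    public static final class Page {
        public final String title;
        public final String text;

        Page(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** Streams through the dump and collects (title, text) pairs. */
    public static List<Page> parse(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader xml = factory.createXMLStreamReader(in);
        List<Page> pages = new ArrayList<>();
        String title = null;
        String text = null;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("page")) {
                    // New article: forget whatever the previous one held.
                    title = null;
                    text = null;
                } else if (name.equals("title")) {
                    title = xml.getElementText();
                } else if (name.equals("text")) {
                    text = xml.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && xml.getLocalName().equals("page")) {
                // For a full dump you would hand the page to your indexer
                // here instead of keeping everything in memory.
                pages.add(new Page(title, text == null ? "" : text));
            }
        }
        xml.close();
        return pages;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny inline sample standing in for a real dump file.
        String sample = "<mediawiki><page><title>Beograd</title>"
                + "<revision><text>'''Beograd''' je glavni grad.</text></revision>"
                + "</page></mediawiki>";
        for (Page p : parse(new StringReader(sample))) {
            System.out.println(p.title + ": " + p.text);
        }
    }
}
```

Each Page would then become one Lucene Document with a title field and a
text field, analyzed with whatever analyzer suits the dump's language.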

Chris Lu
Instant Scalable Full-Text Search On Any Database/Application
Lucene Database Search in 3 minutes:
DBSight customer (remain anonymous per request) got 2.6 Million Euro funding!

On Dec 11, 2007 9:35 PM, Otis Gospodnetic <> wrote:
> Hi,
> I need to index a Wikipedia dump. I know there is code in contrib/benchmark for indexing
> *English* Wikipedia for benchmarking purposes. However, I'd like to index a non-English dump,
> and I actually don't need it for benchmarking, I just want to end up with a Lucene index.
> Any suggestions where I should start? That is, can anything in contrib/benchmark already
> do this, or is there anything there that I should use as a starting point? As opposed to
> writing my own Wikipedia XML dump parser+indexer.
> Thanks,
> Otis
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
