lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Smith" <>
Subject Managing a large archival (and constantly changing) database
Date Thu, 06 Jul 2006 17:47:47 GMT
I've been asked to do a project which provides full-text search for a
large database of articles.  The expectation is that most of the
articles are fairly small (<2k bytes).  There will be an initial
population of around 400,000 articles.  There will then be approximately
2000 new articles added each day (they need to be added in "real time"
(within a few minutes of arrival), but will be spread out during the
day).  So, roughly another 700,000 articles each year.


I've read enough to believe that having a lucene database of several
million articles is doable.  And, adding 2000 articles per day wouldn't
seem to be that many.  My concern is the real-time nature of the
application.  I'm a bit nervous (perhaps without justification) at
simply growing one monolithic lucene database.  Should there be a crash,
the database will be unusable and I'll have to rebuild from scratch
(which, based on my experience, would be hours of time).


Some of my thoughts were:

1)     having monthly databases and using MultiSearcher to search across
them.  That way my exposure for a corrupted database is limited to this
month's database.  This would also seem to give me somewhat better
control--meaning a) if the search was generating lots of hits, I could
display the results a month at a time and not bury them with output.  It
would also spread their search CPU out better and not prevent other
individuals from doing a search.  If there were very few results, I
could sleep between each month's search and again, not lock everyone
else out from searches.

2)     Have a "this month's" searchable and an "everything else"
searchable.  At the beginning of each month, I would consolidate the
previous month's database into the "everything else" searchable.  This
would give more consistent results for relevancy ranked searches.  But,
it means that a bad search could return lots of results.


Has anyone else dealt with a similar problem?  Am I expecting too much
from Lucene running on a single machine (or should I be looking at
Hadoop?).  Any comments or links to previous discussions on this topic
would be appreciated.





  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message