lucene-java-user mailing list archives

From Chris Schilling
Subject Re: new to lucene, non standard index
Date Thu, 05 May 2011 23:00:12 GMT
Hey Mike,

My only concern is that I am replacing a large number of fields inside of a Document with
a (very large ~50e6) number of Documents.  Will I not run into the same memory issues?  Or
do I create only one doc object and reuse it?  With so many Doc/Token pairs, won't searching
the index take a lot more time?

Thanks for your help,

On May 5, 2011, at 3:11 PM, Mike Sokolov wrote:

> I think the solution I gave you will work.  The only problem is if a token appears twice
> in the same doc:
> doc1 has foo with two different sets of weights and frequencies...
> but I think you're saying that doesn't happen
> On 05/05/2011 06:09 PM, Chris Schilling wrote:
>> Hey Mike,
>> Let me clarify:
>> The tokens are not unique.  Let's say doc1 contains the token
>> foo and has the properties weight1 = 0.75, weight2 = 0.90, frequency = 10
>> Now, let's say doc2 also contains the token
>> foo with properties: weight1 = 0.8, weight2 = 0.75, frequency = 5
>> Now, I want to search for all the documents that contain foo, but I want them sorted
>> by frequency.
>> Then, I would have doc1, doc2.
>> Now, I want to search for all the documents that contain foo, but I want them sorted
>> by weight1.
>> Then, I would have doc2, doc1
>> Does that clarify?
>> On May 5, 2011, at 3:01 PM, Mike Sokolov wrote:
>>> Are the tokens unique within a document? If so, why not store a document for
>>> every doc/token pair with fields:
>>> id (doc#/token#)
>>> doc-id (doc#)
>>> token
>>> weight1
>>> weight2
>>> frequency
>>> Then search for token, sort by weight1, weight2 or frequency.
>>> If the token matches are unique within a document you will only get each document
>>> listed once.  If they aren't unique, it's not clear what you want to sort by anyway....
>>> -Mike
>>> On 05/05/2011 04:12 PM, Chris Schilling wrote:
>>>> Hi,
>>>> I am trying to figure out how to solve this problem:
>>>> I have about 500,000 files that I would like to index, but the files are
>>>> structured.  So, each file has the following layout:
>>>> doc1
>>>> token1, weight11, frequency1, weight21
>>>> token2, weight12, frequency2, weight22
>>>> .
>>>> .
>>>> .
>>>> etc for 500,000 docs.
>>>> Basically, I would like to index the tokens for each doc.  When I search
>>>> for a token, I would like to be able to return the top docs sorted by weight1, frequency,
>>>> or weight2.
>>>> So, in my naive setup, I loop through the files in the directory, then I
>>>> loop through the lines of each file.  Inside the loop over the lines of a file, I call this function:
>>>> 	// Adds one keyword and its three keyword-specific numeric fields to the given document.
>>>> 	public Document processKeywords(Document doc, String keyword, Float weight1,
>>>> 			Float weight2, Integer frequency) throws Exception {
>>>> 		doc.add(new Field("keywords", keyword, Field.Store.NO, Field.Index.ANALYZED));
>>>> 		doc.add(new NumericField(keyword+"weight1", Field.Store.YES, true).setFloatValue(weight1));
>>>> 		doc.add(new NumericField(keyword+"weight2", Field.Store.YES, true).setFloatValue(weight2));
>>>> 		doc.add(new NumericField(keyword+"frequency", Field.Store.YES, true).setFloatValue(frequency));
>>>> 		return doc;
>>>> 	}
>>>> So, for each token, I create 3 new fields each time. Notice how I am trying
>>>> to index the keyword in the "keywords" field.  For the weights and frequency, I create a new
>>>> field with a name based on the keyword.  On average, I have 100 tokens per document, so each
>>>> document will have about 300 distinct fields.
>>>> When running my program, the Lucene portion eats up tons of memory, and when
>>>> it gets to the max allotted by the JVM (I have tried allowing up to 4 GB), the program slows
>>>> to a crawl.  I assume it is spending all of its time in garbage collection due to all these
>>>> My code above seems like a very hacky way of accomplishing what I want (sorting
>>>> documents based on keyword search using different numeric fields associated with that keyword).
>>>> FYI, here is the main search code, where q is the token I am searching for
>>>> and sortby is the field I want to use to sort.  I set up a QueryParser to search for the keyword in
>>>> the "keywords" field.  Then, I can extract the stats that I indexed for the given query keyword.
>>>> 	private static final QueryParser parser = new QueryParser(Version.LUCENE_30,
>>>> 			"keywords", new StandardAnalyzer(Version.LUCENE_30));
>>>> 	public void search(String q, String sortby) throws IOException, ParseException {
>>>> 		Query query = parser.parse(q);
>>>> 		long start = System.currentTimeMillis();
>>>> 		// Sort on the keyword-specific field, e.g. "fooweight1" for q = "foo", sortby = "weight1".
>>>> 		TopDocs hits =, null, 10, new Sort(new SortField(q + sortby,
>>>> 				SortField.FLOAT, true)));
>>>> 		long end = System.currentTimeMillis();
>>>> 		System.out.println("Found " + hits.totalHits +
>>>> 				" document(s) (in " + (end - start) +
>>>> 				" milliseconds) that matched query '" +
>>>> 				q + "':");
>>>> 		for(ScoreDoc scoreDoc : hits.scoreDocs) {
>>>> 			Document doc =;
>>>> 			String hash = doc.get("hash");
>>>> 			System.out.println(hash + " " + doc.get(q + sortby) + " " + hash);
>>>> 		}
>>>> 	}
>>>> I am pretty new to Lucene, so I hope this makes sense.  I tried to pare my
>>>> problem down as much as possible.  Like I said, the main problem I am running into is that
>>>> after processing about 30000 documents, the indexing slows to a crawl and seems to spend all
>>>> of its time in the garbage collector.  I am looking for a more efficient/effective way of
>>>> solving this problem.  Code tidbits would help, but are not necessary :)
>>>> Thanks for your help,
>>>> Chris S.
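
For readers following the thread, Mike's layout (one record per doc/token pair, with fixed field names doc-id, token, weight1, weight2 and frequency) can be modelled in a few lines of plain Java. This is only a sketch of the data model and the sort, not Lucene code: the class and method names (TokenPairModel, Entry, add, search) are made up for illustration, and the in-memory list stands in for the index. In Lucene, each Entry would presumably become its own small Document, and the sort would go through a SortField on a fixed field name such as "weight1" rather than on a field named after the keyword.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory model of "one record per doc/token pair".
class TokenPairModel {

    // One record per doc/token pair, mirroring the fields listed in the thread.
    static final class Entry {
        final String docId;
        final String token;
        final float weight1;
        final float weight2;
        final int frequency;

        Entry(String docId, String token, float weight1, float weight2, int frequency) {
            this.docId = docId;
            this.token = token;
            this.weight1 = weight1;
            this.weight2 = weight2;
            this.frequency = frequency;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String docId, String token, float weight1, float weight2, int frequency) {
        entries.add(new Entry(docId, token, weight1, weight2, frequency));
    }

    // Returns the ids of the docs containing the token, largest sortBy value first.
    List<String> search(String token, String sortBy) {
        Comparator<Entry> order;
        switch (sortBy) {
            case "weight1":
                order = Comparator.comparingDouble((Entry e) -> e.weight1);
                break;
            case "weight2":
                order = Comparator.comparingDouble((Entry e) -> e.weight2);
                break;
            case "frequency":
                order = Comparator.comparingInt((Entry e) -> e.frequency);
                break;
            default:
                throw new IllegalArgumentException("unknown sort field: " + sortBy);
        }
        List<Entry> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (e.token.equals(token)) {
                hits.add(e);
            }
        }
        hits.sort(order.reversed());
        List<String> ids = new ArrayList<>();
        for (Entry e : hits) {
            ids.add(e.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        TokenPairModel model = new TokenPairModel();
        // The two example records from the thread.
        model.add("doc1", "foo", 0.75f, 0.90f, 10);
        model.add("doc2", "foo", 0.80f, 0.75f, 5);
        System.out.println(model.search("foo", "frequency")); // prints [doc1, doc2]
        System.out.println(model.search("foo", "weight1"));   // prints [doc2, doc1]
    }
}
```

With this layout the set of field names stays fixed at a handful no matter how many distinct tokens exist, whereas the keyword-prefixed scheme in the original post creates three new field names per distinct token.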
