lucene-java-user mailing list archives

From Gaurav Ranjan <>
Subject How is the term frequency calculated if I have to add a user-generated document.
Date Fri, 19 Apr 2013 06:12:03 GMT
I am a student studying the functionality of Lucene for my project work.

If I have to add a new user-generated document to Lucene in which a term
has a particular frequency, just as it would in any text file, how do I do
it? For example, say I have to add the following documents, derived from
analyzing an image:

doc1 =
{ contents field:
{"red (X15 times) blue(X10 times)"} ,
  name field:

doc2 =
{ contents field:
{"red (X10 times) blue(X18 times)"} ,
  name field:

When indexing, I want the term frequency of "red" to be 15 for doc1 and 10
for doc2.
The documents doc1 and doc2 could be indexed along with the normal text
files if only the frequencies could be set manually. I need the
frequencies to be indexed as well.
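
One workaround I am considering is to expand each count into that many
repetitions of the term before indexing, so that an analyzer that splits on
whitespace counts the occurrences itself. This is only a sketch under that
assumption: the class and method names are hypothetical, no Lucene API is
called, and the resulting string would still have to be added to the
contents field in the usual way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RepeatedTermsSketch {

    // Expands {"red": 15, "blue": 10} into "red red ... blue blue ...",
    // repeating each term as many times as its count.
    public static String expand(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(entry.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Counts for doc1 from the example above; LinkedHashMap keeps
        // the insertion order so "red" comes before "blue".
        Map<String, Integer> doc1 = new LinkedHashMap<>();
        doc1.put("red", 15);
        doc1.put("blue", 10);
        System.out.println(expand(doc1));
    }
}
```

The drawback I can see is that the term positions and the field length
would also reflect the repetitions, which may or may not matter for scoring.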

The DocDelta example given at this link (
says:

FreqFile (.frq) --> Header, <TermFreqs, SkipData> TermCount
Header --> CodecHeader
TermFreqs --> <TermFreq> DocFreq
TermFreq --> DocDelta[, Freq?]
SkipData --> <<SkipLevelLength, SkipLevel> NumSkipLevels-1, SkipLevel>
SkipLevel --> <SkipDatum> DocFreq/(SkipInterval^(Level + 1))
SkipDatum -->
DocDelta,Freq,DocSkip,PayloadLength,OffsetLength,FreqSkip,ProxSkip --> VInt
SkipChildLevelPointer --> VLong

"For example, the TermFreqs for a term which occurs once in document seven
and three times in document eleven, with frequencies indexed, would be the
following sequence of VInts:

15, 8, 3

If frequencies were omitted (FieldInfo.IndexOptions.DOCS_ONLY) it would be
this sequence of VInts instead:

7, 4"

So what should the DocDelta values be for doc1 and doc2, and how are they
computed? Please point me to any other useful links.
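
To make the question concrete, here is my understanding of the quoted
example as a small sketch. It assumes the rule from the Lucene 4.0 postings
format description (not quoted above): when frequencies are indexed, the
gap to the previous document number is shifted left by one bit, the low bit
is set if the frequency is exactly one, and otherwise the frequency follows
as a separate VInt. The class and method names are hypothetical and this is
not Lucene's own code.

```java
import java.util.ArrayList;
import java.util.List;

public class DocDeltaSketch {

    // postings: {docNumber, freq} pairs in increasing docNumber order.
    // Returns the integer values that would be written as VInts.
    public static List<Integer> encode(int[][] postings, boolean withFreqs) {
        List<Integer> out = new ArrayList<>();
        int lastDoc = 0;
        for (int[] posting : postings) {
            int delta = posting[0] - lastDoc; // gap to the previous document
            int freq = posting[1];
            lastDoc = posting[0];
            if (!withFreqs) {
                out.add(delta);               // DOCS_ONLY: the gap alone
            } else if (freq == 1) {
                out.add((delta << 1) | 1);    // odd value means freq == 1
            } else {
                out.add(delta << 1);          // even value, freq follows
                out.add(freq);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Once in document seven, three times in document eleven.
        int[][] quoted = { {7, 1}, {11, 3} };
        System.out.println(encode(quoted, true));  // [15, 8, 3]
        System.out.println(encode(quoted, false)); // [7, 4]
    }
}
```

If doc1 and doc2 happened to get document numbers 0 and 1 (an assumption,
since Lucene assigns these internally), the same rule would give 0, 15, 2,
10 for "red". Please correct me if this reading is wrong.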

