lucene-java-user mailing list archives

From Gaurav Ranjan <gaurav.ranjan.i...@gmail.com>
Subject How is the term frequency calculated if I have to add a user-generated document.
Date Fri, 19 Apr 2013 06:12:03 GMT
I am a student studying how Lucene works for my project.

If I need to add a new user-generated document to Lucene in which each term
has a particular frequency, just as it would in any text file, how do I do it?
For example, say I have to add the following documents, produced by analyzing
an image:

doc1 =
{ contents field:
{"red (X15 times) blue(X10 times)"} ,
  name field:
{"doc1"}
}

doc2 =
{ contents field:
{"red (X10 times) blue(X18 times)"} ,
  name field:
{"doc2"}
}

When these are indexed, the term frequency for "red" should be 15 for doc1
and 10 for doc2.
doc1 and doc2 could be indexed along with the normal text files if only the
frequencies could be set manually. I need the frequencies to be indexed as
well (FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS).
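
One workaround that might fit here, as a minimal sketch and not a confirmed Lucene recipe: expand each term into as many repeated tokens as its desired count, then index the resulting string as the contents field like any other text. This assumes the analyzer splits on whitespace and keeps every repeat, so the indexed term frequency equals the repeat count. The class and method names (`RepeatedTermsSketch`, `expand`) are hypothetical, and the code uses only the Java standard library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RepeatedTermsSketch {
    /**
     * Builds a whitespace-separated string in which each term appears
     * as many times as its count, in the map's iteration order.
     */
    static String expand(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // doc1 from the example above: red x15, blue x10.
        Map<String, Integer> doc1 = new LinkedHashMap<>();
        doc1.put("red", 15);
        doc1.put("blue", 10);
        // Assumption: this string would be passed as the value of the contents field.
        System.out.println(expand(doc1));
    }
}
```

Because the field is indexed with positions, each repeated token also gets its own position, so phrase and proximity queries on such a field would not be meaningful.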


The DocDelta example at this link (
http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html?is-external=true)
says:

FreqFile (.frq) --> Header, <TermFreqs, SkipData>^TermCount
Header --> CodecHeader
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta[, Freq?]
SkipData --> <<SkipLevelLength, SkipLevel>^(NumSkipLevels-1), SkipLevel> <SkipDatum>
SkipLevel --> <SkipDatum>^(DocFreq/(SkipInterval^(Level + 1)))
SkipDatum --> DocSkip,PayloadLength?,OffsetLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
DocDelta,Freq,DocSkip,PayloadLength,OffsetLength,FreqSkip,ProxSkip --> VInt
SkipChildLevelPointer --> VLong


"For example, the TermFreqs for a term which occurs once in document seven
and three times in document eleven, with frequencies indexed, would be the
following sequence of VInts:

15, 8, 3

If frequencies were omitted (FieldInfo.IndexOptions.DOCS_ONLY) it would be
this sequence of VInts instead:

7,4"
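
The quoted example can be reproduced with a short sketch. It follows the rule given on the same documentation page: when frequencies are indexed, DocDelta/2 is the gap from the previous document number, an odd DocDelta means the frequency is one, and an even DocDelta is followed by the frequency as a separate VInt. The class and method names (`TermFreqsSketch`, `encode`) are hypothetical, and the document numbers used for doc1 and doc2 (0 and 1) are an assumption, since Lucene assigns document numbers itself at indexing time.

```java
import java.util.ArrayList;
import java.util.List;

final class TermFreqsSketch {
    /**
     * Turns postings for one term, given as {docNumber, freq} pairs sorted by
     * document number, into the sequence of VInt values described above.
     */
    static List<Integer> encode(int[][] postings, boolean withFreqs) {
        List<Integer> out = new ArrayList<>();
        int lastDoc = 0; // the first delta is taken relative to zero
        for (int[] p : postings) {
            int delta = p[0] - lastDoc;
            lastDoc = p[0];
            if (!withFreqs) {
                out.add(delta);            // DOCS_ONLY: plain gap
            } else if (p[1] == 1) {
                out.add(delta * 2 + 1);    // odd value: frequency is one
            } else {
                out.add(delta * 2);        // even value: frequency follows
                out.add(p[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] quoted = {{7, 1}, {11, 3}};
        System.out.println(encode(quoted, true));  // [15, 8, 3]
        System.out.println(encode(quoted, false)); // [7, 4]

        // Assumed document numbers: doc1 -> 0, doc2 -> 1.
        System.out.println(encode(new int[][] {{0, 15}, {1, 10}}, true)); // "red":  [0, 15, 2, 10]
        System.out.println(encode(new int[][] {{0, 10}, {1, 18}}, true)); // "blue": [0, 10, 2, 18]
    }
}
```

These values are written by the codec during indexing, so they follow from the document numbers and frequencies and are not something an application sets directly.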

So what should the DocDelta values be for doc1 and doc2, and how are they
derived? Please point me to any other useful links.

Thanks.
