lucene-dev mailing list archives

From Peter Carlson <>
Subject Re: Normalization of Documents
Date Wed, 10 Apr 2002 15:17:19 GMT
I have noticed the same issue.

From what I understand, this is both the way it should work and a problem.
Shorter documents which contain a given term should be more relevant, because
more of the document is about that term (i.e., the term makes up a greater
percentage of the document). However, when documents are of completely different
sizes (e.g., 20 words vs. 2000 words), this assumption doesn't hold up very well.
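To make the effect concrete: if I remember right, Lucene's default length normalization is roughly 1/sqrt(numTerms) per field, so the norm alone gives a 20-word document a tenfold edge over a 2000-word one. The snippet below is just my sketch of that formula (the class name LengthNormDemo is mine, not part of Lucene), not the actual Similarity code:

```java
// Sketch of Lucene's (assumed) default length norm, 1/sqrt(numTerms),
// to show why a 20-word document outscores a 2000-word one for the
// same matching term, all else being equal.
public class LengthNormDemo {

    // Assumed default: norm = 1 / sqrt(number of terms in the field)
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        float shortDoc = lengthNorm(20);    // roughly 0.224
        float longDoc  = lengthNorm(2000);  // roughly 0.022
        System.out.println("20-word norm:   " + shortDoc);
        System.out.println("2000-word norm: " + longDoc);
        // The short document's norm is ~10x the long document's.
        System.out.println("ratio: " + (shortDoc / longDoc));
    }
}
```

If that is indeed the formula in use, the cure for the original poster's problem would be to plug in a flatter function of document length, which is what a pluggable normalization hook would allow.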

One solution I've heard of is to extract the concepts of the documents, then
search on those. Concepts are still difficult to extract when a document is
very short, but it may provide a way to standardize documents. I have been
lazily looking for an open source or academic concept extractor, but I haven't
found one. There are companies like Semio and ActiveNavigation which provide
this service for a fee.

Let me know if you find anything or have other ideas.


On 4/9/02 10:07 PM, "Melissa Mifsud" <> wrote:

> Hi,
> Documents which are shorter in length always seem to score higher in Lucene. I
> was under the impression that the normalization factors in the scoring
> function used by Lucene would improve this, however, after a couple of
> experiments, the short documents still always score the highest.
> Does anyone have any ideas as to how it is possible to make lengthier
> documents score higher?
> Also, I would like a way to boost documents according to the amount of
> in-links this document has.
> Has anyone implemented a type of Document.setBoost() method?
> I found a thread in the lucene-dev mailing list where Doug Cutting mentions
> that this would be a great feature to add to Lucene. No one followed up on
> his email.
> Melissa.
