Herb,
Anyone game ... ?
No takers? I would be very interested, but it may be beyond what can be
posted to a mailing list. I'd be equally interested in any references you
may have.
As we are on this subject, how do LSI and the similar CNG (context
network graph) fit into the model used by Lucene? Could Lucene be
massaged to implement different mathematical models of search and
retrieval? If so, how modular are the core functions?
Adam Saltiel
> Original Message
> From: Chong, Herb [mailto:HChong3@bloomberg.com]
> Sent: Thursday, December 04, 2003 1:53 PM
> To: Lucene Users List
> Subject: RE: Probabilistic Model in Lucene  possible?
>
> not all tf/idf variants are probabilistic models, but a great many are if
> the term weights are probabilities. if we just take straight, unmodified
> Term Frequency in a document, Inverse Document Frequency in the corpus,
> and the Term Frequency in the query as 1, you are in fact comparing the
> statistical properties of the query against the statistical properties of
> the document. they are probabilities you are comparing. i can't think of
> many papers that come right out and say it, but if you look at an
> individual term weight and can interpret it as a genuine probability, the
> vector space model based on the weights is a probabilistic model. the
> derivation is relatively straightforward to show, if you have the right
> general model to start with. once you start throwing in ad hoc
> normalizations, things get out of whack and it's no longer a
> probabilistic model.
>
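A minimal sketch of the setup Herb describes above: raw term frequency in the document, inverse document frequency in the corpus, and a query term frequency of 1. The toy corpus, the function names, and the idf formula log(N/df) are assumptions made for illustration; this is not Lucene's scoring code and not Herb's implementation.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, for illustration only).
DOCS = {
    "d1": "lucene ranks documents with term weights".split(),
    "d2": "probabilistic models assign term probabilities".split(),
    "d3": "lucene lucene search engine".split(),
}

def idf(term, docs):
    """Inverse document frequency, assumed here to be log(N / df)."""
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df) if df else 0.0

def score(query_terms, doc_words, docs):
    """Dot product of query and document weights; query tf is fixed at 1."""
    tf = Counter(doc_words)
    return sum(1 * tf[t] * idf(t, docs) for t in set(query_terms))

def rank(query, docs):
    """Return (document name, score) pairs, best match first."""
    terms = query.split()
    scored = {name: score(terms, words, docs) for name, words in docs.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

Calling rank("lucene search", DOCS) puts d3 first. Note that nothing bounds these raw scores by 1, which is where the normalization question discussed later in the thread comes from.
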
> the implementations that i have done were with a former company, and that
> means secret and protected by various intellectual property rights.
> however, i can sketch here the general approach one has to take and an
> outline of the derivation that unifies probabilistic models with vector
> space models and at the same time incorporates pairwise interterm
> correlation. in fact, the pairwise interterm correlations are a
> fundamental assumption. once you do all this, you can show that the
> traditional vector space model is a special case of a pairwise interterm
> correlation model. for those that are interested in advanced matrix
> algebra and some basic statistics, it should be very interesting. if only
> i had a published paper, i would post it. unfortunately, what i have is
> very obtuse because it's protected. the only paper that started out was
> submitted to SIGIR but rejected by all but one referee. that one thought
> this was a tremendous unification of the two methods, but academic
> journals being what they are, when 4 out of 5 referees can't understand
> the paper, it doesn't get published. i may brush it off and enlarge it
> into a much longer paper for the Journal of IR, but once again, unless you
> are comfortable with probability theory and matrix theory, you are not
> going to follow it.
>
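Herb's derivation is unpublished, so the following is only a generic illustration of the "special case" claim above, not his model. It assumes a score of the bilinear form q^T C d, where C is a pairwise term correlation matrix; with C set to the identity (no interterm correlation) the score reduces to the ordinary vector space dot product. The vocabulary, the vectors, and the correlation value 0.8 are all made up.

```python
def correlated_score(query, doc, corr):
    """Bilinear form q^T C d over a shared term vocabulary."""
    n = len(query)
    return sum(query[i] * corr[i][j] * doc[j]
               for i in range(n) for j in range(n))

def dot(query, doc):
    """Plain vector space score: the dot product of the two vectors."""
    return sum(q * d for q, d in zip(query, doc))

# Vocabulary order (hypothetical): ["car", "automobile", "banana"]
query = [1.0, 0.0, 0.0]   # query mentions only "car"
doc = [0.0, 2.0, 1.0]     # document: "automobile" twice, "banana" once

# No interterm correlation: every term is related only to itself.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Assumed correlation of 0.8 between "car" and "automobile".
related = [[1.0, 0.8, 0.0],
           [0.8, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
```

With the identity matrix, correlated_score(query, doc, identity) equals dot(query, doc), and the document scores zero because it never contains "car". With the assumed correlation, the same document receives a nonzero score through "automobile".
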
> so, who is game for a tutorial on the derivation?
>
> Herb...
>
> Original Message
> From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
> Sent: Thursday, December 04, 2003 5:09 AM
> To: Lucene Users List
> Subject: AW: Probabilistic Model in Lucene  possible?
>
>
>
> Hi Herb,
>
> thank you for your insights.
>
> >>
> but by most accepted definitions, the tf/idf model in Lucene is a
> probabilistic model.
> >>
>
> Can you send some pointers to help me understand that? Are all TF/IDF
> variants probabilistic models? If so, what makes any model a
> nonprobabilistic one? If you claim that TF/IDF is probabilistic, then the
> plain cosine (an extreme form of TF/IDF, with IDF for all terms being
> considered constant) of VSM would also be a probabilistic model.
>
> >>
> it's got strange normalizations, though, that don't allow comparisons of
> rank values across queries.
> >>
>
> Lucene's internal ranking sometimes returns values > 1.0; these are then
> normalized to 1.0, adjusting other rankings accordingly. While I have
> nothing to say against this - it's a hack, but useful - it makes comparing
> the rank values across queries really difficult. It's like using different
> scales whenever you measure something different, and then not telling
> anyone about it.
>
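The effect Karsten describes above can be sketched as follows, assuming the rescaling divides every score by the top score whenever that top score exceeds 1.0. The function name and the raw scores are hypothetical; this illustrates the behaviour described in the message and is not a copy of Lucene's code.

```python
def normalize_hits(raw_scores):
    """Rescale by the top score, but only when the top score exceeds 1.0."""
    if not raw_scores:
        return []
    top = max(raw_scores)
    norm = 1.0 / top if top > 1.0 else 1.0
    return [s * norm for s in raw_scores]

# Two hypothetical queries whose raw scores live on different scales.
query_a = [4.0, 2.0, 1.0]
query_b = [0.8, 0.4, 0.2]

norm_a = normalize_hits(query_a)  # rescaled: the top hit becomes 1.0
norm_b = normalize_hits(query_b)  # left untouched: the top hit stays 0.8
```

Query A is divided by 4.0 while query B is left alone, so a displayed 0.5 for A and a displayed 0.4 for B were produced on different scales, and the scale factor is not reported alongside the results.
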
> >>
> it isn't terribly hard to make a normalized probabilistic model that
> allows comparing of document scores across queries and assign a meaning to
> the score. i've done it.
> >>
>
> Stop bragging, send us your Similarity implementation :)
>
> Regards,
>
> Karsten
>
>
> Original Message
> From: Chong, Herb [mailto:HChong3@bloomberg.com]
> Sent: Wednesday, December 3, 2003 11:01 PM
> To: Lucene Users List
> Subject: RE: Probabilistic Model in Lucene  possible?
>
>
> i think i am missing the original question, but by most accepted
> definitions, the tf/idf model in Lucene is a probabilistic model. it's got
> strange normalizations, though, that don't allow comparisons of rank
> values across queries.
>
> it isn't terribly hard to make a normalized probabilistic model that
> allows comparing of document scores across queries and assign a meaning to
> the score. i've done it. however, that means abandoning idf and keeping
> actual term frequencies for each document and document size. once you
> normalize this way, you can intermingle document scores from different
> queries and different corpora and make statements about the absolute value
> of the score. it also leads directly into the discussion we had earlier
> about interterm correlations and how to handle them properly, since the
> full interterm probabilistic model has as a special case the traditional
> tf/idf model. interjecting Boolean conditions and boost makes the model
> much more complicated.
>
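Herb does not give his formula, so this is only one possible reading of "abandoning idf and keeping actual term frequencies for each document and document size". It assumes each term weight is the relative frequency tf / document length, and that a document's score is the mean of those weights over the distinct query terms. Both function names and the averaging step are assumptions.

```python
from collections import Counter

def term_probability(term, doc_words):
    """Relative frequency of term in the document: tf / document length."""
    if not doc_words:
        return 0.0
    return Counter(doc_words)[term] / len(doc_words)

def match_probability(query_terms, doc_words):
    """Mean of P(term | doc) over the distinct query terms.

    Interpretable as the chance that a word drawn at random from the
    document equals a query term drawn at random; always within [0, 1].
    """
    terms = set(query_terms)
    if not terms:
        return 0.0
    return sum(term_probability(t, doc_words) for t in terms) / len(terms)
```

Because the score depends only on the document's own counts and length, it stays on the same 0-to-1 scale regardless of which query or corpus produced it, which is the property the paragraph above asks for.
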
> Herb....
>
> Original Message
> From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
> Sent: Wednesday, December 03, 2003 4:51 PM
> To: Lucene Users List
> Subject: AW: Probabilistic Model in Lucene  possible?
>
> >>
> I would highly appreciate it if the experts here (especially Karsten or
> Chong) look at my idea and tell me if this would be possible.
> >>
>
> Sorry, I have no idea about how to use a probabilistic approach with
> Lucene, but if anyone does so, I would like to know, too.
>
> I am currently puzzled by a related question: I would like to know if
> there are any approaches to get a confidence value for relevance
> rather than a ranking. I.e., it would be nice to have a ranking
> weight whose value has some kind of semantics such that we could
> compare results from different queries. Can probabilistic approaches
> do anything like this?
>
> 
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
