mahout-user mailing list archives

From Ted Dunning
Subject Re: Centroid calculations with sparse vectors
Date Thu, 28 May 2009 07:18:45 GMT
You are exactly correct that this is an important distinction.  But that
wasn't really the problem.  In the problem as stated, the numbers were word
counts which are quite plausibly zero in the absence of an observation.  For
rating data, there is a very different situation.

Even with counts where no data == 0, there is some question about the proper
way to handle the zeros.  My own feeling is that the best interpretation is
"not observed yet".  If you use a proper probabilistic interpretation, then
a zero after many observations is a real constraint, while a zero after very
few observations means little.  Again, with ratings data, the problem is
different.  There it
is entirely appropriate to treat no data as something that gives us little
or no information unless the ratings are chosen by the user in which case no
rating is actually (slightly) informative.
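
As a rough illustration of the "many observations versus few" point for
counts, here is a minimal sketch, assuming each observation is an
independent draw in which the word appears with some fixed probability p.
The function name and the 95% level are hypothetical choices made for this
example, not anything from Mahout.  It computes the largest p still
consistent with seeing zero occurrences in n observations.

```python
def zero_count_upper_bound(n_observations, confidence=0.95):
    """Upper bound on the per-observation probability p of a word that
    appeared 0 times in n independent observations.

    Solves (1 - p) ** n = 1 - confidence for p.  This assumes simple
    independent draws; it is an illustration, not a full model.
    """
    if n_observations <= 0:
        # No observations at all: a zero carries no information.
        return 1.0
    return 1.0 - (1.0 - confidence) ** (1.0 / n_observations)


if __name__ == "__main__":
    for n in (5, 100, 10000):
        print(n, zero_count_upper_bound(n))
```

With only a handful of observations the bound stays loose, so the zero
means little; with many observations it shrinks toward roughly 3/n, which
is the sense in which a zero becomes a constraint.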

How you combine both forms of data is another question entirely.
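
To make the counts-versus-ratings difference concrete for a centroid, here
is a small sketch.  It assumes sparse vectors are plain Python dicts in
which an absent key means "no value"; both function names are hypothetical
and this is not the Mahout vector API.  The first function treats a missing
entry as 0, which is plausible for word counts; the second averages only
over the vectors that actually have the entry, which is closer to what you
want for ratings.

```python
def centroid_zero_filled(vectors, keys):
    """Centroid where a missing entry counts as 0 (word-count style)."""
    n = len(vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}


def centroid_observed_only(vectors, keys):
    """Centroid where a missing entry is ignored (ratings style).

    Keys that no vector has a value for are left out of the result.
    """
    result = {}
    for k in keys:
        observed = [v[k] for v in vectors if k in v]
        if observed:
            result[k] = sum(observed) / len(observed)
    return result


if __name__ == "__main__":
    # Made-up data purely for illustration.
    ratings = [{"a": 4.0, "b": 2.0}, {"a": 5.0}, {"b": 4.0}]
    keys = ["a", "b"]
    print(centroid_zero_filled(ratings, keys))
    print(centroid_observed_only(ratings, keys))
```

On ratings, the zero-filled version drags every mean toward 0 simply
because some users did not rate an item, while the observed-only version
does not.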

On Thu, May 28, 2009 at 12:07 AM, Sean Owen wrote:

> So there is no way to distinguish between 0 and "no value" --
> but conceptually those two are quite different things. Am I right
> about this? I think the API would have to change then. I tripped on a
> very similar problem early on in implementing cosine-measure / Pearson
> correlation, for exactly the same reason. (Or perhaps I am just
> projecting my past problem/solution here.)
