I found my problem. Based on the TF-IDF article on Wikipedia, I assumed that log base 10 is used, but as I found in this discussion, in Scala it is actually ln (the natural logarithm).
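
To make the difference concrete, here is a minimal sketch in plain Python (not Spark itself). It evaluates the smoothed formula quoted further down in this thread, log((m + 1) / (d(t) + 1)), in both bases for m = 3. The function name `smoothed_idf` is made up for illustration.

```python
import math

def smoothed_idf(num_docs, doc_freq, log=math.log):
    """idf = log((m + 1) / (d(t) + 1)); `log` selects the base (natural log by default)."""
    return log((num_docs + 1) / (doc_freq + 1))

m = 3  # number of documents in the example
for d in (1, 2, 3):
    base10 = smoothed_idf(m, d, math.log10)
    natural = smoothed_idf(m, d)
    print(f"d(t)={d}: log10={base10:.10f}  ln={natural:.10f}")
```

With the natural log, d(t) = 1 gives about 0.693 and d(t) = 2 gives about 0.288, against 0.301 and 0.125 in base 10.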


On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab <ashic@live.com> wrote:
Hi Andrejs,
The calculations are a bit different from what I've come across in Mining of Massive Datasets (2nd Ed., Ullman et al., Cambridge University Press), available here:

Their calculation of IDF is as follows:

IDF_i = log2(N / n_i)

where N is the number of documents and n_i is the number of documents in which the word appears. This looks different from your IDF function.

For TF, they use

TF_ij = f_ij / max_k f_kj

That is:

For document j,
     the term frequency of the term i in j is the number of times i appears in j divided by the maximum number of times any term appears in j. (Stop words are usually excluded when considering the maximum.)

So, in your case, for the first document:

TFa1 = 2 / 2 = 1
TFb1 = 1 / 2 = 0.5
TFc1 = 1 / 2 = 0.5
TFm1 = 2 / 2 = 1
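
The TF values above can be reproduced with a small sketch in plain Python. The function name `max_normalized_tf` and the whitespace tokenization are assumptions for illustration, and no stop words are removed.

```python
from collections import Counter

def max_normalized_tf(document):
    """TF = (count of term) / (count of the most frequent term), per document."""
    counts = Counter(document.split())
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf_doc1 = max_normalized_tf("a a b c m m")
print(tf_doc1)  # {'a': 1.0, 'b': 0.5, 'c': 0.5, 'm': 1.0}
```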

IDFa = log2(3 / 2) = 0.585

So, TFa1 * IDFa = 0.585
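
As a check on that arithmetic, a minimal sketch in plain Python using the three example documents from the original message. The helper name `idf_log2` is hypothetical, and it assumes the term occurs in at least one document (otherwise it would divide by zero).

```python
import math

docs = ["a a b c m m", "e a c d e e", "d j k l m m c"]

def idf_log2(term, documents):
    """IDF = log2(N / n), where n is the number of documents containing the term."""
    n = sum(1 for doc in documents if term in doc.split())
    return math.log2(len(documents) / n)  # assumes n > 0

idf_a = idf_log2("a", docs)  # "a" appears in 2 of the 3 documents
tf_a1 = 2 / 2                # "a" occurs twice in document 1; the max count there is 2
print(round(tf_a1 * idf_a, 3))  # 0.585
```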

Wikipedia mentions an adjustment to overcome biases for long documents, by calculating TF_ij = 0.5 + 0.5 * f_ij / max_k f_kj, but that doesn't change anything for TFa1, as the value remains 1.
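
A short sketch of that adjustment in plain Python, with a made-up function name `augmented_tf`, shows that the most frequent term still scores 1 while the others are pulled up towards it:

```python
def augmented_tf(count, max_count):
    """Augmented TF: 0.5 + 0.5 * count / max_count."""
    return 0.5 + 0.5 * count / max_count

print(augmented_tf(2, 2))  # 1.0  (term a in document 1)
print(augmented_tf(1, 2))  # 0.75 (terms b and c in document 1)
```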

In other words, my calculations don't agree with yours, and neither seems to agree with Spark :)


Date: Thu, 30 Oct 2014 22:13:49 +0000
Subject: how idf is calculated
From: andrejs@sindicetech.com
To: user@spark.incubator.apache.org

I'm writing a paper and I need to calculate tf-idf. With your help I managed to get the results I needed, but the problem is that I need to be able to explain how each number was obtained. So I tried to understand how idf is calculated, and the numbers I get don't correspond to those I should get.

I have 3 documents (each line a document)
a a b c m m
e a c d e e
d j k l m m c

When I calculate tf, I get this 

idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1))
m - number of documents (3 in my case).
d(t) - number of documents in which the term t is present
a: log(4/3) = 0.1249387366
b: log(4/2) = 0.3010299957
c: log(4/4) = 0
d: log(4/3) = 0.1249387366
e: log(4/2) = 0.3010299957
l: log(4/2) = 0.3010299957
m: log(4/3) = 0.1249387366
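
The hand calculation above can be double-checked with a minimal sketch in plain Python. It counts d(t) from the three documents and applies the formula with a base-10 log, which is the assumption made in this message. This is not Spark's implementation, and the variable names are illustrative. It also prints j and k, which are not in the list above.

```python
import math

docs = ["a a b c m m", "e a c d e e", "d j k l m m c"]
m = len(docs)

# d(t): number of documents in which each term appears at least once
doc_freq = {}
for doc in docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# idf = log10((m + 1) / (d(t) + 1)); base 10 is an assumption here
idf10 = {t: math.log10((m + 1) / (d + 1)) for t, d in sorted(doc_freq.items())}
for term, value in idf10.items():
    print(f"{term}: d(t)={doc_freq[term]}  idf={value:.10f}")
```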

When I output the idf vector with `idf.idf.toArray.filter(_.>(0)).distinct.foreach(println(_))`
I get:

I understand why there are only 3 numbers, because only 3 are unique: log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in idf were calculated.

Best regards,