I found my problem. Based on the TF-IDF article on Wikipedia, I assumed that
log base 10 is used, but as I found in this discussion, in Scala it is
actually ln (the natural logarithm).

Regards,
Andrejs

On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab wrote:

> Hi Andrejs,
> The calculations are a bit different to what I've come across in Mining
> Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here:
> http://www.mmds.org/
>
> Their calculation of IDF is as follows:
>
>     IDFi = log2(N / ni)
>
> where N is the number of documents and ni is the number of documents in
> which the word appears. This looks different to your IDF function.
>
> For TF, they use
>
>     TFij = fij / maxk fkj
>
> That is:
>
> For document j, the term frequency of the term i in j is the number of
> times i appears in j divided by the maximum number of times any term
> appears in j. (Stop words are usually excluded when considering the
> maximum.)
>
> So, in your case,
>
>     TFa1 = 2 / 2 = 1
>     TFb1 = 1 / 2 = 0.5
>     TFc1 = 1 / 2 = 0.5
>     TFm1 = 2 / 2 = 1
>     ...
>
>     IDFa = log2(3 / 2) = 0.585
>
> So, TFa1 * IDFa = 0.585
>
> Wikipedia mentions an adjustment to overcome biases for long documents, by
> calculating TFij = 0.5 + {(0.5*fij)/maxk fkj}, but that doesn't change
> anything for TFa1, as the value remains 1.
>
> In other words, my calculations don't agree with yours, and neither seems
> to agree with Spark :)
>
> Regards,
> Ashic.
>
> ------------------------------
> Date: Thu, 30 Oct 2014 22:13:49 +0000
> Subject: how idf is calculated
> From: andrejs@sindicetech.com
> To: user@spark.incubator.apache.org
>
> Hi,
> I'm writing a paper and I need to calculate tf-idf. With your help I
> managed to get the results I needed, but the problem is that I need to be
> able to explain how each number was obtained. So I tried to understand how
> idf is calculated, and the numbers I get don't correspond to those I should
> get.
>
> I have 3 documents (each line is a document):
> a a b c m m
> e a c d e e
> d j k l m m c
>
> When I calculate tf, I get this:
> (1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
> (1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
> (1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])
>
> idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where
> m - number of documents (3 in my case)
> d(t) - number of documents in which the term t is present
>
> a: log(4/3) = 0.1249387366
> b: log(4/2) = 0.3010299957
> c: log(4/4) = 0
> d: log(4/3) = 0.1249387366
> e: log(4/2) = 0.3010299957
> l: log(4/2) = 0.3010299957
> m: log(4/3) = 0.1249387366
>
> When I output the idf vector with
> `idf.idf.toArray.filter(_.>(0)).distinct.foreach(println(_))`
> I get:
> 1.3862943611198906
> 0.28768207245178085
> 0.6931471805599453
>
> I understand why there are only 3 numbers, because only 3 are unique:
> log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in idf
> were calculated.
>
> Best regards,
> Andrejs
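
For anyone who wants to check the numbers in this thread, here is a minimal sketch in plain Scala (no Spark needed). It evaluates the quoted formula idf = log((m + 1) / (d(t) + 1)) for m = 3 with both the natural logarithm and log base 10. The object and helper names (IdfCheck, idfLn, idfLog10) are made up for this illustration and are not part of Spark's API.

```scala
object IdfCheck {
  // Number of documents in the example from the original message.
  val m = 3

  // Hypothetical helpers for this sketch only.
  // math.log is the natural logarithm (ln) in Scala.
  def idfLn(df: Int): Double = math.log((m + 1.0) / (df + 1.0))

  // The base-10 variant that was assumed in the original message.
  def idfLog10(df: Int): Double = math.log10((m + 1.0) / (df + 1.0))

  def main(args: Array[String]): Unit = {
    // df is d(t): the number of documents that contain the term.
    for (df <- 0 to 3) {
      println(f"d(t)=$df  ln: ${idfLn(df)}%.10f  log10: ${idfLog10(df)}%.10f")
    }
  }
}
```

With the natural logarithm, d(t) = 1 and d(t) = 2 give the 0.693... and 0.287... values from the idf vector, while the log10 column gives the hand-calculated 0.301... and 0.124... values. The remaining 1.386... equals ln(4/1), which is what the formula yields for d(t) = 0, so it presumably comes from hash buckets that no term in the three documents maps to; that part is an inference from the arithmetic, not something confirmed in the thread.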