I found my problem. Based on the TF-IDF article on Wikipedia, I assumed that
log base 10 is used, but as I found in this discussion, in
Scala it is actually ln (natural logarithm).
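
To make the difference concrete, here is a minimal sketch in plain Python (not the Spark code itself) that recomputes the values from the question below with both logarithms. It assumes the smoothed formula idf = log((m + 1) / (d(t) + 1)) quoted below; the names `doc_freq` and `smoothed_idf` are only illustrative. It also assumes that a hash bucket no term maps to is treated as d(t) = 0, which would account for the 1.386... value, since that equals ln(4/1), while ln(4/4) = 0 is dropped by the `> 0` filter.

```python
import math

# The three example documents from the original question (one per line).
docs = ["a a b c m m", "e a c d e e", "d j k l m m c"]
m = len(docs)

# d(t): number of documents that contain term t at least once.
doc_freq = {}
for doc in docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1


def smoothed_idf(df, log=math.log):
    """Return log((m + 1) / (df + 1)); natural log unless another log is passed."""
    return log((m + 1) / (df + 1))


for term in sorted(doc_freq):
    df = doc_freq[term]
    print(f"{term}: d(t)={df}  ln={smoothed_idf(df):.4f}  "
          f"log10={smoothed_idf(df, math.log10):.4f}")

# Assumption: a feature index that no term hashes to has d(t) = 0.
print(f"unseen: d(t)=0  ln={smoothed_idf(0):.4f}")
```

With the natural log, d(t) = 2 gives about 0.2877 and d(t) = 1 gives about 0.6931, which matches the Spark output; base 10 gives the 0.1249 and 0.3010 from the original calculation.
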
Regards,
Andrejs
On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab wrote:
> Hi Andrejs,
> The calculations are a bit different to what I've come across in Mining
> Massive Datasets (2nd ed., Ullman et al., Cambridge University Press), available here:
> http://www.mmds.org/
>
> Their calculation of IDF is as follows:
>
> IDFi = log2(N / ni)
>
> where N is the number of documents and ni is the number of documents in
> which the word appears. This looks different to your IDF function.
>
> For TF, they use
>
> TFij = fij / maxk fkj
>
> That is:
>
> For document j, the term frequency of term i in j is the number of times i
> appears in j divided by the maximum number of times any term appears in j.
> (Stop words are usually excluded when considering the maximum.)
>
> So, in your case, for the first document:
>
> TFa1 = 2 / 2 = 1
> TFb1 = 1 / 2 = 0.5
> TFc1 = 1 / 2 = 0.5
> TFm1 = 2 / 2 = 1
> ...
>
> IDFa = log2(3 / 2) = 0.585
>
> So, TFa1 * IDFa = 0.585
>
> Wikipedia mentions an adjustment to overcome biases for long documents, by
> calculating TFij = 0.5 + (0.5 * fij) / maxk fkj, but that doesn't change
> anything for TFa1, as the value remains 1.
>
> In other words, my calculations don't agree with yours, and neither seems
> to agree with Spark :)
>
> Regards,
> Ashic.
>
> ------------------------------
> Date: Thu, 30 Oct 2014 22:13:49 +0000
> Subject: how idf is calculated
> From: andrejs@sindicetech.com
> To: user@spark.incubator.apache.org
>
>
> Hi,
> I'm writing a paper and I need to calculate tf-idf. With your help I
> managed to get the results I needed, but the problem is that I need to be able
> to explain how each number was obtained. So I tried to understand how idf was
> calculated, and the numbers I get don't correspond to those I should get.
>
> I have 3 documents (each line a document)
> a a b c m m
> e a c d e e
> d j k l m m c
>
> When I calculate tf, I get this:
> (1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
> (1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
> (1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])
>
> idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where
> m is the number of documents (3 in my case) and
> d(t) is the number of documents in which term t is present:
> a: log(4/3) = 0.1249387366
> b: log(4/2) = 0.3010299957
> c: log(4/4) = 0
> d: log(4/3) = 0.1249387366
> e: log(4/2) = 0.3010299957
> l: log(4/2) = 0.3010299957
> m: log(4/3) = 0.1249387366
>
> When I output the idf vector with
> `idf.idf.toArray.filter(_.>(0)).distinct.foreach(println(_))`
> I get:
> 1.3862943611198906
> 0.28768207245178085
> 0.6931471805599453
>
> I understand why there are only 3 numbers, because only 3 are unique:
> log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in idf
> were calculated.
>
> Best regards,
> Andrejs
>
>
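
For comparison, the definitions Ashic quotes above (TFij = fij / maxk fkj and IDFi = log2(N / ni)) can be sketched the same way. This is only an illustration in plain Python of those two formulas applied to the three example documents; the helper names `mmds_tf` and `mmds_idf` are made up here, and no stop words are removed.

```python
import math
from collections import Counter

# The three example documents from the question, tokenized on whitespace.
docs = [d.split() for d in ["a a b c m m", "e a c d e e", "d j k l m m c"]]
n_docs = len(docs)


def mmds_tf(term, doc):
    """Count of `term` in `doc` divided by the count of the most frequent term."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())


def mmds_idf(term):
    """log2(N / n_i), where n_i is the number of documents containing `term`."""
    n_i = sum(1 for doc in docs if term in doc)
    return math.log2(n_docs / n_i)


# Term "a" in the first document: TF = 2 / 2, IDF = log2(3 / 2).
tf_a1 = mmds_tf("a", docs[0])
idf_a = mmds_idf("a")
print(f"TF={tf_a1}  IDF={idf_a:.3f}  TF*IDF={tf_a1 * idf_a:.3f}")
```

This reproduces the 0.585 above, and shows that the gap from the Spark numbers comes from three choices: the log base, the +1 smoothing, and the TF normalization.
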