I found my problem. Based on the TF-IDF article in Wikipedia, I assumed that log base 10 is used, but as I found in this discussion, in Scala it is actually ln (natural logarithm).
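For anyone checking the same numbers, here is a minimal Scala sketch of the difference. It assumes the smoothed formula quoted in the original question, idf = log((m + 1) / (d(t) + 1)), with m = 3; the names `IdfLogBase`, `idfLn` and `idfLog10` are made up for illustration and are not part of Spark.

```scala
// Illustrative only: compares the two log bases for the smoothed idf formula
// idf = log((m + 1) / (d(t) + 1)) quoted in this thread. Not Spark code.
object IdfLogBase {
  // Natural logarithm, which is what scala.math.log computes.
  def idfLn(m: Int, df: Int): Double = math.log((m + 1.0) / (df + 1.0))

  // Base-10 logarithm, as assumed in the hand calculation.
  def idfLog10(m: Int, df: Int): Double = math.log10((m + 1.0) / (df + 1.0))

  def main(args: Array[String]): Unit = {
    val m = 3 // number of documents
    for (df <- 0 to m) {
      println(f"d(t)=$df  ln=${idfLn(m, df)}%.10f  log10=${idfLog10(m, df)}%.10f")
    }
  }
}
```

With m = 3, the natural-log values for d(t) = 0, 1 and 2 are ln(4), ln(2) and ln(4/3), roughly 1.3863, 0.6931 and 0.2877, which line up with the three values printed by Spark; d(t) = 3 gives 0 and is dropped by the `> 0` filter.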

Regards,

Andrejs

On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab <ashic@live.com> wrote:

Hi Andrejs,

The calculations are a bit different to what I've come across in Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here: http://www.mmds.org/

Their calculation of IDF is as follows:

IDFi = log2(N / ni)

where N is the number of documents and ni is the number of documents in which the word appears. This looks different to your IDF function.

For TF, they use

TFij = fij / maxk fkj

That is: for document j, the term frequency of the term i in j is the number of times i appears in j divided by the maximum number of times any term appears in j. (Stop words are usually excluded when considering the maximum.)

So, in your case, TFa1 = 2 / 2 = 1

TFb1 = 1 / 2 = 0.5

TFc1 = 1 / 2 = 0.5

TFm1 = 2 / 2 = 1

...

IDFa = log2(3 / 2) = 0.585

So, TFa1 * IDFa = 0.585
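The arithmetic above can be checked with a short Scala sketch. It only assumes the two textbook definitions quoted in this message (TFij = fij / maxk fkj and IDFi = log2(N / ni)) and the three documents from the original question; the object and method names are hypothetical.

```scala
// Illustrative only: TF and IDF as defined in the message above
// (TFij = fij / maxk fkj, IDFi = log2(N / ni)). Names are hypothetical.
object MmdsTfIdf {
  // Term frequency normalised by the count of the most frequent term.
  // Assumes a non-empty document (max would fail on an empty one).
  def tf(doc: Seq[String]): Map[String, Double] = {
    val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size }
    val maxCount = counts.values.max.toDouble
    counts.map { case (t, c) => t -> c / maxCount }
  }

  // IDFi = log2(N / ni), where ni is the number of documents containing term i.
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size.toDouble
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    docFreq.map { case (t, ni) => t -> math.log(n / ni) / math.log(2.0) }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq("a a b c m m", "e a c d e e", "d j k l m m c").map(_.split(" ").toSeq)
    val tfA = tf(docs.head)("a")
    val idfA = idf(docs)("a")
    println(s"TFa1 = $tfA, IDFa = $idfA, TFa1 * IDFa = ${tfA * idfA}")
  }
}
```

No stop-word handling is attempted here, so the maximum is taken over every term in the document.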

Wikipedia mentions an adjustment to overcome biases for long documents, by calculating TFij = 0.5 + {(0.5*fij)/maxk fkj}, but that doesn't change anything for TFa1, as the value remains 1.
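A minimal Scala sketch of that adjusted term frequency, applied to the first document from the question, is given here as an illustration. The names `AugmentedTf` and `augmentedTf` are hypothetical and the code only follows the formula as written above.

```scala
// Illustrative only: the adjusted term frequency mentioned above,
// TFij = 0.5 + 0.5 * fij / maxk fkj. Names are hypothetical.
object AugmentedTf {
  // Assumes a non-empty document (max would fail on an empty one).
  def augmentedTf(doc: Seq[String]): Map[String, Double] = {
    val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size }
    val maxCount = counts.values.max.toDouble
    counts.map { case (t, c) => t -> (0.5 + 0.5 * c / maxCount) }
  }

  def main(args: Array[String]): Unit = {
    val doc1 = "a a b c m m".split(" ").toSeq
    // Print terms in alphabetical order for readability.
    augmentedTf(doc1).toSeq.sortBy(_._1).foreach { case (t, v) => println(s"$t -> $v") }
  }
}
```

Terms that reach the maximum count (a and m in this document) keep the value 1, while the others move from 0.5 to 0.75.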

In other words, my calculations don't agree with yours, and neither seem to agree with Spark :)

Regards,

Ashic.

Date: Thu, 30 Oct 2014 22:13:49 +0000

Subject: how idf is calculated

From: andrejs@sindicetech.com

To: user@spark.incubator.apache.org

Hi,

I'm writing a paper and I need to calculate tf-idf. With your help I managed to get the results I needed, but the problem is that I need to be able to explain how each number was obtained. So I tried to understand how idf was calculated, and the numbers I get don't correspond to those I should get.

I have 3 documents (each line a document):

a a b c m m

e a c d e e

d j k l m m c

When I calculate tf, I get this:

(1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])

(1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])

(1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])

idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1))

m - number of documents (3 in my case).

d(t) - the number of documents in which the term is present

a: log(4/3) = 0.1249387366

b: log(4/2) = 0.3010299957

c: log(4/4) = 0

d: log(4/3) = 0.1249387366

e: log(4/2) = 0.3010299957

l: log(4/2) = 0.3010299957

m: log(4/3) = 0.1249387366

When I output the idf vector ` idf.idf.toArray.filter(_.>(0)).distinct.foreach(println(_)) `

I get:

1.3862943611198906

0.28768207245178085

0.6931471805599453

I understand why there are only 3 numbers, because only 3 are unique: log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in idf were calculated.
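One way to read the tf vectors quoted in this message is the following plain-Scala sketch. It rests on an assumption that is not confirmed anywhere in this thread: that each term is mapped to the index `hashCode mod 1048576`. For a one-character string the hash code is simply the character code, so "a" would land at index 97 and "m" at 109. All names here are hypothetical and this is not Spark's code.

```scala
// Illustrative only: a hashing term-frequency count in plain Scala.
// The mapping term -> hashCode mod numFeatures is an assumption about how
// the indices in the tf vectors were produced; it is not Spark's code.
object HashingTfSketch {
  val numFeatures: Int = 1 << 20 // 1048576, the vector size in the tf output

  // Non-negative remainder, so negative hash codes still give a valid index.
  def indexOf(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Sparse term-frequency vector as (index, count) pairs sorted by index.
  def termFrequencies(doc: Seq[String]): Seq[(Int, Double)] =
    doc.groupBy(indexOf).map { case (i, occ) => i -> occ.size.toDouble }.toSeq.sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val docs = Seq("a a b c m m", "e a c d e e", "d j k l m m c").map(_.split(" ").toSeq)
    docs.foreach(d => println(termFrequencies(d).mkString("[", ", ", "]")))
  }
}
```

Under this assumption, every index that no term hashes to has d(t) = 0, so the formula gives log(4/1) for it, and such a value is greater than 0 and survives the filter.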

Best regards,

Andrejs