spark-user mailing list archives

From Xi Shen <>
Subject Please help me understand TF-IDF Vector structure
Date Sat, 14 Mar 2015 07:05:07 GMT

I read this document and tried
to build a TF-IDF model of my documents.

I have a list of documents. Each word is represented as an Int, and each
document is listed on one line.

doc_name, int1, int2...
doc_name, int3, int4...

This is how I load my documents:
val documents: RDD[Seq[Int]] = sc
  .objectFile[(String, Seq[Int])](s"$sparkStore/documents")
  .map(_._2)
  .cache()

Then I did:

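import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)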
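
To make the hashing step above concrete, here is a minimal sketch in plain Scala of how a term could be mapped to a vector index. It assumes the behaviour of the Spark 1.x MLlib HashingTF: a default of 2^20 = 1048576 features, and an index computed as the term's hash code modulo that number. `HashingSketch` and its members are hypothetical names used only for illustration; this is not the MLlib class, and other Spark versions may hash differently.

```scala
// Hypothetical stand-in for the hashing trick; not the MLlib class itself.
object HashingSketch {
  // Assumed default feature count: 2^20 = 1048576.
  val numFeatures: Int = 1 << 20

  // Map a term to a bucket: hash code modulo numFeatures, kept non-negative.
  def indexOf(term: Any): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count occurrences per bucket; returns (index, count) pairs sorted by index.
  def termFrequencies(doc: Seq[Any]): Seq[(Int, Double)] =
    doc
      .groupBy(indexOf)
      .map { case (i, terms) => (i, terms.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

Under these assumptions, an Int word id smaller than 1048576 lands in the bucket with the same number, so `HashingSketch.termFrequencies(Seq(4, 0, 4))` gives one entry for index 0 and one for index 4.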

I wrote the tfidf result to a text file to try to understand its structure:
FileUtils.writeLines(new File("tfidf.out"),

What I get is something like:

(1048576,[0,4,7,8,10,13,17,21....],[...some float numbers...])

I think it's a tuple with 3 elements.

   - I have no idea what the 1st element is...
   - I think the 2nd element is a list of the words
   - I think the 3rd element is a list of the tf-idf values of the words in the
   previous list

Please help me understand this structure.
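
One possible reading, assuming the line above is the printed form of an MLlib sparse vector: the three parts would be the vector size, the indices of the non-zero entries, and the values at those indices. MLlib builds such vectors with `Vectors.sparse(size, indices, values)`. The sketch below mirrors that layout in plain Scala so it runs without Spark. `SparseVec` and `SparseVecDemo` are hypothetical names, and the numbers are made up for illustration.

```scala
// Hypothetical stand-in that mirrors the printed (size, indices, values) layout.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "indices and values must align")

  // Expand to a dense array: every position not listed in indices is 0.0.
  def toDense: Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // Same shape as the output shown above: (size,[indices],[values]).
  override def toString: String =
    s"($size,${indices.mkString("[", ",", "]")},${values.mkString("[", ",", "]")})"
}

object SparseVecDemo {
  // Made-up numbers for illustration only.
  val v = SparseVec(10, Array(0, 4, 7), Array(1.5, 0.5, 2.0))
}
```

Here `SparseVecDemo.v` is a vector of length 10 whose positions 0, 4 and 7 hold 1.5, 0.5 and 2.0, with zeros everywhere else. Read the same way, 1048576 would be the total number of hash buckets rather than a word.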

