spark-user mailing list archives

From Xi Shen <davidshe...@gmail.com>
Subject Re: Please help me understand TF-IDF Vector structure
Date Sat, 14 Mar 2015 07:36:52 GMT
Hey, I worked it out myself :)

The "Vector" is actually a "SparseVector", so when it is written to a
string, the format is

(size, [coordinate....], [value...])


Simple!
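
To make the layout concrete, here is a minimal sketch in plain Scala. The names SparseSketch, SparseSketchDemo and bucket are made up for illustration and are not part of MLlib; in MLlib itself such a vector is built with Vectors.sparse(size, indices, values). The sketch prints the same three-part layout and expands it to a dense array. The bucket helper reflects my understanding of how HashingTF picks an index in the 1.2.x line (non-negative remainder of the term's hash code), which would also explain the first element in the quoted output: 1048576 is 2^20, the default number of features. Treat that part as an assumption and check it against your Spark version.

```scala
// Hypothetical stand-in for MLlib's SparseVector, only to illustrate the printed layout.
case class SparseSketch(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "one value per index")

  // Same layout as the text output: (size,[indices],[values])
  override def toString: String =
    s"($size,${indices.mkString("[", ",", "]")},${values.mkString("[", ",", "]")})"

  // Expand to a dense array; every position not listed in `indices` stays 0.0.
  def toDense: Array[Double] = {
    val dense = new Array[Double](size)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }
}

object SparseSketchDemo {
  // Assumed HashingTF-style bucket: non-negative remainder of the term's hash code.
  def bucket(term: Any, numFeatures: Int): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  def main(args: Array[String]): Unit = {
    val v = SparseSketch(8, Array(1, 4), Array(0.5, 2.0))
    println(v)                       // (8,[1,4],[0.5,2.0])
    println(v.toDense.mkString(",")) // 0.0,0.5,0.0,0.0,2.0,0.0,0.0,0.0
    println(bucket(7, 1 << 20))      // 7: a small Int term lands in its own bucket
  }
}
```

So each position in the second list is a feature index, and the value at the same position in the third list is the tf-idf weight for that index; everything not listed is zero.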


On Sat, Mar 14, 2015 at 6:05 PM Xi Shen <davidshen84@gmail.com> wrote:

> Hi,
>
> I read this document,
> http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and
> tried to build a TF-IDF model of my documents.
>
> I have a list of documents. Each word is represented as an Int, and each
> document is listed on one line.
>
> doc_name, int1, int2...
> doc_name, int3, int4...
>
> This is how I load my documents:
> val documents: RDD[Seq[Int]] =
>   sc.objectFile[(String, Seq[Int])](s"$sparkStore/documents")
>     .map(_._2)
>     .cache()
>
> Then I did:
>
> val hashingTF = new HashingTF()
> val tf: RDD[Vector] = hashingTF.transform(documents)
> val idf = new IDF().fit(tf)
> val tfidf = idf.transform(tf)
>
> I wrote the tfidf model to a text file to try to understand the structure.
> FileUtils.writeLines(new File("tfidf.out"),
> tfidf.collect().toList.asJavaCollection)
>
> What I get is something like:
>
> (1048576,[0,4,7,8,10,13,17,21....],[...some float numbers...])
> ...
>
> I think it's a tuple with 3 elements.
>
>    - I have no idea what the 1st element is...
>    - I think the 2nd element is a list of the words
>    - I think the 3rd element is a list of the tf-idf values of the words in
>    the previous list
>
> Please help me understand this structure.
>
>
> Thanks,
> David
>
>
>
>
