spark-user mailing list archives

From Dan Dong <dongda...@gmail.com>
Subject question about the TFIDF.
Date Wed, 06 May 2015 19:44:31 GMT
Hi all,
  I am following the TF-IDF example in the documentation at
http://spark.apache.org/docs/latest/mllib-feature-extraction.html
and ran this code:

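     import org.apache.spark.{SparkConf, SparkContext}
     import org.apache.spark.mllib.feature.{HashingTF, IDF}
     import org.apache.spark.mllib.linalg.Vector
     import org.apache.spark.rdd.RDD

     val conf = new SparkConf().setAppName("TFIDF")
     val sc = new SparkContext(conf)

     // Each input line is treated as one document, split into words on spaces.
     val documents: RDD[Seq[String]] =
       sc.textFile("hdfs://cluster-test-1:9000/user/ubuntu/textExample.txt")
         .map(_.split(" ").toSeq)
     val hashingTF = new HashingTF()
     val tf: RDD[Vector] = hashingTF.transform(documents)
     tf.cache()  // tf is used twice: to fit the IDF model and to be transformed
     val idf = new IDF().fit(tf)
     val tfidf: RDD[Vector] = idf.transform(tf)
     tfidf.saveAsTextFile("/user/ubuntu/aaa")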

I got the following three lines of output, which correspond to the three lines
of my input file (each line can be viewed as a separate document):
(1048576,[3211,72752,119839,413342,504006,714241],[1.3862943611198906,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453,0.6931471805599453])

(1048576,[53232,96852,109270,119839],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.0])

(1048576,[3139,5740,119839,502586,503762],[0.6931471805599453,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453])

    But how do I interpret this? How can I match the words to the TF-IDF values? E.g.:
word1->1.3862943611198906
word2->0.6931471805599453
......
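
One possible way to get such a word -> value mapping is sketched below. It relies on `HashingTF.indexOf`, which returns the vector index a term is hashed to, and on `Vector.apply` to read the value stored at that index. The object name `TfIdfLookup` and its two helper methods are hypothetical names introduced only for this illustration. Because hashing can map different words to the same index, colliding words would share one value.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Hypothetical helper, not part of Spark: recovers (word, weight) pairs
// for one document from its hashed vector.
object TfIdfLookup {
  // Group the distinct terms of a document by the index HashingTF assigns them.
  // Several terms can land on the same index (hash collision).
  def termsByIndex(hashingTF: HashingTF, doc: Seq[String]): Map[Int, Seq[String]] =
    doc.distinct.groupBy(term => hashingTF.indexOf(term))

  // Pair every distinct term of the document with the value stored at its index.
  def weightsByTerm(hashingTF: HashingTF, doc: Seq[String], vec: Vector): Seq[(String, Double)] =
    termsByIndex(hashingTF, doc).toSeq.flatMap { case (index, terms) =>
      terms.map(term => term -> vec(index))
    }
}
```

With the RDDs from the code above, something like `documents.zip(tfidf).map { case (doc, vec) => TfIdfLookup.weightsByTerm(hashingTF, doc, vec) }` should produce one list of (word, value) pairs per input line. This assumes `tfidf` keeps the partitioning and ordering of `documents`, which is expected here because it is derived from it only through per-element transformations.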

In general, how should people interpret/analyze the "tfidf" produced by the
following line?
val tfidf = idf.transform(tf)
Thanks!

  Cheers,
  Dan
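
As a rough aid for reading the numbers above, the sketch below recomputes them by hand. It assumes the smoothed formula given in the MLlib feature-extraction documentation, IDF = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The object name `IdfByHand` is hypothetical, and the term counts used in `main` are inferred from the output, not taken from the input file (which is not shown).

```scala
// Hypothetical stand-alone sketch; it does not call Spark.
object IdfByHand {
  // Smoothed inverse document frequency: log((m + 1) / (df + 1)).
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1).toDouble / (docFreq + 1).toDouble)

  // TF-IDF weight: term frequency in the document times the IDF of the term.
  def tfidf(termFreq: Double, numDocs: Long, docFreq: Long): Double =
    termFreq * idf(numDocs, docFreq)

  def main(args: Array[String]): Unit = {
    val m = 3L // three input lines, i.e. three documents
    println(tfidf(1.0, m, 3L)) // index present in all 3 documents -> 0.0
    println(tfidf(1.0, m, 1L)) // index present in 1 document, once -> 0.6931471805599453
    println(tfidf(2.0, m, 1L)) // index present in 1 document, count 2 -> 1.3862943611198906
  }
}
```

Under these assumptions, index 119839 (value 0.0 in every line) would be a word that occurs in all three documents, and the value 1.3862943611198906 at index 3211 would be a count of 2 in the first document, either one word used twice or two words hashed to the same index.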
