spark-user mailing list archives

From Meeraj Kunnumpurath <mee...@servicesymphony.com>
Subject Nearest neighbour search
Date Sun, 13 Nov 2016 15:04:07 GMT
Hello,

I have a dataset containing TF-IDF vectors for a corpus of documents. How
do I perform a nearest neighbour search on the dataset, using cosine
similarity?

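  import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

  // Read the headerless CSV; Spark names the columns _c0, _c1, _c2, ...
  val df = spark.read.option("header", "false").csv("data")

  // Split the text in the third column (_c2) into words
  val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")

  // Hash the words into term-frequency vectors
  val tf = new HashingTF().setInputCol("words").setOutputCol("tf")

  // Rescale the term frequencies by inverse document frequency
  val idf = new IDF().setInputCol("tf").setOutputCol("tf-idf")

  val df1 = tf.transform(tk.transform(df))

  idf.fit(df1).transform(df1).select("tf-idf").show(10)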
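
One brute-force way to do this would be to score every row against a query vector with a UDF and keep the top k rows. The following is only a minimal sketch of that idea, not an established Spark API: the helper name nearestNeighbours, the output column "cosine" and the parameters k and vecCol are made up for illustration, and it assumes the "tf-idf" column holds org.apache.spark.ml.linalg vectors as produced by the pipeline above.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: returns the k rows of `tfidf` whose vector in `vecCol`
// has the highest cosine similarity to `query`.
def nearestNeighbours(tfidf: DataFrame, query: Vector, k: Int,
                      vecCol: String = "tf-idf"): DataFrame = {
  val queryNorm = Vectors.norm(query, 2.0)

  val cosine = udf { v: Vector =>
    // Dot product over the non-zero entries of the row vector only.
    val s = v.toSparse
    var dot = 0.0
    var i = 0
    while (i < s.indices.length) {
      dot += s.values(i) * query(s.indices(i))
      i += 1
    }
    // cosine(a, b) = a.b / (|a| * |b|); treat all-zero vectors as similarity 0.
    val denom = Vectors.norm(v, 2.0) * queryNorm
    if (denom == 0.0) 0.0 else dot / denom
  }

  tfidf.withColumn("cosine", cosine(col(vecCol)))
    .orderBy(col("cosine").desc)
    .limit(k)
}

// Possible usage, taking the first document as the query (an arbitrary choice):
//   val tfidf = idf.fit(df1).transform(df1)
//   val query = tfidf.select("tf-idf").head.getAs[Vector](0)
//   nearestNeighbours(tfidf, query, 10).show()
```

This scans the whole dataset for every query, so it is exact but linear in the number of documents; the query document itself will normally come back as its own nearest neighbour.
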
Thank you

-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*
*00 971 50 409 0169*
*meeraj@servicesymphony.com*
