spark-dev mailing list archives

From Felix Cheung
Subject Re: ml word2vec findSynonyms return type
Date Thu, 05 Jan 2017 20:44:46 GMT
Given how Word2Vec is used in the pipeline model in the new ml implementation, we might need
to keep the current behavior?

From: Asher Krim
Sent: Tuesday, January 3, 2017 11:58 PM
Subject: Re: ml word2vec findSynonyms return type
To: Felix Cheung
Cc: Joseph Bradley

The jira:

Adding new methods could result in method clutter. Changing the behavior of non-experimental
classes is unfortunate (ml Word2Vec was marked Experimental until Spark 2.0). Neither option
is great. If I had to pick, I would rather change the existing methods to keep the class
simpler moving forward.

On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung wrote:
Could you link to the JIRA here?

What you suggest makes sense to me. Though we might want to maintain compatibility and add
a new method instead of changing the return type of the existing one.

From: Asher Krim
Sent: Wednesday, December 28, 2016 11:52 AM
Subject: ml word2vec findSynonyms return type
Cc: Joseph Bradley

Hey all,

I would like to propose changing the return type of `findSynonyms` in ml's Word2Vec:

def findSynonyms(word: String, num: Int): DataFrame = {
  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
}

I find it very strange that the results are parallelized before being returned to the user.
The results are already on the driver to begin with, and I can imagine that for most use cases
(and definitely for ours) the synonyms are collected right back to the driver. This incurs
the added cost of shipping data to and from the cluster, and it makes the interface more
cumbersome than it needs to be.
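
To illustrate the round trip described above, here is a rough sketch of what a caller of the
current DataFrame-returning method typically ends up writing. The helper name
`synonymsOnDriver` is made up for this example, and `model` is assumed to be an
already-fitted Word2VecModel:

```scala
import org.apache.spark.ml.feature.Word2VecModel

// Hypothetical helper, for illustration only: unwraps the DataFrame that
// findSynonyms currently returns into a plain local array of pairs.
def synonymsOnDriver(
    model: Word2VecModel,
    word: String,
    num: Int): Array[(String, Double)] = {
  model.findSynonyms(word, num)   // DataFrame with columns "word" and "similarity"
    .collect()                    // Array[Row], back on the driver
    .map(row => (row.getString(0), row.getDouble(1)))
}
```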

Can we change it to just the following?

def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
  wordVectors.findSynonyms(word, num)
}

If the user wants the results parallelized, they can still do so on their own.
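
As a rough sketch of what "on their own" could look like, assuming the proposed
Array[(String, Double)] return type: the helper name `synonymsAsDataFrame` is hypothetical,
and the column names simply mirror the ones used by the current implementation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, for illustration only: turns a local array of
// (word, similarity) pairs into a DataFrame when the caller wants one.
def synonymsAsDataFrame(
    spark: SparkSession,
    synonyms: Array[(String, Double)]): DataFrame = {
  spark.createDataFrame(synonyms.toSeq).toDF("word", "similarity")
}
```

Callers who only need the local pairs would skip this step entirely.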

(I had brought this up a while back in Jira. It was suggested that the mailing list would
be a better forum to discuss it, so here we are.)

Asher Krim
Senior Software Engineer
