From "Perry Chu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
Date Fri, 01 Jun 2018 02:12:00 GMT
Perry Chu created SPARK-24447:
---------------------------------

             Summary: Pyspark RowMatrix.columnSimilarities() loses spark context
                 Key: SPARK-24447
                 URL: https://issues.apache.org/jira/browse/SPARK-24447
             Project: Spark
          Issue Type: Bug
          Components: MLlib, PySpark
    Affects Versions: 2.3.0
            Reporter: Perry Chu


The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to lose track of the SparkContext.

I'm fairly new to Spark and not sure whether the problem is on the Python side or the Scala side; I'd appreciate someone more experienced taking a look.

This snippet should reproduce the error:
{code:python}
from pyspark.mllib.linalg.distributed import RowMatrix

rows = spark.sparkContext.parallelize([[0, 1, 2], [1, 1, 1]])
matrix = RowMatrix(rows)
sims = matrix.columnSimilarities()

## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
print(sims.numRows(), sims.numCols())

## This throws an error (stack trace below)
print(sims.entries.first())

## Later I tried this
print(rows.context)          ## <SparkContext master=yarn appName=Spark ML Pipeline>
print(sims.entries.context)  ## <SparkContext master=yarn appName=PySparkShell>, then throws an error
{code}
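For what it's worth, the two printouts above suggest the entries RDD is carrying a different SparkContext object (appName=PySparkShell) than the one the source RDD uses (appName=Spark ML Pipeline). A quick diagnostic along these lines (just a sketch; it assumes the live session is `spark` and pokes at the private `_jsc` attribute) seems consistent with that:
{code:python}
## Diagnostic sketch: compare the context attached to the entries RDD
## against the live one. `_jsc` is private API; this is just for poking around.
live = rows.context
stale = sims.entries.context
print(live is stale)  ## False - the entries RDD holds a different context object
print(stale._jsc)     ## None - would explain "'NoneType' object has no attribute 'sc'"
{code}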
Error stack trace:
{code}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-50f83a6cf449> in <module>()
----> 1 sims.entries.first()

/usr/lib/spark/python/pyspark/rdd.py in first(self)
   1374         ValueError: RDD is empty
   1375         """
-> 1376         rs = self.take(1)
   1377         if rs:
   1378             return rs[0]

/usr/lib/spark/python/pyspark/rdd.py in take(self, num)
   1356
   1357             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1358             res = self.context.runJob(self, takeUpToNumLeft, p)
   1359
   1360             items += res

/usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    999         # SparkContext#runJob.
   1000         mappedRDD = rdd.mapPartitions(partitionFunc)
-> 1001         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
   1002         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
   1003

AttributeError: 'NoneType' object has no attribute 'sc'
{code}
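Reading the trace, the failing expression is `self._jsc.sc()`, so `_jsc` is None on whatever SparkContext the entries RDD is holding. As a stopgap (purely a hypothetical workaround on my part, not an official fix), rebinding the RDD to the live context before running a job seems worth trying; pyspark RDDs store their context in the `ctx` attribute, which the `context` property returns:
{code:python}
## Hypothetical stopgap, not an official fix: rebind the entries RDD to the
## live SparkContext before running any job on it. Assumes `spark` is the
## live session. `ctx` is the attribute behind RDD.context.
entries = sims.entries
entries.ctx = spark.sparkContext
print(entries.first())  ## with the live context attached, the job should run
{code}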
PySpark columnSimilarities documentation:

http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities


