spark-issues mailing list archives

From "Perry Chu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24447) Pyspark RowMatrix.columnSimilarities() loses spark context
Date Thu, 07 Jun 2018 21:37:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505335#comment-16505335 ]

Perry Chu commented on SPARK-24447:
-----------------------------------

Thanks for following up. I tried running on a fresh Spark 2.3 download, and it worked for
me too.

After testing a few things, I realized there's one more thing I'm doing that causes the error.
Within PySpark, I stop and restart the Spark session to adjust some config settings. Could
you try the following?
{code}
## Do this before the code snippet
from pyspark.sql import SparkSession

spark.stop()  # stop the session that the pyspark shell created
spark = SparkSession.builder.getOrCreate()  # start a new session

## The code snippet should reproduce the error now
{code}
So it looks like the RDD behind the matrix returned by RowMatrix.columnSimilarities() still
refers to the original SparkSession (the one I stopped) and doesn't have a handle on the new one.
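
A quick way to check for this condition, as a minimal sketch: the traceback in the issue description fails on self._jsc.sc() with a NoneType error, which suggests the context's private _jsc attribute is None once the context has been stopped. The helper name context_is_stopped is hypothetical, _jsc is an internal PySpark detail that may change between versions, and the SimpleNamespace objects are only stand-ins for real RDDs so the example runs without a cluster.

```python
from types import SimpleNamespace


def context_is_stopped(rdd):
    """Return True if the RDD's context has no live JVM handle.

    Assumption: a stopped SparkContext has its private `_jsc` attribute
    set to None, as the AttributeError in the traceback suggests.
    """
    return getattr(rdd.context, "_jsc", None) is None


# Stand-ins for real RDDs (hypothetical objects, not PySpark classes).
live_rdd = SimpleNamespace(context=SimpleNamespace(_jsc=object()))
stale_rdd = SimpleNamespace(context=SimpleNamespace(_jsc=None))

print(context_is_stopped(live_rdd))   # False
print(context_is_stopped(stale_rdd))  # True
```

In a PySpark shell, the same check could presumably be applied to rows and to sims.entries from the snippet in the issue description, to see which of the two is still attached to a live context.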

I'm not quite sure this still counts as a bug. The obvious workaround is to set my config on
the command line or in the Spark conf rather than stopping and restarting Spark. However, it
seems strange to me that Spark appears to do all the work for RowMatrix.columnSimilarities()
(I see tasks being executed in the UI, and sims.numCols() works) but then can't display
the resulting matrix.

> Pyspark RowMatrix.columnSimilarities() loses spark context
> ----------------------------------------------------------
>
>                 Key: SPARK-24447
>                 URL: https://issues.apache.org/jira/browse/SPARK-24447
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Perry Chu
>            Priority: Major
>
> The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears
> to be losing track of the spark context.
> I'm pretty new to spark - not sure if the problem is on the python side or the scala
> side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(),sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context) #<SparkContext master=yarn appName=Spark ML Pipeline>
> print(sims.entries.context) #<SparkContext master=yarn appName = PySparkShell>, then throws an error
> {code}
> Error stack trace
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError Traceback (most recent call last)
> <ipython-input-47-50f83a6cf449> in <module>()
> ----> 1 sims.entries.first()
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
> 1374 ValueError: RDD is empty
> 1375 """
> -> 1376 rs = self.take(1)
> 1377 if rs:
> 1378 return rs[0]
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
> 1356
> 1357 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358 res = self.context.runJob(self, takeUpToNumLeft, p)
> 1359
> 1360 items += res
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
> 999 # SparkContext#runJob.
> 1000 mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> 1002 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
> 1003
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


