Hi Pedro,

Your use case is interesting. I think launching the Java gateway is the same as for the native SparkContext; the only difference is that you create your custom SparkContext instead of the native one. You might also need to wrap it in Java.

https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py#L172



On Thu, Jun 30, 2016 at 9:53 AM, Pedro Rodriguez <ski.rodriguez@gmail.com> wrote:
Hi All,

I have written a Scala package which essentially wraps the SparkContext in a custom class that adds some functionality specific to our internal use case. I am trying to figure out the best way to call this from PySpark.

I would like to do this similarly to how Spark itself calls the JVM SparkContext as in:
https://github.com/apache/spark/blob/v1.6.2/python/pyspark/context.py

My goal would be something like this:

Scala Code (this is done):
>>> import com.company.mylibrary.CustomContext
>>> val myContext = CustomContext(sc)
>>> val rdd: RDD[String] = myContext.customTextFile("path")

Python Code (I want to be able to do this):
>>> from company.mylibrary import CustomContext
>>> myContext = CustomContext(sc)
>>> rdd = myContext.customTextFile("path")

At the end of each snippet, I should be working with an ordinary RDD[String].

I am trying to access my Scala class through sc._jvm as below, but I am not having any luck so far.

My attempts:
>>> a = sc._jvm.com.company.mylibrary.CustomContext
>>> dir(a)
['<package or class name>']
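
One way I could check whether Py4J actually resolved the class (a hedged diagnostic sketch; Py4J resolves an unknown dotted name to a JavaPackage instead of raising an error, typically because the jar is not on the driver classpath):

```python
def resolved_as_class(jvm_ref):
    """Return True if Py4J resolved jvm_ref to a class rather than a package.

    Py4J gives back a JavaPackage (which looks nearly empty under dir())
    for any dotted name it cannot find on the JVM classpath, so checking
    the type distinguishes "class found" from "class missing".
    """
    return type(jvm_ref).__name__ == "JavaClass"

# Usage in the PySpark shell (assumes a live `sc`):
# resolved_as_class(sc._jvm.com.company.mylibrary.CustomContext)
```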

Example of what I want:
>>> a = sc._jvm.PythonRDD
>>> dir(a)
['anonfun$6', 'anonfun$8', 'collectAndServe', 'doubleRDDToDoubleRDDFunctions', 'getWorkerBroadcasts', 'hadoopFile', 'hadoopRDD', 'newAPIHadoopFile', 'newAPIHadoopRDD', 'numericRDDToDoubleRDDFunctions', 'rddToAsyncRDDActions', 'rddToOrderedRDDFunctions', 'rddToPairRDDFunctions', 'rddToPairRDDFunctions$default$4', 'rddToSequenceFileRDDFunctions', 'readBroadcastFromFile', 'readRDDFromFile', 'runJob', 'saveAsHadoopDataset', 'saveAsHadoopFile', 'saveAsNewAPIHadoopFile', 'saveAsSequenceFile', 'sequenceFile', 'serveIterator', 'valueOfPair', 'writeIteratorToStream', 'writeUTF']

The next thing I would run into is converting the JVM RDD[String] back to a Python RDD. What is the easiest way to do this?

Overall, is this a good approach to calling the same API in Scala and Python?

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni





--
Best Regards

Jeff Zhang