I ran the code below in standalone mode, with Python 2.7.6, spaCy 1.7+, and Spark 2.0.1.

I'm new to PySpark. Please help me understand the two versions of the code below.

Why does the first version run fine, while the second throws pickle.PicklingError: Can't pickle <cyfunction load.<locals>.<lambda> at 0x107e39110>?

(My suspicion is that the second approach fails because the nlp object cannot be serialized and shipped to the workers.)
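For background, here is a minimal sketch of the pickling distinction involved, independent of Spark and spaCy (assumed illustration, not the actual Spark code path): plain pickle serializes a module-level function from an importable module *by reference* (module name plus attribute name), but refuses a lambda, which has no importable qualified name. The object in the error above is exactly such a lambda, created inside spacy.load.

```python
import pickle
import os.path

# A function defined at top level of an importable module is pickled
# by reference: pickle records "os.path" + "join", not the code itself.
blob = pickle.dumps(os.path.join)
assert pickle.loads(blob) is os.path.join

# A lambda has no importable qualified name, so plain pickle refuses it.
try:
    pickle.dumps(lambda x: x)
    outcome = "pickled ok"
except pickle.PicklingError:
    outcome = "cannot pickle lambda"
print(outcome)  # cannot pickle lambda
```

Spark's cloudpickle is more capable than plain pickle, but it still has to serialize whatever a closure defined in the main script refers to, which is where a loaded spaCy model can get dragged in.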

1) Run-Success:

(SpacyExample-Module)

import spacy

nlp = spacy.load('en_default')

def spacyChunks(content):
    doc = nlp(content)
    mp = []
    for chunk in doc.noun_chunks:
        phrase = content[chunk.start_char:chunk.end_char]
        mp.append(phrase)
    return mp

if __name__ == '__main__':
    pass


Main-Module:

spark = SparkSession.builder.appName("readgzip").config(conf=conf).getOrCreate()

gzfile = spark.read.schema(schema).json("")

...

textresult.rdd.map(lambda x: x[0]) \
    .flatMap(lambda data: SpacyExample.spacyChunks(data)) \
    .saveAsTextFile("")




2) Run-Failure:

MainModule:

nlp = spacy.load('en_default')

def spacyChunks(content):
    doc = nlp(content)
    mp = []
    for chunk in doc.noun_chunks:
        phrase = content[chunk.start_char:chunk.end_char]
        mp.append(phrase)
    return mp

if __name__ == '__main__':
    # create SparkSession, read file, then:
    file.rdd.map(...).flatMap(lambda data: spacyChunks(data)).saveAsTextFile("")


Stack Trace:

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save

    f(self, obj) # Call unbound method with explicit self

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict

    self._batch_setitems(obj.iteritems())

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems

    save(v)

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save

    self.save_reduce(obj=obj, *rv)

  File "/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 535, in save_reduce

    save(args)

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save

    f(self, obj) # Call unbound method with explicit self

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple

    save(element)

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save

    f(self, obj) # Call unbound method with explicit self

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict

    self._batch_setitems(obj.iteritems())

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems

    save(v)

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 317, in save

    self.save_global(obj, rv)

  File "/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 390, in save_global

    raise pickle.PicklingError("Can't pickle %r" % obj)

pickle.PicklingError: Can't pickle <cyfunction load.<locals>.<lambda> at 0x107e39110>


--
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"