spark-user mailing list archives

From Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Subject RE: pickling a udf
Date Thu, 04 Apr 2019 17:36:56 GMT
It's running in local mode. I've run it in both PyCharm and JupyterLab, and I've restarted the kernel
several times.

B.



From: Abdeali Kothari <abdealikothari@gmail.com>
Sent: Thursday, April 4, 2019 06:35
To: Adaryl Wakefield <adaryl.wakefield@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: pickling a udf

The syntax looks right.
Are you still getting the error when you open a new Python session and run this same code?
Are you running on your laptop with Spark in local mode, or are you running this on a YARN-based
cluster?
It does seem like something in your Python session isn't getting serialized right, but it does
not look related to this code snippet.

On Thu, Apr 4, 2019 at 3:49 PM Adaryl Wakefield <adaryl.wakefield@hotmail.com>
wrote:
Are we not supposed to be using UDFs anymore? I copied an example straight from a book and
I'm getting weird results, and I think it's because the book is using a much older version
of Spark. The code below is pretty straightforward, but I'm getting an error nonetheless.
I've been doing a bunch of googling and not getting many results.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.csv("full201801.dat", header=True)

columntransform = udf(lambda x: 'Non-Fat Dry Milk' if x == '23040010' else 'foo', StringType())

df.select(df.PRODUCT_NC, columntransform(df.PRODUCT_NC).alias('COMMODITY')).show()

The error:
Py4JJavaError: An error occurred while calling o110.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed
1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 242, in main
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 144, in read_udfs
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 120, in read_single_udf
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 60, in read_command
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 171, in _read_with_length
    return self.loads(obj)
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line 566, in loads
    return pickle.loads(obj, encoding=encoding)
TypeError: _fill_function() missing 4 required positional arguments: 'defaults', 'dict', 'module',
and 'closure_values'
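
For context on why serialization matters here: the standard-library `pickle` cannot serialize a lambda at all, which is why Spark ships UDF bodies to its workers with a bundled copy of cloudpickle. An error like `_fill_function() missing 4 required positional arguments` usually indicates a cloudpickle version mismatch, i.e. the function bytes were written by one cloudpickle version and read back by a different one, which can happen when the driver and the worker pick up different Spark or Python installations. A minimal stdlib sketch of the underlying limitation (no Spark required; the lambda mirrors the UDF above):

```python
import pickle

# Same shape of lambda as the UDF in the snippet above.
fn = lambda x: 'Non-Fat Dry Milk' if x == '23040010' else 'foo'

# Plain pickle serializes functions by reference (module + qualified name),
# and a lambda's name '<lambda>' cannot be looked up, so pickling fails.
try:
    pickle.dumps(fn)
    picklable = True
except (pickle.PicklingError, AttributeError):
    picklable = False

print(picklable)  # False: plain pickle refuses the lambda
```

cloudpickle works around this by serializing the function's code object and closure by value, which is what makes the serialized form sensitive to the cloudpickle version on the deserializing side.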


B.


