spark-user mailing list archives

From Gourav Sengupta <gourav.sengu...@gmail.com>
Subject Re: pySpark - pandas UDF and binaryType
Date Sat, 04 May 2019 01:25:34 GMT
Also be aware that a pandas UDF does not always lead to better
performance; sometimes it is even massively slower.

With Grouped Map, don't you also run the risk of random memory errors?
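The memory risk Gourav mentions can be illustrated locally with plain pandas (a sketch, no Spark session required; the column names are taken from the thread): a grouped-map UDF receives each group as one in-memory pandas DataFrame, so a single skewed group can exhaust an executor's memory regardless of the overall data size.

```python
import pandas as pd

# Local analogue of Spark's GROUPED_MAP semantics: the function is
# handed each group as ONE fully materialized pandas DataFrame.
df = pd.DataFrame({
    "filename": ["a", "a", "b"],
    "contents": [b"\x00\x01", b"\x02", b"\x03\x04\x05"],
})

group_sizes = {}

def transform(pdf):
    # The entire group is in memory here at once; with binary file
    # contents, one large group means one large allocation.
    group_sizes[pdf["filename"].iloc[0]] = len(pdf)
    return pdf

out = df.groupby("filename", group_keys=False).apply(transform)
print(group_sizes)
```

On a real cluster the same pattern means group size, not total dataset size, bounds per-executor memory for a grouped-map UDF.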

On Thu, May 2, 2019 at 9:32 PM Bryan Cutler <cutlerb@gmail.com> wrote:

> Hi,
>
> BinaryType support was not added until Spark 2.4.0; see
> https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0
> or greater is required, as you saw in the docs.
>
> Bryan
>
> On Thu, May 2, 2019 at 4:26 AM Nicolas Paris <nicolas.paris@riseup.net>
> wrote:
>
>> Hi all
>>
>> I am using pySpark 2.3.0 and pyArrow 0.10.0
>>
>> I want to apply a pandas-udf on a dataframe with <String, binaryType>
>> I get the error below:
>>
>> > Invalid returnType with grouped map Pandas UDFs:
>> >
>> StructType(List(StructField(filename,StringType,true),StructField(contents,BinaryType,true)))
>> > is not supported
>>
>>
>> Am I missing something?
>> The doc
>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#supported-sql-types
>> says pyArrow 0.10 is the minimum needed to handle BinaryType.
>>
>> here is the code:
>>
>> > from pyspark.sql.functions import pandas_udf, PandasUDFType
>> >
>> > df = sql("select filename, contents from test_binary")
>> >
>> > @pandas_udf("filename String, contents binary", PandasUDFType.GROUPED_MAP)
>> > def transform_binary(pdf):
>> >     contents = pdf.contents
>> >     return pdf.assign(contents=contents)
>> >
>> > df.groupby("filename").apply(transform_binary).count()
>>
>> Thanks
>> --
>> nicolas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
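For what it's worth, the grouped-map body in Nicolas's snippet is plain pandas, so once on Spark 2.4+ it can be smoke-tested locally without a Spark session (a sketch; the sample filename and bytes are made up for illustration):

```python
import pandas as pd

def transform_binary(pdf):
    # Identity transform from the thread: reassign the binary column.
    contents = pdf.contents
    return pdf.assign(contents=contents)

# Feed it what Spark hands a grouped-map UDF: a pandas DataFrame whose
# BinaryType column arrives as Python bytes objects.
pdf = pd.DataFrame({"filename": ["a.bin"], "contents": [b"\xde\xad"]})
out = transform_binary(pdf)
print(out.contents.iloc[0])
```

This only checks the function logic; the BinaryType serialization itself still requires Spark 2.4.0+ with pyarrow 0.10.0+, per Bryan's reply.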
