spark-user mailing list archives

From "Mendelson, Assaf" <Assaf.Mendel...@rsa.com>
Subject RE: Nested UDFs
Date Thu, 17 Nov 2016 13:08:23 GMT
Then you probably want to create a normal function as opposed to a UDF.
A UDF takes your function and applies it to each element in the column, one after the other.
You can think of it as working on the result of a loop iterating over the column.

pyspark.sql.functions.regexp_replace receives a column and applies the regex to each element
to create a new column.
You can do it in one of two ways:
The first is using a UDF, in which case you shouldn’t use pyspark.sql.functions.regexp_replace
but instead the standard Python re module.
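A minimal sketch of the UDF route, assuming regexp_list holds (pattern, replacement) pairs (the list here is illustrative); note the per-row function uses Python's re module, not Spark's regexp_replace:

```python
import re

# Illustrative (pattern, replacement) pairs.
regexp_list = [('a', 'X'), ('b', 'Y')]

def replace_all(s):
    # Runs once per row inside the UDF, on a plain Python string.
    for match, repl in regexp_list:
        s = re.sub(match, repl, s)
    return s

# Wrapping it as a UDF (requires an active SparkSession):
#   from pyspark.sql.functions import udf
#   my_udf = udf(replace_all)
#   test_data.select(my_udf(test_data.name)).show()
```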
The second is to simply apply the column changes one after the other in a function. This should
be something like:
def my_f(target_col):
    for match, repl in regexp_list:
        target_col = regexp_replace(target_col, match, repl)
    return target_col

and then use it with:
  test_data.select(my_f(test_data.name))

The second option is preferable and should provide better performance: it builds a single chained column expression that Spark can optimize, rather than calling into Python for every row.

From: Perttu Ranta-aho [mailto:rantaaho@iki.fi]
Sent: Thursday, November 17, 2016 1:50 PM
To: user@spark.apache.org
Subject: Re: Nested UDFs

Hi,

My example was a little bogus; my real use case is to do multiple regexp replacements, so something
like:

def my_f(data):
    for match, repl in regexp_list:
       data = regexp_replace(data, match, repl)
    return data

I could achieve my goal with multiple .select(regexp_replace()) lines, but one UDF would be nicer.

-Perttu

On Thursday, 17 November 2016 at 9:42, Mendelson, Assaf <Assaf.Mendelson@rsa.com>
wrote:
regexp_replace is supposed to receive a column; you don’t need to write a UDF for it.
Instead try:
test_data.select(regexp_replace(test_data.name, 'a', 'X'))

You would need a UDF if you wanted to do something with the string value of a single row
(e.g. return data + “bla”)
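As a sketch, such a per-row UDF could look like this (the helper name add_bla is illustrative; the udf wrapping is commented out since it needs an active SparkSession):

```python
# Hypothetical per-row function for a UDF; the name add_bla is illustrative.
def add_bla(data):
    # Arbitrary Python logic applied to one row's string value.
    return data + "bla"

# To use it as a UDF (requires PySpark and an active SparkSession):
#   from pyspark.sql.functions import udf
#   my_udf = udf(add_bla)
#   test_data.select(my_udf(test_data.name)).show()
```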

Assaf.

From: Perttu Ranta-aho [mailto:rantaaho@iki.fi]
Sent: Thursday, November 17, 2016 9:15 AM
To: user@spark.apache.org
Subject: Nested UDFs

Hi,

Shouldn't this work?

from pyspark.sql.functions import regexp_replace, udf

def my_f(data):
    return regexp_replace(data, 'a', 'X')
my_udf = udf(my_f)

test_data = sqlContext.createDataFrame([('a',), ('b',), ('c',)], ('name',))
test_data.select(my_udf(test_data.name)).show()

But instead of 'a' being replaced with 'X' I get exception:
  File ".../spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1471, in regexp_replace
    jc = sc._jvm.functions.regexp_replace(_to_java_column(str), pattern, replacement)
AttributeError: 'NoneType' object has no attribute '_jvm'

???

-Perttu
