spark-user mailing list archives

From: Sean Owen <sro...@gmail.com>
Subject: Re: Regexp_extract not giving correct output
Date: Wed, 02 Dec 2020 15:36:48 GMT

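As in Java/Scala, in Python it is safest to escape the backslashes in the
pattern with \\. Python leaves an unrecognized escape such as "\[" unchanged
(newer versions warn about it), but a recognized one such as "\b" is silently
turned into a different character. You can also prefix the string literal
with 'r' to disable Python's handling of escapes.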
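
To make the two spellings concrete, here is a minimal sketch. It uses
Python's own re module only as a stand-in (Spark hands the pattern to Java's
regex engine instead), and the sample string is hypothetical:

```python
import re

# Two equivalent ways to write the two-character pattern: backslash + "[".
escaped = "\\["   # backslash doubled inside a normal string literal
raw = r"\["       # raw string literal: backslashes are left alone

# A recognized escape behaves differently: "\b" is a single backspace
# character, while r"\b" is the two-character regex word-boundary token.
backspace = "\b"
word_boundary = r"\b"

# Hypothetical sample text, only for illustration.
sample = "[OrderID: 42]"
bracket_match = re.search(raw + r"OrderID:\s(\d+)", sample)
order_id = bracket_match.group(1)
```

Either spelling sends the same characters to the regex engine; the raw form
is usually easier to read when the pattern contains many backslashes.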

On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka <connectsachit@gmail.com>
wrote:

> Hi All,
>
> I am using PySpark to get a value from a column on the basis of a regex.
>
> Following is the regex which I am using:
>
> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
>
> df = spark.createDataFrame([("[1234] [3333] [4444] [66]",),
> ("abcd",)],["stringValue"])
>
> result = df.withColumn('extracted value',
> F.regexp_extract(F.col('stringValue'),
> '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',
> 1))
>
> I have tried with spark.sql as well. It gives empty output.
>
> I have tested this regex and it works fine in an online regex tester,
> but it does not work in Spark. I know Spark needs a Java-based regex,
> so I also tried escaping, which raised an exception:
> : java.util.regex.PatternSyntaxException: Unknown inline modifier near
> index 21
>
> (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
>
>
> Can you please help here?
>
> Kind Regards,
> Sachit Murarka
>
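
A note on the exception quoted above: java.util.regex has no conditional
construct, so "(?(1)...)" is read as an inline flag group and rejected as an
"Unknown inline modifier". One possible workaround is to express the two
cases as plain alternation. The sketch below is an assumption, not the
original poster's pattern; it is checked with Python's re module only as a
stand-in and has not been run through Spark. The helper name extract_id and
the "OrderID" sample row are hypothetical.

```python
import re

# Conditional-free rewrite (an assumption, not the original pattern):
# branch 1 handles "[OrderID: ...] [UniqueID: ...]", branch 2 the plain form.
PATTERN = (
    r"^\[OrderID:\s.*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*"
    r"|\[.*\]\s\[([a-z0-9A-Z]*)\].*"
)


def extract_id(value):
    """Return the ID captured by whichever branch matched, else ""."""
    m = re.search(PATTERN, value)
    if m is None:
        return ""
    # Only one of the two capture groups can participate in a match.
    return m.group(1) or m.group(2) or ""


plain = extract_id("[1234] [3333] [4444] [66]")
tagged = extract_id("[OrderID: 1] [UniqueID: abc9]")  # hypothetical row
missing = extract_id("abcd")
```

With this shape the ID lands in group 1 or group 2 depending on the branch,
so a Spark version would likely need to read both groups. Note also that in
the original pattern group 1 is the "[OrderID: " prefix itself, not the ID,
so extracting group 1 would come back empty for rows without that prefix
even if the pattern compiled.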
