spark-user mailing list archives

From Dhrubajyoti Hati <dhruba.w...@gmail.com>
Subject Collections passed from driver to executors
Date Fri, 20 Sep 2019 06:22:49 GMT
Hi,

I have a question about passing a dictionary from the driver to the executors
in Spark on YARN. The dictionary is needed inside a UDF, and I am using PySpark.

As I understand it, this can be done in two ways:

1. Broadcast the variable and then use it inside the UDF.

2. Build the UDF from a closure that captures the dictionary, something like
this:

  from pyspark.sql.functions import udf

  def udf1(col1, d):
      ...

  def udf1_fn(d):
      # d is captured in the lambda's closure
      return udf(lambda col_data: udf1(col_data, d))

  df.withColumn("column_new", udf1_fn(d)("old_column"))
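Stripped of Spark, the closure mechanism in option 2 looks like this in plain
Python (names here are illustrative; in PySpark the returned function would
additionally be wrapped with pyspark.sql.functions.udf):

```python
# Plain-Python sketch of the closure pattern in option 2.
# In PySpark, everything the inner function captures (here the whole
# `lookup` dict) is pickled together with the UDF and shipped out
# with it, rather than being distributed separately as a broadcast.

def make_mapper(lookup):
    def map_value(col_data):
        # `lookup` is read from the enclosing scope (a closure)
        return lookup.get(col_data, "unknown")
    return map_value

mapper = make_mapper({"a": 1, "b": 2})
print(mapper("a"))    # prints 1
print(mapper("zzz"))  # prints unknown
```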

I have tested both approaches and both work.

Now I am wondering what is fundamentally different between the two. I
understand how broadcast works, but I am not sure how the data is passed
across in the second way. Is the dictionary shipped to each executor every
time a new task runs on that executor, or is it shipped only once? Also, how
does the data reach the Python processes? These are Python UDFs, so I think
they are executed natively in Python (please correct me if I am wrong), which
means the data must be serialised and sent to the Python workers.

So, in summary: which of the two is the better/more efficient way to write
this, and why?

Thank you!

Regards,
Dhrub
