I was wondering if anyone could help with this question.

On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, <dhruba.work@gmail.com> wrote:

I have a question regarding passing a dictionary from the driver to the executors in Spark on YARN. The dictionary is needed inside a UDF. I am using PySpark.

As I understand it, this can be done in two ways:

1. Broadcast the variable and then use it in the udfs

2. Pass the dictionary to the UDF itself, with something like this:

  from pyspark.sql.functions import udf

  # calling the argument `d` since `dict` shadows the built-in type
  def udf1(col1, d):
    return d.get(col1)

  def udf1_fn(d):
    return udf(lambda col_data: udf1(col_data, d))

  df.withColumn("column_new", udf1_fn(my_dict)("old_column"))
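To be clear about what I mean in option 2, the same capture mechanism can be shown without Spark at all. This is just a plain-Python sketch of the closure pattern (the names are mine, not from any Spark API):

```python
def make_mapper(d):
    # The returned function closes over `d`; whatever serializer ships
    # this function to the workers must also ship the captured dict.
    return lambda value: d.get(value, "unknown")

mapper = make_mapper({"a": 1, "b": 2})
print(mapper("a"))  # -> 1
print(mapper("c"))  # -> unknown
```

In Spark, as far as I know, it is cloudpickle that serializes this closure (dictionary included) as part of the task.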

I have tested both approaches and both work.

Now I am wondering what is fundamentally different between the two. I understand how broadcast works, but I am not sure how the data is passed across in the second way. Is the dictionary sent to each executor every time a new task runs on that executor, or is it sent only once? Also, how is the data passed to the Python processes? These are Python UDFs, so I think they are executed natively in Python (please correct me if I am wrong), which means the data would have to be serialized and passed to Python.
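To make the serialization concern concrete: in option 2 the dictionary becomes part of the pickled closure, so the serialized payload grows with the size of the dictionary. A rough stdlib-only illustration (this is not Spark's exact wire format, which uses cloudpickle, just a sketch of how the payload scales):

```python
import pickle

# Hypothetical lookup dicts of the kind the UDF would capture
small_dict = {i: str(i) for i in range(10)}
big_dict = {i: str(i) for i in range(100_000)}

# The bytes that travel with the closure grow with the dict:
print(len(pickle.dumps(small_dict)))
print(len(pickle.dumps(big_dict)))
```

My understanding is that a broadcast variable avoids repeating this payload per task, since executors fetch and cache it once, but that is exactly what I would like confirmed.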

So, in summary, my question is: which is the better/more efficient way to write this, and why?

Thank you!