spark-user mailing list archives

From Bimal Chandra Naresh <bimal.nar...@zee.com>
Subject Delta Lake implementation issue with large data set
Date Tue, 22 Dec 2020 16:10:18 GMT
Hi All,

We have implemented Delta Lake in Spark/Scala, and it works fine for small data sets.
We need to update approximately 20M records daily in the Delta Lake table using Spark EMR jobs
(a 5-node cluster of m5.2xlarge machines). The jobs fail with a memory error.

Error: Terminated with errors: All slaves in the job flow were terminated


Source code:

import io.delta.tables.DeltaTable

val source_location_destination_data_lake = "s3://test/deltalake"

// Upsert the daily batch into the Delta table, keyed on id:
// update the matched rows, insert the unmatched ones.
DeltaTable.forPath(spark, source_location_destination_data_lake)
  .as("events")
  .merge(
    df_reader_update.as("updates"),
    "events.id = updates.id")
  .whenMatched
  .updateExpr(
    Map("action_name" -> "updates.action_name",
      "action_date" -> "updates.action_date"))
  .whenNotMatched
  .insertExpr(
    Map(
      "id" -> "updates.id",
      "package_name" -> "updates.package_name",
      "action_name" -> "updates.action_name",
      "action_date" -> "updates.action_date"))
  .execute()

Please suggest how we can better optimize this merge in Delta Lake.
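
One direction we are considering, based on the general Delta Lake merge guidance: if the table were partitioned by a date column (a hypothetical action_date partitioning here, our current layout may differ) and each daily batch only touches recent dates, adding a partition predicate to the merge condition should let Delta prune the untouched files instead of scanning and rewriting the whole table. A rough sketch of that variant:

import io.delta.tables.DeltaTable

// Hypothetical sketch: assumes the Delta table is partitioned by
// action_date and that each daily batch only touches the last 7 days.
// The extra predicate in the merge condition lets Delta prune
// partition files outside that window instead of scanning them all.
DeltaTable.forPath(spark, source_location_destination_data_lake)
  .as("events")
  .merge(
    df_reader_update.as("updates"),
    "events.id = updates.id AND events.action_date >= date_sub(current_date(), 7)")
  .whenMatched
  .updateExpr(
    Map("action_name" -> "updates.action_name",
      "action_date" -> "updates.action_date"))
  .whenNotMatched
  .insertExpr(
    Map(
      "id" -> "updates.id",
      "package_name" -> "updates.package_name",
      "action_name" -> "updates.action_name",
      "action_date" -> "updates.action_date"))
  .execute()

We would pair this with explicitly tuning spark.sql.shuffle.partitions and the executor memory settings for the job, but the partition-pruning predicate looks like the main lever. Does this approach make sense, or is there a better one?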

Thanks,
Bimal Naresh
Disclaimer: This e-mail and any documents, files, or previous e-mail messages appended or
attached to it may contain confidential and/or privileged information. If you are not the
intended recipient (or have received this e-mail in error) please notify the sender immediately
and delete this e-mail. Any unauthorized copying, disclosure, or distribution of the material
in this e-mail is strictly forbidden.