spark-user mailing list archives

From Femi Anthony <olufemi.anth...@capitalone.com>
Subject Re: [External Sender] How to debug Spark job
Date Fri, 07 Sep 2018 10:32:02 GMT
One way I would go about this would be to try running newdf.show(numRows,
truncate=False) on a few rows before you try writing to parquet, to force
computation of newdf and see whether the hang is occurring at that point
or during the write. You may also try doing a newdf.count() as well.
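
To make that concrete, a sketch in Scala (matching James's snippet below; it assumes newdf is the Dataset built in the original job, so this is illustrative, not tested against his data):

```scala
// Force full computation of newdf without writing anything.
// count() triggers the whole upstream chain: jdbc read, map, filter.
val n = newdf.count()
println(s"row count: $n")

// Inspect a handful of rows, untruncated, to sanity-check the extracted field.
newdf.show(20, truncate = false)   // 20 rows, full column contents

// If both of these return promptly, the hang is likely in the parquet write;
// if count() itself hangs, look at the jdbc read or the map/filter stage.
```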

Femi

On Fri, Sep 7, 2018 at 5:48 AM James Starks <suserft@protonmail.com.invalid>
wrote:

>
> I have a Spark job that reads from a postgresql (v9.5) table and writes the
> result to parquet. The code flow is not complicated, basically
>
>     case class MyCaseClass(field1: String, field2: String)
>     val df = spark.read.format("jdbc")...load()
>     df.createOrReplaceTempView(...)
>     val newdf = spark.sql("select field1, field2 from
> mytable").as[MyCaseClass].map { row =>
>       val fieldX = ... // extract something from field2
>       (field1, fieldX)
>     }.filter { ... /* filter out rows where field3 isn't valid */ }
>     newdf.write.mode(...).parquet(destPath)
>
> This job worked correctly without a problem. But it doesn't seem to work
> (the job appears to hang) after adding more fields. The refactored job
> looks as below
>     ...
>     val newdf = spark.sql("select field1, field2, ... fieldN from
> mytable").as[MyCaseClassWithMoreFields].map { row =>
>         ...
>         NewCaseClassWithMoreFields(...) // all fields plus fieldX
>     }.filter { ... }
>     newdf.write.mode(...).parquet(destPath)
>
> Basically what the job does is extract some info from one of the fields in
> the db table, append that newly extracted field to the original row, and
> then dump the whole new table to parquet.
>
>     new field + (original field1 + ... + original fieldN)
>     ...
>     ...
>
> The records loaded by Spark SQL into the job were around 8MM before the
> refactoring, and this remains the same, but when the refactored job runs,
> it looks like it hangs there without making progress. The only output on
> the console (there is no crash, no exceptions thrown) is
>
>     WARN  HeartbeatReceiver:66 - Removing executor driver with no recent
> heartbeats: 137128 ms exceeds timeout 120000 ms
>
> Memory usage in top looks like
>
>     VIRT     RES     SHR        %CPU %MEM
>     15.866g 8.001g  41.4m     740.3   25.6
>
> The command used to submit the spark job is
>
>     spark-submit --class ... --master local[*] --driver-memory 10g
> --executor-memory 10g ... --files ... --driver-class-path ... <jar file>
> ...
>
> How can I debug or check which part of my code might be causing the
> problem (so I can improve it)?
>
> Thanks
>
>
>