spark-user mailing list archives

From Manjunath Shetty H <manjunathshe...@live.com>
Subject Re: How to collect Spark dataframe write metrics
Date Wed, 04 Mar 2020 12:52:48 GMT
Thanks Zohar,

Will try that


-
Manjunath
________________________________
From: Zohar Stiro <zszohar89@gmail.com>
Sent: Tuesday, March 3, 2020 1:49 PM
To: Manjunath Shetty H <manjunathshetty@live.com>
Cc: user <user@spark.apache.org>
Subject: Re: How to collect Spark dataframe write metrics

Hi,

To get DataFrame-level write metrics you can take a look at the following trait:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala
and a basic implementation example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala

and here is an example of how it is being used in FileStreamSink:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L178
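If you do not want to depend on the internal WriteStatsTracker API, a simpler (if more
expensive) sanity check is to count the rows before the write and re-read the output
afterwards. A minimal sketch along those lines is below; the `WriteValidation` object,
the `writeAndValidate` helper, and `outputPath` are hypothetical names, and each
`count()` triggers an extra full scan of the data:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WriteValidation {
  // Pure comparison logic, kept separate from any Spark calls.
  def countsMatch(before: Long, after: Long): Boolean = before == after

  // Sketch: write `df` to `outputPath` as Parquet, then re-read it and
  // compare row counts. Assumes an active SparkSession.
  def writeAndValidate(spark: SparkSession, df: DataFrame, outputPath: String): Boolean = {
    val expected = df.count()                       // triggers a job over the input
    df.write.mode("overwrite").parquet(outputPath)  // the write under test
    val actual = spark.read.parquet(outputPath).count() // re-scan of the written files
    countsMatch(expected, actual)
  }
}
```

Note this only checks cardinality, not content, and the re-read roughly doubles the I/O
for the job, which is part of why a stats tracker is the cheaper option.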

- About whether it is good practice: it depends on your use case, but generally speaking I would not
do it, at least not for verifying your logic or checking that Spark is working correctly.

On Sun, 1 Mar 2020 at 14:32, Manjunath Shetty H <manjunathshetty@live.com<mailto:manjunathshetty@live.com>> wrote:
Hi all,

Basically my use case is to validate the DataFrame row count before and after writing to
HDFS. Is this even a good practice? Or should I rely on Spark for guaranteed writes?

If it is a good practice to follow, then how can I get the DataFrame-level write metrics?

Any pointers would be helpful.


Thanks and Regards
Manjunath