spark-user mailing list archives

From Pedro Rodriguez <ski.rodrig...@gmail.com>
Subject Re: Custom RDD: Report Size of Partition in Bytes to Spark
Date Mon, 04 Jul 2016 14:32:17 GMT
Just realized I had been replying only to Takeshi.

Thanks for the tip; it got me on the right track. I am running into an issue with private[spark]
methods, though. The input metrics start out as None and are not initialized (verified by
throwing a new Exception in the pattern match cases for both the None and non-None branches).
NewHadoopRDD calls getInputMetricsForReadMethod, which sets _inputMetrics if it is None, but
unfortunately that method is private[spark]. Is there a way for external RDDs to access this
method or otherwise initialize _inputMetrics in 1.6.x? (It looks like 2.0 makes more of this
API public.)

Using reflection, I was able to implement it by mimicking the NewHadoopRDD code, but I would
like to avoid reflection if possible. Below are links to the working code.

RDD code: https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/S3RDD.scala#L100-L105
Reflection code: https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/PrivateMethodUtil.scala
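
For reference, here is a minimal sketch (simpler than the PrivateMethodUtil linked above) of
calling that private[spark] method through plain Java reflection. The method name and signature
are taken from the Spark 1.6 source, so treat them as internal-API assumptions that can change
between releases:

import org.apache.spark.TaskContext
import org.apache.spark.executor.{DataReadMethod, InputMetrics}

// Sketch: fetch (and lazily initialize) the task's InputMetrics by reflectively
// calling the private[spark] TaskMetrics method that NewHadoopRDD uses in 1.6.
// private[spark] compiles to a public JVM method, so plain reflection finds it.
def inputMetricsFor(context: TaskContext): InputMetrics = {
  val taskMetrics = context.taskMetrics
  val method = taskMetrics.getClass.getDeclaredMethod(
    "getInputMetricsForReadMethod", classOf[Enumeration#Value])
  method.setAccessible(true)
  method.invoke(taskMetrics, DataReadMethod.Hadoop).asInstanceOf[InputMetrics]
}

Inside compute, the RDD can then call inputMetrics.incBytesRead(bytes) after each read, the
way NewHadoopRDD does.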

Thanks,
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 3, 2016 at 10:31:30 PM, Takeshi Yamamuro (linguin.m.s@gmail.com) wrote:

How about using `SparkListener`?
You can collect I/O statistics through TaskMetrics#inputMetrics yourself.
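
For example, a minimal sketch of that approach, assuming the Spark 1.6 listener API (the name
BytesReadListener is made up for illustration):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: aggregate bytes read across all finished tasks.
// In Spark 1.6, TaskMetrics#inputMetrics is an Option[InputMetrics].
class BytesReadListener extends SparkListener {
  val totalBytesRead = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    for {
      metrics <- Option(taskEnd.taskMetrics) // can be null if the task failed
      input   <- metrics.inputMetrics
    } totalBytesRead.addAndGet(input.bytesRead)
  }
}

// Register it before running the job:
//   sc.addSparkListener(new BytesReadListener)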

// maropu

On Mon, Jul 4, 2016 at 11:46 AM, Pedro Rodriguez <ski.rodriguez@gmail.com> wrote:
Hi All,

I noticed that some Spark jobs show input/output read sizes in the UI. I am implementing a
custom RDD which reads files, and I would like to report these metrics to Spark since they are
available to me.

I looked through the RDD source code and a couple of different implementations, and the best I
could find were some Hadoop metrics. Is there a way to simply report the number of bytes a
partition reads so that Spark can show it in the UI?

Thanks,
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn



--
---
Takeshi Yamamuro
