spark-dev mailing list archives

From jatinganhotra <>
Subject How can I access data on RDDs?
Date Tue, 06 Oct 2015 06:58:23 GMT
Consider the following 2 scenarios:

*Scenario #1*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count

*Scenario #2*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count
The total time shown in the Spark shell application UI was different for the
two scenarios: scenario #1 took 0.5 seconds, while scenario #2 took only 0.2
seconds.

1. I understand that scenario #1 takes more time because the RDD is
checkpointed (written to disk). Is there a way I can find out how much of the
total time was spent on checkpointing?  

The Spark shell application UI shows the following: scheduler delay, task
deserialization time, GC time, result serialization time, and getting-result
time. But it doesn't show a breakdown for checkpointing.  
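In the meantime, one rough workaround is to time the action yourself, with and
without checkpointing, and compare. A minimal sketch (the `timed` helper is my
own, not a Spark API):

```scala
// Sketch: time an arbitrary block and print the elapsed wall-clock time.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.1f ms")
  result
}

// Usage in spark-shell (assumes `pagecounts` from the scenarios above):
// timed("count, no checkpoint") { pagecounts.count() }
// timed("count with checkpoint") { pagecounts.checkpoint(); pagecounts.count() }
```

The difference between the two timings is only an approximation of the
checkpoint cost, since it also absorbs run-to-run variance and caching effects.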

2. Is there a way to access the above metrics, e.g. scheduler delay and GC
time, and save them programmatically? I want to log some of these metrics for
every action invoked on an RDD.  
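One way to get at per-task metrics programmatically is a `SparkListener`
registered on the SparkContext. A minimal sketch, assuming the Spark 1.x
scheduler listener API (field names should be checked against your Spark
version):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: print a few task-level metrics after every task finishes.
class MetricsLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be absent for failed tasks
      println(s"stage=${taskEnd.stageId} " +
        s"runTimeMs=${m.executorRunTime} " +
        s"gcTimeMs=${m.jvmGCTime} " +
        s"deserializeMs=${m.executorDeserializeTime} " +
        s"resultSerializeMs=${m.resultSerializationTime}")
    }
  }
}

// Register before running actions, e.g. in spark-shell:
// sc.addSparkListener(new MetricsLogger)
```

Scheduler delay itself is not a single field on `TaskMetrics`; the UI derives
it from the task's launch/finish times minus the other components, so it would
have to be computed the same way from `taskEnd.taskInfo`.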

3. How can I programmatically access the following information:  
- The size of an RDD when persisted to disk on checkpointing?  
- What percentage of an RDD is currently in memory?  
- The overall time taken to compute an RDD?  
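For the size and in-memory-percentage questions, `sc.getRDDStorageInfo`
exposes per-RDD storage numbers for *persisted* RDDs. A minimal sketch
(field names as in the Spark 1.x `RDDInfo` API; the size of checkpoint files
on disk is not covered here and would have to be read from the checkpoint
directory itself):

```scala
// Sketch: report the memory/disk footprint of every persisted RDD.
// Assumes a live SparkContext `sc`, e.g. inside spark-shell.
for (info <- sc.getRDDStorageInfo) {
  val cachedFraction = 100.0 * info.numCachedPartitions / info.numPartitions
  println(s"RDD ${info.id} (${info.name}): " +
    s"memSize=${info.memSize}B diskSize=${info.diskSize}B " +
    f"cached=$cachedFraction%.0f%% of partitions")
}
```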

Please let me know if you need more information.
