spark-user mailing list archives

From: ankurdave
Subject: Re: Benchmarking Graphx
Date: Tue, 20 May 2014 01:53:20 GMT
On May 17, 2014 at 2:59pm, Hari wrote:
> a) Is there a way to get the total time taken for the execution from
start to finish?
Assuming you're running the benchmark as a standalone program, such as by
invoking the Analytics driver, you could wrap the driver invocation with time:

/usr/bin/time -p ./bin/spark-submit ...
If you're using spark-shell, you could use System.currentTimeMillis.
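In spark-shell, a minimal timing sketch (the edge-list path and PageRank tolerance below are placeholders, and `sc` is the shell's built-in SparkContext):

```scala
// Sketch: time a GraphX job in spark-shell with System.currentTimeMillis.
// The edge-list path is hypothetical, not a real dataset location.
import org.apache.spark.graphx.GraphLoader

val start = System.currentTimeMillis()
val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt")
graph.pageRank(0.001).vertices.count()  // count() is an action that forces the job to run
println(s"Elapsed: ${System.currentTimeMillis() - start} ms")
```

Note that you need an action such as count() to force execution, since Spark transformations are lazy; timing a bare transformation would measure almost nothing.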
> b) The log4j properties need to be modified to turn off logging, but it's
not clear how.
Create conf/ by copying conf/ and changing the first line to:

log4j.rootCategory=WARN, console
> c) How can this be extended to a cluster?

It should work to invoke the driver on the cluster using spark-submit.
If you aren't using the Analytics driver, make sure to set the same Spark
properties it does (spark.serializer, spark.kryo.registrator, and so on).
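A sketch of setting those properties programmatically in your own driver (the registrator class name is assumed from the GraphX API of this era, and the app name is hypothetical):

```scala
// Sketch: mirror the Analytics driver's serializer settings.
// GraphKryoRegistrator is assumed from the GraphX 1.x API.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyGraphXBenchmark")  // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
val sc = new SparkContext(conf)
```

The same properties can also be passed on the command line with `spark-submit --conf key=value`, which avoids hard-coding them in the driver.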
> d) Also, how can I quantify memory overhead if I add more functionality
to the execution?
You can see how much memory each cached RDD is taking up by looking at the
Storage tab of the web UI.
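If you'd rather check from the driver than through the UI, SparkContext exposes per-RDD storage info (a sketch; memSize is reported in bytes):

```scala
// Sketch: print the in-memory size of each cached RDD.
// Assumes `sc` is an active SparkContext with some RDDs already cached.
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id} (${info.name}): ${info.memSize / (1024 * 1024)} MB cached")
}
```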
> e) Any scripts? Reports generated?

We don't have well-supported benchmark scripts for GraphX yet. Dan Crankshaw
has some personal-use scripts for setting up GraphX and competing graph
systems on a cluster and running some benchmarks. You could look at those
for ideas.
There are benchmarks from earlier this year in the GraphX arXiv paper.
These are on the soc-LiveJournal and twitter-2010 datasets.
