spark-dev mailing list archives

From Ewan Higgs <ewan.hi...@ugent.be>
Subject Re: Terasort example
Date Tue, 11 Nov 2014 20:51:59 GMT
Shall I move the code to spark-perf and submit a PR there? Or shall I 
submit a PR to Spark, where it can remain an idiomatic example, and 
clone it into spark-perf, where it can evolve non-idiomatic 
optimizations?

Yours,
Ewan

On 11/11/2014 07:58 PM, Reynold Xin wrote:
> This is great. I think the consensus from last time was that we would 
> put performance stuff into spark-perf, so it is easy to test different 
> Spark versions.
>
>
> On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.higgs@ugent.be 
> <mailto:ewan.higgs@ugent.be>> wrote:
>
>     Hi all,
>     I saw that Reynold Xin had a Terasort example PR on Github[1]. It
>     didn't appear to be similar to the Hadoop Terasort example, so
>     I've tried to brush it into shape so it can generate Terasort
>     files (teragen), sort the files (terasort) and validate the files
>     (teravalidate). My branch is available here:
>
>     https://github.com/ehiggs/spark/tree/terasort
>
>     With this code, you can run the following:
>
>     # Generate 1M 100 byte records:
>      ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
>
>     # Sort the file:
>     MASTER=local[4] ./bin/run-example terasort.TeraSort
>     ~/data/terasort_in  ~/data/terasort_out
>
>     # Validate the file
>     MASTER=local[4] ./bin/run-example terasort.TeraValidate
>     ~/data/terasort_out  ~/data/terasort_validate
>
>     # Validate that an unsorted file is indeed not correctly sorted:
>
>     MASTER=local[4] ./bin/run-example terasort.TeraValidate
>     ~/data/terasort_in  ~/data/terasort_validate_bad
>
>     This matches the interface for the Hadoop version of Terasort,
>     except I added the ability to use K,M,G,T for record sizes in
>     TeraGen. This code therefore makes a good example of how to use
>     Spark, how to read and write Hadoop files, and also a way to test
>     some of the performance claims of Spark.
>
>     > That's great, but why is this on the mailing list and not
>     submitted as a PR?
>
>     I suspect there are some rough edges and I'd really appreciate
>     reviews. I would also like to know if others can try it out on
>     clusters and tell me if it's performing as it should.
>
>     For example, I find it runs fine on my local machine, but when I
>     try to sort 100G of data on a cluster of 16 nodes, I get >2900
>     file splits. This really eats into the sort time.
>
>     Another issue is that in TeraValidate, to work around SPARK-1018 I
>     had to clone each element. Does this /really/ need to be done?
>     It's pretty lame.
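For anyone unfamiliar with SPARK-1018: Hadoop RecordReaders reuse a single mutable Writable per split, so without copying, every retained reference ends up aliasing the last record read. The following pure-Scala sketch (hypothetical `Record` type, not the actual terasort code) mimics that reuse to show why the clone is needed:

```scala
// Sketch of the object-reuse hazard behind SPARK-1018 (assumed mechanics):
// a reader overwrites one buffer in place, so retaining it without a copy
// makes every collected element point at the final value.
import scala.collection.mutable.ArrayBuffer

case class Record(var value: Int)

def readAll(n: Int, clone: Boolean): Seq[Record] = {
  val reused = Record(0)                 // the single buffer a reader reuses
  val out = ArrayBuffer[Record]()
  for (i <- 1 to n) {
    reused.value = i                     // "reader" overwrites the same object
    out += (if (clone) reused.copy() else reused)
  }
  out.toSeq
}
```

Without cloning, `readAll(3, clone = false).map(_.value)` is `Seq(3, 3, 3)`; with cloning it is the expected `Seq(1, 2, 3)`.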
>
>     In any event, I know the Spark 1.2 merge window closed on Friday
>     but as this is only for the examples directory maybe we can slip
>     it in if we can bash it into shape quickly enough?
>
>     Anyway, thanks to everyone on #apache-spark and #scala who helped
>     me get through learning some rudimentary Scala to get this far.
>
>     Yours,
>     Ewan Higgs
>
>     [1] https://github.com/apache/spark/pull/1242
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>     <mailto:dev-unsubscribe@spark.apache.org>
>     For additional commands, e-mail: dev-help@spark.apache.org
>     <mailto:dev-help@spark.apache.org>
>
>

