spark-user mailing list archives

From Sanjay Subramanian <>
Subject Re: Spark or MR, Scala or Java?
Date Sun, 23 Nov 2014 16:35:15 GMT
I am a newbie to Spark as well. Before this I did extensive Hadoop/Hive/Oozie programming,
and I use Hadoop (Java MR code)/Hive/Impala/Presto on a daily basis.
To jumpstart myself on Spark I started a GitHub repository with "IntelliJ-ready-to-run"
code (simple examples of joins, Spark SQL, etc.), and I will keep adding to it. I don't know
Scala, and I am learning it too so that I can use Spark better.

Philosophically speaking, it's possibly not a good idea to take an either/or approach to
technology. For example, it's never going to be either RDBMS or NoSQL. (If the Cassandra behind
FB shows fewer likes on your photo for a day, say 100 instead of 1000, you may not be that
upset... but if the Oracle/DB2 systems behind Wells Fargo show $100 LESS in your account due
to a database error, you will be PANIC-ing.)

So it's the same case with Spark or Hadoop. I can speak for myself. I have a use case for
processing old logs that are multiline (i.e. they have a [begin_timestamp_logid] and an
[end_timestamp_logid] with many lines in between). In Java Hadoop I created custom
RecordReaders to solve this. I still don't know how to do this in Spark. Until I do, I am
probably going to run the Hadoop code within Oozie in production.
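
The record-grouping step described above can be sketched in plain Java, independent of both
Hadoop and Spark. This is only an illustration under assumptions: the thread does not give the
real log format, so the marker prefixes "[begin_" and "[end_" and the class name
MultilineLogGrouper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class MultilineLogGrouper {
    // Assumed marker prefixes; the actual log format is not shown in the thread.
    static final String BEGIN_PREFIX = "[begin_";
    static final String END_PREFIX = "[end_";

    /** Joins each run of lines from a begin marker through the next end marker into one record. */
    static List<String> groupRecords(Iterator<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // null while we are outside a record
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith(BEGIN_PREFIX)) {
                // Start a new record; an unterminated previous record is discarded.
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append('\n').append(line);
                if (line.startsWith(END_PREFIX)) {
                    records.add(current.toString());
                    current = null;
                }
            }
            // Lines seen outside any record are ignored.
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "[begin_t1_id1]", "first line", "second line", "[end_t1_id1]",
                "[begin_t2_id2]", "only line", "[end_t2_id2]");
        for (String record : groupRecords(lines.iterator())) {
            System.out.println(record);
            System.out.println("----");
        }
    }
}
```

Because the method takes an Iterator, it could in principle be applied per partition (for
example from a mapPartitions call), but a record that straddles two partitions would be lost
that way. Spark can also read through existing Hadoop InputFormat classes (for example via
newAPIHadoopFile), which may allow a custom RecordReader to be reused rather than rewritten.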

Also, my current task is evangelizing Big Data at my company. The tech people I can educate
with Hadoop and Spark, and they will learn it, but not the business intelligence analysts.
They love SQL, so I have to educate them using Hive and Presto. So the question is: what
is your task (or tasks)?

Sorry, a long non-technical answer to your question...
Make sense?

From: Krishna Sankar <>
To: Sean Owen <>
Cc: Guillermo Ortiz <>; user <>
Sent: Saturday, November 22, 2014 4:53 PM
Subject: Re: Spark or MR, Scala or Java?

Adding to the already interesting answers:

   - "Is there any case where MR is better than Spark? I don't know in which cases I should
     use Spark instead of MR. When is MR faster than Spark?"
      - Many. MR would be better (am not saying faster ;o)) for
         - Very large datasets,
         - Multistage map-reduce flows,
         - Complex map-reduce semantics
      - Spark is definitely better for the classic iterative, interactive workloads.
      - Spark is very effective for implementing the concepts of in-memory datasets &
        real-time analytics
         - Take a look at the Lambda architecture
      - Also check out how Ooyala is using Spark in multiple layers & configurations. They
        also have MR in many places
      - In our case, we found Spark very effective for ELT - we would have used MR earlier

   - "I know Java, is it worth learning Scala for programming Spark, or is it okay to use
     just Java?"
      - Java will work fine. Especially when Java 8 becomes the norm, we will get back some
        of the elegance
      - I, personally, like Scala & Python a lot better than Java. Scala is a lot more
        elegant, but compilation, IDE integration et al. are still clunky
      - One word of caution - stick with one language as much as possible - shuffling between
        Java & Scala is not fun
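
The remark about Java 8 bringing back some of the elegance can be illustrated with a small
sketch that uses only the JDK, not Spark. It writes the same one-method function twice, once
as an anonymous inner class (the style needed before Java 8) and once as a method reference.
The class and method names are hypothetical; the assumption is that function-style interfaces
in a library's Java API can be implemented the same way.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class LambdaElegance {
    // Anonymous inner class: the verbose style used before Java 8.
    static final Function<String, Integer> LENGTH_VERBOSE = new Function<String, Integer>() {
        @Override
        public Integer apply(String s) {
            return s.length();
        }
    };

    // Java 8 style: the same function written as a method reference.
    static final Function<String, Integer> LENGTH_CONCISE = String::length;

    /** Applies f to every word and collects the results in order. */
    static List<Integer> mapAll(List<String> words, Function<String, Integer> f) {
        return words.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "mr", "scala");
        System.out.println(mapAll(words, LENGTH_VERBOSE)); // prints [5, 2, 5]
        System.out.println(mapAll(words, LENGTH_CONCISE)); // prints [5, 2, 5]
    }
}
```

Both versions behave identically; the difference is only in how much ceremony surrounds the
one line of logic.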
Cheers & HTH<k/>

On Sat, Nov 22, 2014 at 8:26 AM, Sean Owen <> wrote:

MapReduce is simpler and narrower, which also means it is generally lighter weight, with less
to know and configure, and runs more predictably. If you have a job that is truly just a few
maps, with maybe one reduce, MR will likely be more efficient. Until recently its shuffle
has been more developed, and it offers some semantics the Spark shuffle does not.

I suppose it also integrates with tools, like Oozie, that Spark does not.

I suggest learning enough Scala to use Spark in Scala. The amount you need to know is not
large.

(Mahout MR-based implementations do not run on Spark and will not. They have been removed
instead.)

On Nov 22, 2014 3:36 PM, "Guillermo Ortiz" <> wrote:


I'm a newbie with Spark but I've been working with Hadoop for a while.
I have two questions.

Is there any case where MR is better than Spark? I don't know in which
cases I should use Spark instead of MR. When is MR faster than Spark?

The other question is: I know Java, so is it worth learning Scala for
programming Spark, or is it okay to use just Java? I have written a small
piece of code in Java because I feel more confident with it, but it
seems that I'm missing something.
