spark-user mailing list archives

From Adrian Tanase <>
Subject Re: Whether Spark is appropriate for our use case.
Date Wed, 21 Oct 2015 06:28:27 GMT
Can you share your approximate data size? All of these should be valid use cases for Spark; I'm wondering whether you are providing enough resources.

Also - do you have specific expectations in terms of performance? What does "slow down" mean concretely?

For this use case I would personally favor Parquet over a DB, and SQL/DataFrames over regular
Spark RDDs, as you get benefits such as predicate pushdown.
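To illustrate why the Parquet + DataFrames route helps: Spark writes date-partitioned data as one directory per day, and a filter on the partition column means only the matching directories are read at all. The sketch below is a pure-Python stand-in for that pruning behavior (JSON files stand in for Parquet; all paths and field names are invented), not actual Spark API:

```python
import os
import tempfile
import json

# Hive-style layout, as Spark produces with
# df.write.partitionBy("date").parquet(path) -- here JSON stands in for Parquet.
root = tempfile.mkdtemp()
for day in ("2015-10-19", "2015-10-20", "2015-10-21"):
    d = os.path.join(root, f"date={day}")
    os.makedirs(d)
    with open(os.path.join(d, "part-0.json"), "w") as f:
        json.dump([{"user": "u1", "type": 20}], f)

def scan(root, date_filter=None):
    """Read only partitions whose directory value passes the filter --
    the essence of partition pruning / predicate pushdown."""
    read = []
    for name in sorted(os.listdir(root)):
        value = name.split("=", 1)[1]
        if date_filter is None or value == date_filter:
            read.append(name)
    return read

print(scan(root))                # full scan: all 3 partitions touched
print(scan(root, "2015-10-20"))  # pruned: only 1 partition touched
```

With this layout, a query filtered on one day stays fast regardless of how many days of history accumulate, because the other partitions are never opened.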

Sent from my iPhone

> On 21 Oct 2015, at 00:29, Aliaksei Tsyvunchyk <> wrote:
> Hello all community members,
> I need the opinion of people who have used Spark before and can share their experience, to
help me select a technical approach.
> I have a project in the Proof of Concept phase, where we are evaluating the possibility of
using Spark for our use case.
> Here is brief task description.
> We need to process a large amount of raw data to calculate ratings. We have different types
of textual source data: plain text lines representing different types of input data
(we call them type 20, type 24, type 26, type 33, etc.).
> To perform the calculations we must join different types of raw data: event
records (which represent actual user actions) with user description records (which describe the
person performing the action), and sometimes with a userGroup record (we group all users by some
criteria).
> All ratings are calculated on a daily basis, and our dataset can be partitioned by date
(except, probably, the reference data).
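The join described above (events against user reference data) can be sketched in miniature like this; the record shapes and column names are invented for illustration:

```python
# Toy records standing in for the parsed "type 20 / type 24" text lines.
events = [  # one record per user action; partitioned by date in reality
    {"user_id": 1, "type": 20, "date": "2015-10-20"},
    {"user_id": 2, "type": 24, "date": "2015-10-20"},
]
users = [  # reference data describing the person performing the action
    {"user_id": 1, "group": "A"},
    {"user_id": 2, "group": "B"},
]

# Equivalent of: SELECT e.*, u.group FROM events e JOIN users u USING (user_id)
by_id = {u["user_id"]: u for u in users}
joined = [{**e, "group": by_id[e["user_id"]]["group"]} for e in events]
print(joined[0]["group"])  # the group of the user behind the first event
```

Because the reference side is small relative to the daily event data, this shape of join is a natural candidate for a broadcast/map-side join in Spark.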
> So we tried to implement it in what is probably the most obvious way: we parse the text files,
store the data in Parquet format, and use Spark SQL to query the data and perform the calculations.
> Experimenting with Spark SQL, I've noticed that query speed decreases proportionally
to data size growth. Based on this I assume that Spark SQL performs a full record scan while
servicing my SQL queries.
> So here are the questions I'm trying to find answers to:
> 1.  Is the Parquet format appropriate for storing data in our case (to query the data
efficiently)? Or would it be more suitable to have some DB as storage that could filter data
efficiently before it reaches the Spark processing engine?
> 2.  For now we assume that the joins we perform for the calculations are slowing down execution.
As an alternative, we are considering denormalizing the data and joining it at parsing time, but
this increases the data volume Spark must handle (due to the duplicates we introduce). Is that a
valid approach? Would it be better to create two RDDs from the Parquet files, filter them, and
then join them without Spark SQL involvement? Or are joins in Spark SQL fine, and should we look
for the performance bottleneck elsewhere?
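The duplication cost of the denormalization mentioned in question 2 is easy to estimate up front. A rough back-of-envelope sketch (all record sizes and counts below are invented, purely to show the arithmetic):

```python
# Assumed data shape -- illustrative numbers only, not from the original post.
events_per_day = 10_000_000
event_record_bytes = 100
user_record_bytes = 200  # description fields copied onto each event when denormalized

# Normalized: user records stored once; their total size is negligible here.
normalized = events_per_day * event_record_bytes
# Denormalized: every event carries a full copy of its user's description.
denormalized = events_per_day * (event_record_bytes + user_record_bytes)

print(denormalized / normalized)  # growth factor of the event data volume
```

Under these (made-up) numbers the event data triples, so the trade-off is real: denormalization removes the join but multiplies scan volume, which matters most if queries are already I/O-bound.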
> 3.  Should we look more closely at Cloudera Impala? As far as I know it works over the same
Parquet files, and I'm wondering whether it gives better performance for querying the data.
> 4.  90% of the results we need can be pre-calculated, since they do not change after a
day's data is loaded. So I think it makes sense to keep this pre-calculated data in some
DB storage that gives the best performance when querying by key. I'm currently considering
Cassandra for this purpose due to its scalability and performance. Could somebody
suggest any other options we could consider?
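The pre-computation idea in question 4 amounts to a daily batch job that aggregates closed days once and writes the results keyed by (date, user) into a key-value store, so the serving path never touches raw data. A toy in-memory version (a plain dict stands in for Cassandra or any other KV store; the "score" field and numbers are invented):

```python
from collections import defaultdict

# One day's events after parsing; "score" is a made-up rating input.
events = [
    {"date": "2015-10-20", "user_id": 1, "score": 3},
    {"date": "2015-10-20", "user_id": 1, "score": 4},
    {"date": "2015-10-20", "user_id": 2, "score": 5},
]

# Daily batch: aggregate once -- safe because closed days never change.
store = defaultdict(int)  # stands in for a table keyed by (date, user_id)
for e in events:
    store[(e["date"], e["user_id"])] += e["score"]

# Serving path: a single keyed lookup instead of re-scanning raw data.
print(store[("2015-10-20", 1)])  # pre-computed daily rating for user 1
```

This split (Spark for the daily batch, a keyed store for reads) is a common pattern; the main design question is choosing a partition key that matches the lookup pattern.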
> Thanks in advance,
> Any opinion will be helpful and greatly appreciated.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

