spark-user mailing list archives

From Matt Massie <mas...@berkeley.edu>
Subject Re: Quality of documentation (rant)
Date Sun, 19 Jan 2014 17:06:10 GMT
Debasish-

Just wanted to let you know that using Parquet with Spark is really easy to
do. Take a look at
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ for an example.
Parquet provides a Hadoop InputFormat for reading data and includes support
for predicate pushdown and column projection.
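
For reference, here is a minimal sketch in the spirit of that post (the
Avro-generated Event class and the HDFS path are placeholders; swap in your
own record type and location):

  import org.apache.hadoop.mapreduce.Job
  import org.apache.spark.SparkContext
  import parquet.avro.AvroReadSupport
  import parquet.hadoop.ParquetInputFormat

  val sc = new SparkContext("local", "parquet-read")
  val job = new Job()

  // Tell Parquet how to materialize records -- Avro in this sketch.
  ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Event]])

  // Projection and predicate pushdown are configured on the same Job, e.g.
  //   AvroParquetInputFormat.setRequestedProjection(job, projectionSchema)
  //   ParquetInputFormat.setUnboundRecordFilter(job, classOf[MyRecordFilter])

  val events = sc.newAPIHadoopFile(
    "hdfs:///data/events.parquet",        // placeholder path
    classOf[ParquetInputFormat[Event]],
    classOf[Void],                        // Parquet keys are always Void
    classOf[Event],
    job.getConfiguration)

  events.map(_._2).take(5).foreach(println)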



--
Matt Massie
UC Berkeley AMPLab
Twitter: @matt_massie <https://twitter.com/matt_massie>,
@amplab<https://twitter.com/amplab>
https://amplab.cs.berkeley.edu/


On Sun, Jan 19, 2014 at 8:41 AM, Debasish Das <debasish.das83@gmail.com> wrote:

> Hi Ognen,
>
> We have been running HDFS, YARN, and Spark on 20 beefy nodes. I give half
> of the cores to Spark and use the rest for YARN MR. To optimize the network
> transfer during RDD creation, it is better to have Spark run on all of the
> HDFS nodes.
>
> For preprocessing the data for the algorithms, I use a YARN MR app, since
> the input data can be stored in various formats that Spark does not support
> yet (things like Parquet) but that platform people like for reasons such as
> data compression. Once the preprocessor saves the data on HDFS as a text
> file or sequence file, Spark gives you orders of magnitude better runtime
> than the YARN algorithm.
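>
> A rough sketch of that hand-off on the Spark side (the paths and Writable
> types below are placeholders for whatever the preprocessor actually writes):
>
>   // assuming an existing SparkContext named sc
>   import org.apache.hadoop.io.Text
>
>   // plain text output from the MR preprocessor
>   val lines = sc.textFile("hdfs:///data/prep/part-*")
>
>   // or a sequence file of (Text, Text) pairs
>   val pairs = sc.sequenceFile("hdfs:///data/prep-seq",
>       classOf[Text], classOf[Text])
>     .map { case (k, v) => (k.toString, v.toString) }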
>
> I have benchmarked ALS and could run the dataset in 14 minutes for 10
> iterations, while the scalable ALS algorithm from Cloudera Oryx ran 6
> iterations in an hour. Note that they are supposedly implementing the same
> ALS paper. On the same dataset, Mahout ALS fails as it needs more memory
> than the 6 GB that YARN uses by default. I still have to look at the
> results in more detail, and at the code, to be sure what they are doing.
>
> Note that Mahout algorithms are not optimized for YARN yet, and the Mahout
> master branch is broken on YARN. Thanks to Cloudera's help, we could patch
> it up. The number of YARN algorithms is not very high right now.
>
> CDH 5.0 is integrating Spark with their CDH manager, similar to what they
> did with Solr. It should be released by March 2014; they have the beta
> already. It will definitely ease the process of making Spark operational.
>
> I have not tested my setup on EC2 (it runs on an internal Hadoop cluster),
> but for that I will most likely use the CDH manager from the CDH 5 beta. I
> will update you with more on the EC2 experience.
>
> Thanks.
> Deb
> On Jan 19, 2014 6:53 AM, "Ognen Duzlevski" <ognen@nengoiksvelzud.com>
> wrote:
>
>> On Sun, Jan 19, 2014 at 2:49 PM, Ognen Duzlevski <
>> ognen@nengoiksvelzud.com> wrote:
>>
>>>
>>> My basic requirement is to set everything up myself and understand it.
>>> For testing purposes my cluster has 15 xlarge instances, and I guess I will
>>> just set up a Hadoop cluster running over these instances for the purpose
>>> of getting the benefits of HDFS. I would then set up HDFS over S3 with
>>> blocks.
>>>
>>
>> By this I mean I would set up a Hadoop cluster running in parallel on the
>> same instances just for the purpose of running Spark over HDFS. Is this a
>> reasonable approach? What kind of performance penalty (memory, CPU
>> cycles) am I going to incur from the Hadoop daemons running just for this
>> purpose?
>>
>> Thanks!
>> Ognen
>>
>
