spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Query related to spark cluster
Date Mon, 30 May 2016 06:42:21 GMT

Well, if you require R then you need to install it (including all additional packages) on each
node. I am not sure why you store the data in Postgres. Storing it in HDFS as Parquet or ORC
(sorted on the relevant columns) is sufficient, and you can use the SparkR libraries to access
it directly.
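A minimal sketch of what reading the Parquet files from HDFS directly with SparkR could look like, skipping the Postgres-XL hop entirely. The master URL, HDFS path, and column names (`device`, `value`) are placeholders, not part of the original setup; the API shown is the SparkR interface of the Spark 1.x era this thread dates from:

```r
# Hedged sketch: access Parquet data in HDFS from R via SparkR,
# instead of routing it through Postgres-XL. Paths/columns are examples.
library(SparkR)

sc <- sparkR.init(master = "spark://master:7077")   # placeholder master URL
sqlContext <- sparkRSQL.init(sc)

# Read the Parquet files that the Spark ETL job wrote to HDFS
df <- read.df(sqlContext, "hdfs:///data/processed/events.parquet",
              source = "parquet")

# Filter and aggregate in a distributed fashion on the cluster;
# only the small aggregate result is collected back to the local R session
agg <- agg(groupBy(df, df$device), total = sum(df$value))
localDF <- collect(agg)

sparkR.stop()
```

This keeps the heavy lifting on the Spark executors and avoids maintaining a second distributed storage layer just to feed R.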

> On 30 May 2016, at 08:38, Kumar, Saurabh 5. (Nokia - IN/Bangalore) <saurabh.5.kumar@nokia.com> wrote:
> 
> Hi Team,
>  
> I am using Apache Spark to build a scalable analytics engine. My setup is as follows.
>  
> Flow of processing is as follows:
>  
> Raw files > store to HDFS > process with Spark and store to Postgres-XL database >
R processes data from Postgres-XL in distributed mode.
>  
> I have a 6-node cluster set up for ETL operations, which has:
>  
> Spark slaves installed on all 6 nodes.
> HDFS data nodes on each of the 6 nodes, with replication factor 2.
> Postgres-XL 9.5 database coordinator on each of the 6 nodes.
> R installed on all nodes, processing data from Postgres-XL in a distributed
manner.
>  
> Can you please guide me on the pros and cons of this setup?
> Is installing all components on every machine recommended, or are there drawbacks?
> Should R run on the Spark cluster?
>  
> Thanks & Regards
> Saurabh Kumar
> R&D Engineer, T&I TED Technology Exploration & Disruption
> Nokia Networks
> L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
> Mobile: +91-8861012418
> http://networks.nokia.com/
>  
