spark-user mailing list archives

From Great Info <gubt...@gmail.com>
Subject Handling very large volume (500 TB) of data using Spark
Date Sat, 25 Aug 2018 02:54:13 GMT
Hi All,
I have a large volume of data, nearly 500 TB (from 2016 through 2018, up to the present), and I
have to do some ETL on it.

This data is in AWS S3, so I am planning to use an AWS EMR setup to
process it, but I am not sure what configuration I should select.

1. Do I need to process the data month by month, or can I process it all at once? (See the rough sketch after this list for the month-by-month approach I have in mind.)
2. What should the master and slave (executor) memory be, both RAM and storage?
3. What kind of processor (speed) do I need?
4. How many slaves do we need?
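
For question 1, the month-by-month approach I am considering looks roughly like the sketch below. The bucket name, the year=/month= path layout, and the filter are only placeholders, since our real ETL logic is not written yet:

    // Minimal month-by-month ETL sketch, assuming the raw data is laid out in
    // S3 as year=YYYY/month=MM prefixes (bucket and paths are placeholders).
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object MonthlyEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("monthly-etl").getOrCreate()

        // Enumerate the months in the 2016-2018 range.
        val months = for {
          year  <- 2016 to 2018
          month <- 1 to 12
        } yield (year, month)

        months.foreach { case (year, month) =>
          val input  = f"s3://my-bucket/raw/year=$year/month=$month%02d/"
          val output = f"s3://my-bucket/processed/year=$year/month=$month%02d/"

          // Read one month of raw data, apply a placeholder transform, and
          // write the result back to S3, so each pass touches only one month.
          val df = spark.read.parquet(input)
          val result = df.filter(col("value").isNotNull)
          result.write.mode("overwrite").parquet(output)
        }

        spark.stop()
      }
    }

With something like this, the cluster only needs to be sized for one month of data at a time rather than the full 500 TB.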

Based on this I want to estimate the AWS EMR cost and start processing the
data.
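
For the cost part, I only intend a back-of-the-envelope estimate along these lines (every number below is a placeholder to be replaced with the actual on-demand prices for whatever instance type we pick):

    // Rough EMR cost estimate; all figures are placeholders, not real prices.
    object CostEstimate {
      def main(args: Array[String]): Unit = {
        val workers      = 50    // number of core/task nodes (placeholder)
        val hoursPerRun  = 24.0  // expected runtime of one ETL run (placeholder)
        val ec2PerNodeHr = 1.00  // EC2 on-demand price per node-hour (placeholder)
        val emrPerNodeHr = 0.25  // EMR surcharge per node-hour (placeholder)

        // EMR adds a per-instance surcharge on top of the EC2 price, so the
        // compute cost is roughly nodes * hours * (EC2 rate + EMR rate).
        val computeCost = workers * hoursPerRun * (ec2PerNodeHr + emrPerNodeHr)
        println(f"Estimated compute cost per run: $$$computeCost%.2f (S3 storage and requests not included)")
      }
    }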

Regards
Indra
