spark-user mailing list archives

From Mridul Muralidharan <mri...@gmail.com>
Subject Re: Reading from and writing to different S3 buckets in spark
Date Wed, 12 Oct 2016 19:39:11 GMT
If you are using RDDs, you can call saveAsHadoopFile or saveAsNewAPIHadoopFile
and pass in a conf that overrides the keys you need.
For example:

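import org.apache.hadoop.conf.Configuration

// Copy the context's Hadoop configuration so the overrides apply only to this save
val saveConf = new Configuration(sc.hadoopConfiguration)
// configure saveConf with overridden s3 config
rdd.saveAsNewAPIHadoopFile(..., conf = saveConf)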
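
To make that more concrete, here is a minimal sketch of the full call, assuming an RDD[String] of CSV lines and the same fs.s3n keys used in the question. The object name, the helper name saveWithCredentials and its parameters are hypothetical, and writing each line as a (NullWritable, Text) pair through TextOutputFormat is just one reasonable choice, not the only one:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.rdd.RDD

object WriteToSecondBucket {
  // Hypothetical helper: writes `lines` to `outputPath` using credentials
  // that differ from the ones set on the SparkContext.
  def saveWithCredentials(
      lines: RDD[String],
      outputPath: String, // assumed to be an s3n:// path in the output bucket
      accessKey: String,
      secretKey: String): Unit = {
    // Copy the shared configuration; the copy is the only thing modified,
    // so the lazily evaluated read still sees the original credentials.
    val saveConf = new Configuration(lines.sparkContext.hadoopConfiguration)
    saveConf.set("fs.s3n.awsAccessKeyId", accessKey)
    saveConf.set("fs.s3n.awsSecretAccessKey", secretKey)

    lines
      // TextOutputFormat omits a NullWritable key and writes only the value
      .map(line => (NullWritable.get(), new Text(line)))
      .saveAsNewAPIHadoopFile(
        outputPath,
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]],
        saveConf)
  }
}
```

Because sc.hadoopConfiguration itself is never changed, the order in which Spark evaluates the read and the write should no longer matter.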



Regards,
Mridul


On Wed, Oct 12, 2016 at 2:49 AM, Aseem Bansal <asmbansal2@gmail.com> wrote:
> Hi
>
> I want to read CSV from one bucket, do some processing and write to a
> different bucket. I know the way to set S3 credentials using
>
> jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
> jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
>
> But the problem is that Spark is lazy. So if I do the following
>
> set credentials 1
> read input csv
> do some processing
> set credentials 2
> write result csv
>
> Then there is a chance that, due to laziness, the program may try to use
> credentials 2 while reading the input csv.
>
> A solution is to cache the result csv, but if there is not enough
> storage it is possible that the csv will be re-read. So how should I
> handle this situation?
