spark-user mailing list archives

From Manohar Reddy <Manohar.Re...@happiestminds.com>
Subject RE: Spark Read from Google store and save in AWS s3
Date Thu, 05 Jan 2017 20:07:00 GMT
Hi Steve,
Thanks for the reply and below is follow-up help needed from you.
Do you mean we can set up two native file systems on a single SparkContext, so that, based on the URL prefixes (gs://bucket/path and s3a://bucket-on-s3/path2), Spark will identify the appropriate cloud and read from / write to it?

Is my understanding right?

Manohar
From: Steve Loughran [mailto:stevel@hortonworks.com]
Sent: Thursday, January 5, 2017 11:05 PM
To: Manohar Reddy
Cc: user@spark.apache.org
Subject: Re: Spark Read from Google store and save in AWS s3


On 5 Jan 2017, at 09:58, Manohar753 <manohar.reddy@happiestminds.com> wrote:

Hi All,

Is interoperability between two clouds (Google, AWS) possible using Spark?
In my use case I need to take Google storage as input to Spark, do some
processing, and finally store the results in S3; my Spark engine runs on an
AWS cluster.

Please let me know whether there is any way to handle this kind of use case
directly with Spark, without any middle components, and share any info or
links you have.

Thanks,

I've not played with GCS, and have some noted concerns about test coverage ( https://github.com/GoogleCloudPlatform/bigdata-interop/pull/40
) , but assuming you are not hitting any specific problems, it should be a matter of having
the input as gs://bucket/path and the dest as s3a://bucket-on-s3/path2
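A minimal sketch of what that looks like in one job (the bucket names and the filter step are hypothetical; this assumes the GCS connector and the s3a filesystem are both on the classpath and configured):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One SparkContext can address both clouds: the filesystem implementation
// is chosen per path from the URL scheme (gs:// -> GCS connector,
// s3a:// -> hadoop-aws).
val sc = new SparkContext(new SparkConf().setAppName("gcs-to-s3"))

// Read from a (hypothetical) Google Cloud Storage bucket...
val input = sc.textFile("gs://bucket/path")

// ...do some processing...
val processed = input.filter(_.nonEmpty)

// ...and write the result out to a (hypothetical) S3 bucket.
processed.saveAsTextFile("s3a://bucket-on-s3/path2")
```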

You'll need the Google Cloud Storage connector JARs on your classpath, along with those needed for s3n/s3a.
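For example, via spark-submit (a sketch only: the connector artifact coordinates are the published ones, but the versions shown are assumptions and should be matched to your Hadoop version; the job class and JAR name are placeholders):

```shell
# Ship the GCS connector plus the Hadoop AWS module with the job.
# Versions here are assumptions -- match them to your cluster's Hadoop.
# com.example.GcsToS3 and my-job.jar are hypothetical placeholders.
spark-submit \
  --packages com.google.cloud.bigdataoss:gcs-connector:1.6.0-hadoop2,org.apache.hadoop:hadoop-aws:2.7.3 \
  --class com.example.GcsToS3 \
  my-job.jar
```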

1. A short talk on the topic, though I only play with Azure and S3:
https://www.youtube.com/watch?v=ND4L_zSDqF0

2. Some notes; bear in mind that the s3a performance tuning covered relates to things surfacing
in Hadoop 2.8, which you probably won't have.


https://hortonworks.github.io/hdp-aws/s3-spark/

A one-line test for whether S3 support is installed: can you read the Landsat CSV file?

sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()

This should work from wherever you are, if your classpath and credentials are set up.
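On the credentials side, s3a usually picks them up from the standard AWS environment variables, or they can be set on the Hadoop configuration directly; a sketch (the fs.s3a.* property names are the real ones, the key values are placeholders, not real credentials):

```scala
// Set s3a credentials on the Hadoop configuration carried by the
// SparkContext; the values shown are placeholders, not real keys.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
```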