spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: convert local tsv file to orc file on distributed cloud storage(openstack).
Date Sun, 20 Nov 2016 16:52:35 GMT

On 19 Nov 2016, at 17:21, vr spark <vrspark123@gmail.com> wrote:

Hi,
I am looking for Scala or Python code samples to convert a local TSV file to an ORC file and store
it on distributed cloud storage (OpenStack).

So I need these 3 samples. Please suggest.

1. read tsv
2. convert to orc
3. store on distributed cloud storage


thanks
VR

All three steps, nine lines of code, assuming a Spark context has already been set up with the
permissions to write to AWS and the relevant JARs for S3A on the classpath. The read operation is
inefficient: to infer the schema it scans the (here, remote) file twice. That may be OK for an
example, but I wouldn't do it in production. The source is a real file belonging to Amazon; the
destination is a bucket of mine.

More details at: http://www.slideshare.net/steve_l/apache-spark-and-object-stores


// read the compressed CSV, inferring the schema (this is what forces the second scan)
val csvdata = spark.read.options(Map(
  "header" -> "true",
  "ignoreLeadingWhiteSpace" -> "true",
  "ignoreTrailingWhiteSpace" -> "true",
  "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
  "inferSchema" -> "true",
  "mode" -> "FAILFAST"))   // fail on malformed records rather than silently dropping them
    .csv("s3a://landsat-pds/scene_list.gz")
// write the same data back out as ORC, overwriting anything already at the destination
csvdata.write.mode("overwrite").orc("s3a://hwdev-stevel-demo2/landsatOrc")
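For production, a sketch of how to avoid that double scan: supply an explicit schema so the remote file is only read once. The column names and types below are illustrative placeholders, not the real scene_list schema — substitute the actual columns of your data.

```scala
import org.apache.spark.sql.types._

// An explicit schema skips inference, so Spark reads the remote file once.
// These fields are hypothetical examples, not the real landsat columns.
val sceneSchema = StructType(Seq(
  StructField("entityId", StringType, nullable = false),
  StructField("acquisitionDate", TimestampType, nullable = true),
  StructField("cloudCover", DoubleType, nullable = true)))

val csvdata = spark.read
  .schema(sceneSchema)
  .option("header", "true")
  .option("mode", "FAILFAST")
  .csv("s3a://landsat-pds/scene_list.gz")
```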
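Since the original question was about a local TSV and OpenStack rather than CSV and S3: a hedged variant, assuming the hadoop-openstack Swift connector is on the classpath and configured. The local path and the container/provider names below are placeholders.

```scala
// TSV is just CSV with a tab separator.
val tsvdata = spark.read
  .option("header", "true")
  .option("sep", "\t")                 // tab-separated values
  .option("inferSchema", "true")
  .csv("file:///tmp/local-data.tsv")   // placeholder local path

// Swift URLs take the form swift://CONTAINER.PROVIDER/path, where PROVIDER
// matches the fs.swift.service.PROVIDER.* entries in the Hadoop configuration.
tsvdata.write.mode("overwrite").orc("swift://mycontainer.myprovider/dataOrc")
```

Note that a local file:// path only works if the same file is visible on every executor; for a real cluster, copy the file to shared storage first.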
