spark-user mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: SFTP Compressed CSV into Dataframe
Date Wed, 02 Mar 2016 19:28:03 GMT
From a quick look through the README and code for spark-sftp, it seems
that the connector works by downloading the file to the local disk of the
driver program, and this is not configurable. You would probably need to
find a different connector, and you probably shouldn't use spark-sftp for
large files. It also seems that it might not work in a cluster environment
(which the project's README also warns about). You might have better luck
using FUSE + sftp, although you will still want your remote gzip CSV file
to be split into multiple files: gzip isn't a splittable compression
format, so a single gzip file ends up being read by a single task.
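
As a rough sketch of the splitting step (this is not part of spark-sftp or
spark-csv; the function name `split_gzip_csv`, the `rows_per_part` parameter,
and the part-file naming are all made up for illustration), a one-off script
along these lines could re-chunk the gzip CSV into several smaller gzip files
before Spark reads them. It assumes the first row is a header and repeats it
in every part:

```python
import csv
import gzip
import os


def split_gzip_csv(src_path, out_dir, rows_per_part):
    """Split one gzipped CSV into several gzipped CSVs.

    Hypothetical helper: each output part repeats the header row so that
    every part can be parsed on its own.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    writer = None
    try:
        with gzip.open(src_path, "rt", newline="") as src:
            reader = csv.reader(src)
            header = next(reader, None)
            if header is None:
                return part_paths  # empty input, nothing to write
            for i, row in enumerate(reader):
                # Start a new part every rows_per_part data rows.
                if i % rows_per_part == 0:
                    if out is not None:
                        out.close()
                    name = "part-%05d.csv.gz" % len(part_paths)
                    path = os.path.join(out_dir, name)
                    out = gzip.open(path, "wt", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    part_paths.append(path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return part_paths
```

Pointing Spark at the output directory rather than at one big file should
then give one input split per part file, so the decompression work can be
spread across tasks. How many rows to put in each part is a tuning choice
that depends on your data and cluster.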

On Wed, Mar 2, 2016 at 11:17 AM, Benjamin Kim <bbuild11@gmail.com> wrote:

> I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV
> file? I am able to download the file first locally using the SFTP Client in
> the spark-sftp package. Then, I load the file into a dataframe using the
> spark-csv package, which automatically decompresses the file. I just want
> to remove the "downloading file to local" step and directly have the remote
> file decompressed, read, and loaded. Can someone give me any hints?
>
> Thanks,
> Ben


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
