spark-user mailing list archives

From Holden Karau <>
Subject Re: SFTP Compressed CSV into Dataframe
Date Wed, 02 Mar 2016 19:28:03 GMT
So doing a quick look through the README & code for spark-sftp, it seems
that this connector works by downloading the file locally onto the
driver program, and this is not configurable - so you would probably need to
find a different connector (and you probably shouldn't use spark-sftp for
large files anyway). It also seems that it might not work in a cluster
environment (which the project's README also warns about). You might have
better luck using FUSE + sftp, although you will still want your remote
gzipped CSV file to be split into multiple files, since gzip isn't a
splittable compression format.
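The pre-splitting Holden suggests - turning one large .csv.gz into several smaller gzipped parts so each file can be read by its own task - can be sketched in plain Python. This is a minimal illustration, not anything from the spark-sftp project; the function name, part count, and file naming scheme are all assumptions:

```python
# Sketch: round-robin the data rows of one CSV into several gzipped part
# files, repeating the header in each part. Each .gz is then independently
# readable (e.g. one Spark task per file), sidestepping gzip's
# non-splittability. Names and part count are illustrative.
import csv
import gzip
import os


def split_csv_gzip(src_path, out_dir, parts=3):
    """Split the data rows of src_path across `parts` gzipped CSV files."""
    with open(src_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]

    out_paths = []
    handles = []
    writers = []
    for i in range(parts):
        path = os.path.join(out_dir, f"part-{i:05d}.csv.gz")
        fh = gzip.open(path, "wt", newline="")  # text mode, gzip on the fly
        writer = csv.writer(fh)
        writer.writerow(header)  # each part carries its own header
        out_paths.append(path)
        handles.append(fh)
        writers.append(writer)

    for n, row in enumerate(data):
        writers[n % parts].writerow(row)

    for fh in handles:
        fh.close()
    return out_paths
```

For a file too large to hold in memory you would stream rows instead of calling `list()`, but the round-robin idea is the same.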

On Wed, Mar 2, 2016 at 11:17 AM, Benjamin Kim <> wrote:

> I wonder if anyone has opened an SFTP connection to read a remote GZIP CSV
> file? I am able to download the file first locally using the SFTP client in
> the spark-sftp package. Then, I load the file into a dataframe using the
> spark-csv package, which automatically decompresses the file. I just want
> to remove the "downloading file to local" step and directly have the remote
> file decompressed, read, and loaded. Can someone give me any hints?
> Thanks,
> Ben
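The decompress-while-reading step Ben describes (spark-csv transparently decompressing the downloaded .gz) can be sketched with the Python standard library alone: `gzip.open` in text mode decompresses on the fly while the CSV reader consumes rows. The function name and file path are illustrative assumptions, not from the thread:

```python
# Sketch: read a gzip-compressed CSV without an explicit decompress step.
# gzip.open in text mode streams decompressed characters straight into
# csv.reader, analogous to spark-csv decompressing the file as it loads it.
import csv
import gzip


def read_gzip_csv(path):
    """Yield each row of a gzip-compressed CSV file as a list of strings."""
    with gzip.open(path, "rt", newline="") as f:
        yield from csv.reader(f)
```

The same pattern works against any file-like source of gzip bytes, which is why mounting the remote host (e.g. via FUSE + sftp) lets a reader treat the remote .gz as if it were local.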

Cell : 425-233-8271
