spark-user mailing list archives

From Irving Duran <irving.du...@gmail.com>
Subject Re: [pyspark] Read multiple files parallely into a single dataframe
Date Fri, 04 May 2018 17:09:30 GMT
I could be wrong, but I think you can use a wildcard in the path:

df = spark.read.format('csv').load('/path/to/file*.csv.gz')
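If a single wildcard cannot cover all the files, load() also accepts a list of paths, so several patterns can be expanded first and passed in together. A minimal sketch, assuming the files sit on a local filesystem that Python's glob module can see (it will not list HDFS or S3 paths); collect_paths and read_csv_files are hypothetical helper names, not part of PySpark:

```python
import glob
import os


def collect_paths(patterns):
    """Expand each glob pattern locally and return a sorted,
    de-duplicated list of matching files (directories are skipped)."""
    found = set()
    for pattern in patterns:
        found.update(p for p in glob.glob(pattern) if os.path.isfile(p))
    return sorted(found)


def read_csv_files(spark, paths):
    """Read every path in `paths` into one DataFrame.

    `spark` is assumed to be an existing SparkSession.
    DataFrameReader.load accepts either a single path or a list of paths.
    """
    return spark.read.format('csv').load(paths)
```

With an existing SparkSession named spark, read_csv_files(spark, collect_paths(['/path/one/*.csv.gz', '/path/two/*.csv.gz'])) returns a single DataFrame, and df.rdd.getNumPartitions() shows how many partitions the read was split into. The two directory patterns are placeholders. Whether this changes the sequential behaviour described in the original message is untested here.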

Thank You,

Irving Duran


On Fri, May 4, 2018 at 4:38 AM Shuporno Choudhury <
shuporno.choudhury@gmail.com> wrote:

> Hi,
>
> I want to read multiple files in parallel into one DataFrame. But the files
> have random names and do not conform to any pattern (so I can't use a
> wildcard). Also, the files can be in different directories.
> If I provide the file names in a list to the DataFrame reader, it reads
> them sequentially.
> For example:
> df=spark.read.format('csv').load(['/path/to/file1.csv.gz','/path/to/file2.csv.gz','/path/to/file3.csv.gz'])
> This reads the files sequentially. What can I do to read the files in
> parallel?
> I noticed that Spark reads files in parallel if it is given the directory
> location directly. How can that be extended to multiple random files?
> If my system has 4 cores, how can I make Spark read 4 files at a
> time?
>
> Please suggest.
>
