spark-user mailing list archives

From Shuporno Choudhury
Subject [pyspark] Read multiple files parallely into a single dataframe
Date Fri, 04 May 2018 09:38:26 GMT

I want to read multiple files in parallel into a single dataframe. However,
the files have arbitrary names that do not conform to any pattern (so I
can't use a wildcard). The files can also be in different directories.
If I pass the file names as a list to the dataframe reader, it reads
them sequentially.
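For illustration, passing a list of file names to the reader might look
like the sketch below. The directory names, file names, and helper names
(write_sample_files, read_as_one_dataframe, main) are made up for the
example, and local[4] is an assumption standing in for a machine with 4 cores.

```python
import csv
import os
import tempfile


def write_sample_files(base_dir):
    """Create three small CSV files with unrelated names in separate directories.

    The directory and file names are hypothetical; they only mimic files
    that share no naming pattern.
    """
    specs = [("dir_a", "x7f3.csv"), ("dir_b", "report_final.csv"), ("dir_c", "q.csv")]
    paths = []
    for i, (subdir, name) in enumerate(specs):
        os.makedirs(os.path.join(base_dir, subdir), exist_ok=True)
        path = os.path.join(base_dir, subdir, name)
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "value"])   # header row
            writer.writerow([i, f"row{i}"])    # one data row per file
        paths.append(path)
    return paths


def read_as_one_dataframe(spark, paths):
    """Pass an explicit list of paths to the reader and get one dataframe back."""
    # DataFrameReader.load accepts either a single path or a list of paths.
    return spark.read.format("csv").option("header", "true").load(paths)


def main():
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    # Assumption: a local session with 4 worker threads.
    spark = (
        SparkSession.builder.master("local[4]")
        .appName("multi-file-read")
        .getOrCreate()
    )
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_sample_files(tmp)
        df = read_as_one_dataframe(spark, paths)
        print("rows:", df.count())
        # The partition count shows how many tasks the scan was split into.
        print("partitions:", df.rdd.getNumPartitions())
    spark.stop()
```

Calling main() on a machine with PySpark installed prints the row count
and the number of partitions of the resulting dataframe.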
This reads the files sequentially. What can I do to read the files
in parallel?
I noticed that Spark reads files in parallel when it is given a directory
location directly. How can that be extended to multiple arbitrary files?
For example, if my system has 4 cores, how can I make Spark read 4 files
at a time?

Please suggest.
