spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arun Patel <>
Subject Best approach for processing all files parallelly
Date Thu, 06 Oct 2016 11:26:49 GMT
My Pyspark program is currently identifies the list of the files from a
directory (Using Python Popen command taking hadoop fs -ls arguments).  For
each file, a Dataframe is created and processed. This is sequeatial. How to
process all files paralelly?  Please note that every file in the directory
has different schema.  So, depending on the file name, different logic is
used for each file. So, I cannot really create one Dataframe for all these
files and iterate each row.  Using wholeTextFiles seems to be good approach
for me.  But, I am not sure how to create DataFrame from this.  For
example, Is there a way to do this way do something like below.

def createDFProcess(myrdd):
    df =

whfiles = sc.wholeTextFiles('/input/dir1').toDF(['fname', 'fcontent']) file: file.fcontent).foreach(createDFProcess)

Above code does not work.  I get an error 'TypeError: 'JavaPackage' object
is not callable'.  How to make it work?

Or is there a better approach?


View raw message