spark-user mailing list archives

From Sujit Pal <sujitatgt...@gmail.com>
Subject Re: How to concatenate two csv files into one RDD?
Date Fri, 26 Jun 2015 23:28:41 GMT
Hi Rex,

If the CSV files are in the same folder and there are no other files in it,
passing the directory path to sc.textFile() (or equivalent) will pull in all
of the files. If the folder contains other files, you can pass in a glob
pattern that captures just the two files you care about (if that's
possible). If neither of these works for you, you can create an individual
RDD for each file and union them.
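A minimal PySpark sketch of the three options above (the directory name
"data/" and the file names are hypothetical, and this assumes a local Spark
installation):

```python
from pyspark import SparkContext

sc = SparkContext("local", "concat-csv")

# Option 1: point sc.textFile() at the whole directory
# (works when the folder holds only the CSV files you want)
all_lines = sc.textFile("data/")

# Option 2: a glob pattern that matches just the files you care about
two_files = sc.textFile("data/part*.csv")

# Option 3: build one RDD per file and union them
rdd1 = sc.textFile("data/part1.csv")
rdd2 = sc.textFile("data/part2.csv")
combined = rdd1.union(rdd2)
```

sc.textFile() also accepts a comma-separated list of paths, e.g.
sc.textFile("data/part1.csv,data/part2.csv"). One caveat with all of these:
if the CSV files have header rows, each file's header line ends up in the
combined RDD and you have to filter them out yourself.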

-sujit


On Fri, Jun 26, 2015 at 11:00 AM, Rex X <dnsring@gmail.com> wrote:

> With Python Pandas, it is easy to do concatenation of dataframes
> by combining  pandas.concat
> <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html>
> and pandas.read_csv
>
> pd.concat([pd.read_csv(os.path.join(Path_to_csv_files, f)) for f in
> csvfiles])
>
> where "csvfiles" is the list of csv files
>
>
> How can we do this in Spark?
>
>
>
