spark-user mailing list archives

From Radhwane Chebaane <r.cheba...@mindlytix.com>
Subject Re: Load multiple CSV from different paths
Date Wed, 05 Jul 2017 14:29:19 GMT
Hi,

According to the Spark 2.x documentation,
org.apache.spark.sql.DataFrameReader has this method:
def csv(paths: String*): DataFrame
<http://spark.apache.org/docs/2.1.0/api/scala/org/apache/spark/sql/package.html#DataFrame=org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]>

So you can unpack your Array of paths like this:

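// Split the comma-separated string into individual paths
val sources = paths.split(',').toSeq

// ": _*" expands the Seq into the varargs that csv(paths: String*) expects.
// Scala string literals need double quotes; 'delimiter' would not compile.
spark.read.option("header", "false")
            .schema(custom_schema)
            .option("delimiter", "\t")
            .option("mode", "DROPMALFORMED")
            .csv(sources: _*)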
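
To show why passing the Array directly does not compile while ": _*" does, here is a minimal sketch that runs without Spark. readAll is a hypothetical stand-in for a varargs method such as csv(paths: String*), and the paths are made up for illustration:

```scala
object VarargsDemo {
  // Hypothetical stand-in for a varargs reader such as csv(paths: String*).
  // It only returns the paths it was given, so no Spark dependency is needed.
  def readAll(paths: String*): Seq[String] = paths.toList

  def main(args: Array[String]): Unit = {
    // Made-up example paths, joined with commas as in the original question
    val paths = "/data/a/x.csv,/data/b/y.csv"
    val sources = paths.split(',').toSeq

    // readAll(sources) would be rejected by the compiler:
    // a Seq[String] is a single argument, not a String.
    // ": _*" tells the compiler to pass each element as its own argument.
    val received = readAll(sources: _*)

    received.foreach(println)
  }
}
```

The same ": _*" ascription also works on the Array returned by split, so the toSeq call is optional.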


In Spark 1.6.x I think this may work with spark-csv
<https://github.com/databricks/spark-csv> :

// Spark 1.6.x has no SparkSession, so the reader comes from sqlContext
sqlContext.read.format("com.databricks.spark.csv").option("header", "false")
            .schema(custom_schema)
            .option("delimiter", "\t")
            .option("mode", "DROPMALFORMED")
            .load(sources: _*)



Cheers,
Radhwane Chebaane

2017-07-05 16:08 GMT+02:00 Didac Gil <didacgil9dev@gmail.com>:

> Hi,
>
> Do you know any simple way to load multiple csv files (same schema) that
> are in different paths?
> Wildcards are not a solution, as I want to load specific csv files from
> different folders.
>
> I came across a solution
> (https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load)
> that suggests something like
>
> spark.read.format("csv").option("header", "false")
>             .schema(custom_schema)
>             .option('delimiter', '\t')
>             .option('mode', 'DROPMALFORMED')
>             .load(paths.split(','))
>
> However, even though it mentions that this approach would work in Spark 2.x,
> I cannot find an implementation of load that accepts an Array[String] as an
> input parameter.
>
> Thanks in advance for your help.
>
>
> Didac Gil de la Iglesia
> PhD in Computer Science
> didacgil9@gmail.com
> Spain:  +34 696 285 544
> Sweden: +46 (0)730229737
> Skype: didac.gil.de.la.iglesia
>
>


-- 

Radhwane Chebaane
Distributed systems engineer, Mindlytix

Mail: radhwane@mindlytix.com
Mobile: +33 695 588 906
Skype: rad.cheb
LinkedIn: https://fr.linkedin.com/in/radhwane-chebaane-483b3a7b
