spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: skipping header from each file
Date Fri, 09 Jan 2015 10:03:19 GMT
I think this was already answered on Stack Overflow:
http://stackoverflow.com/questions/27854919/skipping-header-file-from-each-csv-file-in-spark
One additional idea would be the following.


If there is just one header line, and it is the first record, then the
most efficient way to filter it out is:

rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

Of course, this doesn't help if there are many files, each with its own
header line, since only the first partition of the combined RDD is
trimmed. In that case you can build one RDD per file this way (three
here) and union them.
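
The per-file approach could be sketched roughly as below. This is only an
illustration under some assumptions: readWithoutHeader and
readAllWithoutHeaders are hypothetical helper names, each path is assumed
to point at a single text file (not a directory or glob), and each file is
assumed to have exactly one header line.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: read one file and drop its first line.
// Assumes `path` names a single file, so partition 0 starts at the
// beginning of that file and therefore holds the header.
def readWithoutHeader(sc: SparkContext, path: String): RDD[String] =
  sc.textFile(path).mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }

// Hypothetical helper: build one header-free RDD per file,
// then combine them with a single union.
def readAllWithoutHeaders(sc: SparkContext, paths: Seq[String]): RDD[String] =
  sc.union(paths.map(path => readWithoutHeader(sc, path)))
```

With an existing SparkContext named sc, the three files from the question
would then be read as
readAllWithoutHeaders(sc, Seq("file1", "file2", "file3")).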


On Fri, Jan 9, 2015 at 6:18 AM, Hafiz Mujadid <hafizmujadid00@gmail.com> wrote:
> Suppose I give three file paths to the Spark context to read, and each
> file has its schema in the first row. How can we skip these header lines?
>
>
> val rdd=sc.textFile("file1,file2,file3");
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
