spark-user mailing list archives

From Selvam Raman <sel...@gmail.com>
Subject Re: Spark CSV skip lines
Date Sat, 10 Sep 2016 16:50:13 GMT
Hi,

I have already seen these two options; anyway, thanks for the idea.

I am using wholeTextFiles to read my data (because there are \n characters in
the middle of it) and opencsv to parse it. In my data, the first two lines are
just a report. How can I eliminate them?

*How can I eliminate the first two lines after reading with wholeTextFiles?*

val test = wholeTextFiles.flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  reader.readAll().map(data =>
    Row(data(3), data(4), data(7), data(9), data(14)))
}

The above code throws an ArrayIndexOutOfBoundsException for the empty lines
and the report lines.
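
What I am considering is something like the below (just a rough, untested
sketch): drop the first two lines of each file before parsing, and skip rows
that are too short to index safely. Is there a cleaner way?

import java.io.StringReader
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._
// plus the CSVReader import matching your opencsv version

val test = wholeTextFiles.flatMap { case (_, txt) =>
  // drop the two report lines at the top of each file before parsing as CSV
  val body = txt.split("\n", -1).drop(2).mkString("\n")
  val reader = new CSVReader(new StringReader(body))
  reader.readAll().asScala
    .filter(_.length > 14)  // skip empty/short rows so data(14) is in range
    .map(data => Row(data(3), data(4), data(7), data(9), data(14)))
}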


On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon <gurwls223@gmail.com> wrote:

> Hi Selvam,
>
> If your report lines are commented with some character (e.g. #), you can
> skip them via the comment option [1].
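>
> For example, something like this (untested; assuming the report lines start
> with a # prefix and you are using the Spark 2.0 reader):
>
> val df = spark.read
>   .option("comment", "#")   // skip lines starting with '#'
>   .option("header", "true")
>   .csv("path/to/file.csv")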
>
> If you are using Spark 1.x, then you might be able to do this by manually
> skipping lines in the RDD and then converting it to a DataFrame as below:
>
> I haven’t tested this but I think this should work.
>
> val rdd = sparkContext.textFile("...")
> val filteredRdd = rdd.mapPartitionsWithIndex { (idx, iter) =>
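>   // the report lines sit at the top of the file, so they land in the first
>   // partition (idx == 0); adjust drop(10) to the number of report lines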
>   if (idx == 0) {
>     iter.drop(10)
>   } else {
>     iter
>   }
> }
> val df = new CsvParser().csvRdd(sqlContext, filteredRdd)
>
> If you are using Spark 2.0, then it seems there is no way to manually
> modify the source data, because loading an existing RDD or Dataset[String]
> into a DataFrame as CSV is not yet supported.
>
> There is an open issue for this [2]. I hope this is helpful.
>
> Thanks.
>
> [1] https://github.com/apache/spark/blob/27209252f09ff73c58e60c6df8aaba73b308088c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L369
> [2] https://issues.apache.org/jira/browse/SPARK-15463
>
> On 10 Sep 2016 6:14 p.m., "Selvam Raman" <selmna@gmail.com> wrote:
>
>> Hi,
>>
>> I am using spark-csv to read a CSV file. The issue is that the first n
>> lines of my files contain a report, followed by the actual data (the
>> header and the rest of the data).
>>
>> So how can I skip the first n lines in spark-csv? I don't have any
>> specific comment character in the first byte.
>>
>> Please give me some ideas.
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
