spark-user mailing list archives

From Abhishek Anand <abhis.anan...@gmail.com>
Subject Re: removing header from csv file
Date Tue, 03 May 2016 07:23:31 GMT
You can use this function to remove the header from your dataset (applicable
to RDDs):

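import org.apache.spark.rdd.RDD

// Drops the first line of partition 0, where the header of a single input file lands.
def dropHeader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex { (idx, lines) =>
    // Iterator.drop returns a new iterator, so its result must be the return value.
    if (idx == 0) lines.drop(1) else lines
  }
}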
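As a quick check of the idea, here is a minimal sketch on a small in-memory RDD. It assumes a local SparkContext; the app name, the sample rows, and the value names are made up for illustration. It also assumes the header sits at the start of partition 0, which is the case for a single file, but not for several files that each carry their own header.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local context, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("drop-header-demo").setMaster("local[2]"))

// Four lines spread over two partitions; the header is the first line of partition 0.
val raw = sc.parallelize(Seq("id,name", "1,alice", "2,bob", "3,carol"), numSlices = 2)

// Skip the first line of partition 0 only; every other partition passes through unchanged.
val body = raw.mapPartitionsWithIndex { (idx, lines) =>
  if (idx == 0) lines.drop(1) else lines
}

val rows = body.collect()
sc.stop()
```

Only the first partition is touched, so no extra pass over the data is needed.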


Abhi

On Wed, Apr 27, 2016 at 12:55 PM, Marco Mistroni <mmistroni@gmail.com>
wrote:

> If you are using the Scala API you can do
> myRdd.zipWithIndex().filter(_._2 > 0).map(_._1)
>
> It may be a little complicated, but it will do the trick.
> As for spark-csv, you will get back a DataFrame, which you can convert back
> to an RDD.
> HTH
> Marco
> On 27 Apr 2016 6:59 am, "nihed mbarek" <nihedmm@gmail.com> wrote:
>
>> You can add a filter on a string that you are sure appears only in the
>> header.
>>
>> On Wednesday, 27 April 2016, Divya Gehlot <divya.htconex@gmail.com>
>> wrote:
>>
>>> Yes, you can remove the header by removing the first row.
>>>
>>> You can use first() or head() to find that row.
>>>
>>>
>>> Thanks,
>>> Divya
>>>
>>> On 27 April 2016 at 13:24, Ashutosh Kumar <kmr.ashutosh16@gmail.com>
>>> wrote:
>>>
>>>> I see there is a library, spark-csv, which can be used for removing the
>>>> header and processing CSV files. But it seems it works with SQLContext
>>>> only. Is there a way to remove the header from CSV files without SQLContext?
>>>>
>>>> Thanks
>>>> Ashutosh
>>>>
>>>
>>>
>>
>> --
>>
>> M'BAREK Med Nihed,
>> Fedora Ambassador, TUNISIA, Northern Africa
>> http://www.nihed.com
>>
>> <http://tn.linkedin.com/in/nihed>
>>
>>
>>
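The alternatives suggested in the quoted replies can be sketched roughly as follows. This is only an illustration under assumptions: the local SparkContext, the sample rows, and the value names are hypothetical, and the value-based filter assumes no data row is identical to the header line.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local context, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("header-alternatives").setMaster("local[2]"))

val lines = sc.parallelize(Seq("id,name", "1,alice", "2,bob"), numSlices = 2)

// Index-based: pair every line with its position and keep everything after position 0.
val byIndex = lines.zipWithIndex().filter(_._2 > 0).map(_._1).collect()

// Value-based: read the header with first(), then filter out lines equal to it.
val header = lines.first()
val byValue = lines.filter(_ != header).collect()

sc.stop()
```

The value-based filter also removes repeated headers when several files with the same header are read together, at the cost of comparing every line. Note that zipWithIndex() may run an extra Spark job to compute partition offsets when the RDD has more than one partition.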
