spark-user mailing list archives

From Arun Mahadevan <ar...@apache.org>
Subject Re: custom rdd - do I need a hadoop input format?
Date Tue, 17 Sep 2019 17:46:20 GMT
You can do it with a custom RDD implementation; a Hadoop input format is
not required. You mainly need to implement "getPartitions", which holds the
logic to split your input into partitions, and "compute", which runs on the
executors and returns the values for a given partition.
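
As a rough illustration, a minimal sketch of such an RDD in Scala could look
like the one below. It assumes the (offset, length) of every block can be
worked out on the driver beforehand, for example with a quick indexing pass
over the file. The names Block, BlockPartition, BlockRDD, open and parse are
hypothetical placeholders: "parse" stands in for the existing library that
turns an InputStream into an Iterator of Block, and "open" for whatever opens
a byte range of the file on an executor.

```scala
import java.io.InputStream

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the Block type produced by the existing library.
final case class Block(lines: Seq[String])

// One partition per block; it carries what an executor needs to locate the block.
final case class BlockPartition(index: Int, offset: Long, length: Long) extends Partition

class BlockRDD(
    sc: SparkContext,
    ranges: Seq[(Long, Long)],             // (offset, length) per block, computed on the driver
    open: (Long, Long) => InputStream,     // opens one byte range; must be serializable
    parse: InputStream => Iterator[Block]  // wraps the existing library; must be serializable
) extends RDD[Block](sc, Nil) {            // Nil: this RDD has no parent dependencies

  // Called on the driver: describe how the input is split, one task per block.
  override protected def getPartitions: Array[Partition] =
    ranges.zipWithIndex.map { case ((offset, length), i) =>
      BlockPartition(i, offset, length): Partition
    }.toArray

  // Called on an executor: read and return the values of a single partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Block] = {
    val p = split.asInstanceOf[BlockPartition]
    val in = open(p.offset, p.length)
    // Close the stream when the task finishes, whether it succeeds or fails.
    context.addTaskCompletionListener[Unit](_ => in.close())
    parse(in)
  }
}
```

With this shape, something like new BlockRDD(sc, ranges, open, parse) gives
an RDD[Block] with one task per block, and the usual map or mapPartitions
calls apply from there. If the block boundaries cannot be computed cheaply up
front, the partitioning logic would need a different strategy.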

On Tue, 17 Sep 2019 at 08:47, Marcelo Valle <marcelo.valle@ktech.com> wrote:

> Just to be clearer about my requirements, what I have is actually a
> custom format, with header, summary and multi-line blocks. I want to create
> tasks per block and not per line. I already have a library that reads an
> InputStream and outputs an Iterator of Block, but now I need to integrate
> this with Spark.
>
> On Tue, 17 Sep 2019 at 16:28, Marcelo Valle <marcelo.valle@ktech.com>
> wrote:
>
>> Hi,
>>
>> I want to create a custom RDD which will read n lines in sequence from a
>> file, which I call a block, and each block should be converted to a spark
>> dataframe to be processed in parallel.
>>
>> Question - do I have to implement a custom hadoop input format to achieve
>> this? Or is it possible to do it only with RDD APIs?
>>
>> Thanks,
>> Marcelo.
>>
>
