spark-user mailing list archives

From Arun Mahadevan <>
Subject Re: custom rdd - do I need a hadoop input format?
Date Tue, 17 Sep 2019 17:46:20 GMT
You can do it with a custom RDD implementation.
You will mainly implement "getPartitions", the logic that splits your input
into partitions, and "compute", which computes and returns the values from a
given partition.
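
As a rough illustration of the two methods mentioned above, here is a minimal sketch in Scala. It is not code from this thread: `Block`, `BlockPartition`, `BlockRDD`, `path`, `numBlocks` and `linesPerBlock` are hypothetical names, and it assumes that every block is exactly `linesPerBlock` lines long and that the file is readable at the same path on every executor. A real format with a header and summary would need its own splitting logic, most likely by calling the existing block-reading library inside `compute`.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.io.Source

// Hypothetical stand-in for the block type produced by an existing parsing library.
case class Block(lines: Seq[String])

// One partition per block; Spark's Partition trait only requires an index.
class BlockPartition(override val index: Int) extends Partition

// Sketch of a custom RDD with no parent RDDs (hence the empty dependency list).
// Assumption: blocks have a fixed size of `linesPerBlock` lines.
class BlockRDD(sc: SparkContext, path: String, numBlocks: Int, linesPerBlock: Int)
    extends RDD[Block](sc, Nil) {

  // Describes how the input is split: here, one partition (and so one task) per block.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numBlocks)(i => new BlockPartition(i))

  // Runs on an executor: reads only the lines that belong to this partition's block.
  override def compute(split: Partition, context: TaskContext): Iterator[Block] = {
    val from = split.index * linesPerBlock
    val source = Source.fromFile(path)
    try {
      // toVector forces the read before the file is closed.
      val lines = source.getLines().slice(from, from + linesPerBlock).toVector
      Iterator.single(Block(lines))
    } finally {
      source.close()
    }
  }
}
```

With an existing `SparkContext` named `sc`, something like `new BlockRDD(sc, "/some/path", numBlocks = 4, linesPerBlock = 100)` could then be used like any other RDD, for example with `map` or `foreach`. Note that in this sketch each task scans the file from the beginning to reach its block, which is simple but not efficient for large files.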

On Tue, 17 Sep 2019 at 08:47, Marcelo Valle <> wrote:

> Just to be clearer about my requirements, what I have is actually a
> custom format, with a header, a summary and multi-line blocks. I want to
> create tasks per block and not per line. I already have a library that reads
> an InputStream and outputs an Iterator of Block, but now I need to integrate
> this with Spark.
> On Tue, 17 Sep 2019 at 16:28, Marcelo Valle <>
> wrote:
>> Hi,
>> I want to create a custom RDD which will read n lines in sequence from a
>> file, which I call a block, and each block should be converted to a Spark
>> DataFrame to be processed in parallel.
>> Question - do I have to implement a custom hadoop input format to achieve
>> this? Or is it possible to do it only with RDD APIs?
>> Thanks,
>> Marcelo.
