spark-issues mailing list archives

From "Shirish (Jira)" <>
Subject [jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks
Date Wed, 18 Dec 2019 02:38:01 GMT


Shirish commented on SPARK-1476:

This is an old thread that I happened to land on. I am interested in the following points
mentioned by [~mridulm80]. Did anyone ever get around to implementing a MultiOutputs-style map
without needing to use cache? If not, is there a pointer on how to get started?

_"[~matei] Interesting that you should mention about splitting output of a map into multiple_

_We are actually thinking about that in a different context - akin to MultiOutputs in hadoop
or SPLIT in pig: without needing to cache the intermediate output, but directly emit values
to different blocks/rdd's based on the output of a map or some such."_
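The single-pass "split" idea in the quote can be sketched in plain Java (no Spark dependency; `MultiOutputSplit` and `route` are hypothetical names, not Spark or Hadoop API). The point is that each record is routed to exactly one output bucket in one pass over the input, rather than re-reading or caching the input once per output the way a chain of `filter()` calls would:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical single-pass split: route each record to one of several
// outputs keyed by a mapper's result, touching the input only once.
public class MultiOutputSplit {
    public static <T, K> Map<K, List<T>> split(Iterable<T> records,
                                               Function<T, K> route) {
        Map<K, List<T>> outputs = new HashMap<>();
        for (T record : records) {
            // One pass: each record lands in exactly one output bucket.
            outputs.computeIfAbsent(route.apply(record), k -> new ArrayList<>())
                   .add(record);
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> out =
            split(List.of(1, 2, 3, 4, 5),
                  (Integer n) -> n % 2 == 0 ? "even" : "odd");
        System.out.println(out.get("even")); // [2, 4]
        System.out.println(out.get("odd"));  // [1, 3, 5]
    }
}
```

In an RDD setting the hard part (which this sketch ignores) is doing the same routing across a distributed map stage without materializing the intermediate output, which is what the quoted comment is asking about.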


> 2GB limit in spark for blocks
> -----------------------------
>                 Key: SPARK-1476
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>         Environment: all
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>         Attachments: 2g_fix_proposal.pdf
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size
of the block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle blocks
(memory-mapped blocks are limited to 2GB, even though the API accepts a long), ser/deser
via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for using Spark on non-trivial datasets.
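The limit described above follows directly from `java.nio.ByteBuffer`: its position, limit, and capacity are all Java `int`s, so no single buffer can address more than 2^31 - 1 bytes, and `FileChannel.map` rejects any larger region even though its `size` parameter is a `long`. A minimal demonstration (the class and method names are illustrative, not Spark code; no 2GB allocation actually happens, since the JVM validates the size before mapping):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

// Demonstrates the 2GB ceiling on ByteBuffer-backed blocks: mapping a
// region one byte past Integer.MAX_VALUE throws before any mapping occurs.
public class TwoGigLimitDemo {
    static final long MAX_BLOCK_BYTES = Integer.MAX_VALUE; // 2^31 - 1

    // Attempts to memory-map an oversized region and reports whether
    // the JVM refused it.
    public static boolean oversizedMapRejected() throws IOException {
        File f = File.createTempFile("block", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            // map() takes a long size, but the result is an Int-indexed
            // MappedByteBuffer, so sizes above Integer.MAX_VALUE are rejected.
            ch.map(FileChannel.MapMode.READ_WRITE, 0L, MAX_BLOCK_BYTES + 1L);
            return false; // unreachable on a compliant JVM
        } catch (IllegalArgumentException expected) {
            return true; // "Size exceeds Integer.MAX_VALUE"
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("oversized map rejected: " + oversizedMapRejected());
    }
}
```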

This message was sent by Atlassian Jira

