spark-user mailing list archives

From Someshwar Kale <skale1...@gmail.com>
Subject Re: Reading 7z file in spark
Date Wed, 15 Jan 2020 01:36:10 GMT
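I would suggest using a different compression format that is splittable,
e.g. bzip2 or indexed LZO. Note that plain LZ4 files are not split by
Hadoop's codec; LZ4 mainly helps as block compression inside a container
format.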
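
To illustrate the suggestion above, here is a minimal sketch of writing text as bzip2 and reading it back with Spark's built-in text source. The helper name bzip2RoundTrip and the output directory are hypothetical; it assumes an existing SparkSession is passed in.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: writes the given lines as bzip2-compressed text,
// reads them back, and returns (row count, number of input partitions).
def bzip2RoundTrip(spark: SparkSession, lines: Seq[String], dir: String): (Long, Int) = {
  import spark.implicits._
  lines.toDF("value")
    .write.mode("overwrite")
    .option("compression", "bzip2") // codec supported by the text writer
    .text(dir)
  val df = spark.read.text(dir)     // codec is picked from the .bz2 extension
  (df.count(), df.rdd.getNumPartitions)
}
```

With a large .bz2 input, the partition count can exceed one because the bzip2 codec is splittable; a non-splittable codec such as gzip yields one partition per file.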

On Wed, Jan 15, 2020, 1:32 AM Enrico Minack <mail@enrico.minack.dev> wrote:

> Hi,
>
> Spark does not support 7z natively, but you can read any file in Spark:
>
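> import org.apache.spark.input.PortableDataStream
> import spark.implicits._  // needed for toDF outside spark-shell
>
> // Placeholder: returns only the file path; 7z decoding would go here.
> def read(stream: PortableDataStream): Iterator[String] = {
>   Seq(stream.getPath()).iterator
> }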
>
> spark.sparkContext
>   .binaryFiles("*.7z")
>   .flatMap(file => read(file._2))
>   .toDF("path")
>   .show(false)
>
> This scales with the number of files. A single large 7z file would not
> scale well (a single partition).
>
> Any file that matches *.7z will be loaded via the read(stream:
> PortableDataStream) method, which returns an iterator over the rows. This
> method is executed on the executors and can implement the 7z-specific code,
> which is independent of Spark and should not be too hard (the example above
> does not open the input stream; it only returns the path).
>
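
One possible way to fill in the 7z-specific part of read is sketched below. It assumes Apache Commons Compress is on the executor classpath (a library not mentioned in the original message), that each archive fits in executor memory, and that entries are UTF-8 text smaller than 2 GB. The helper name readLines is hypothetical.

```scala
import java.nio.charset.StandardCharsets
import org.apache.commons.compress.archivers.sevenz.SevenZFile
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel
import org.apache.spark.input.PortableDataStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: decodes a whole 7z archive held in memory and
// returns the text lines of all non-empty file entries.
def readLines(archive: Array[Byte]): Iterator[String] = {
  // SevenZFile needs random access, so the bytes are wrapped in a seekable channel.
  val sevenZ = new SevenZFile(new SeekableInMemoryByteChannel(archive))
  try {
    val lines = ArrayBuffer.empty[String]
    var entry = sevenZ.getNextEntry
    while (entry != null) {
      if (!entry.isDirectory && entry.getSize > 0) {
        val content = new Array[Byte](entry.getSize.toInt) // assumes entry < 2 GB
        var offset = 0
        var done = false
        while (!done && offset < content.length) {
          val n = sevenZ.read(content, offset, content.length - offset)
          if (n <= 0) done = true else offset += n
        }
        lines ++= new String(content, 0, offset, StandardCharsets.UTF_8).split("\r?\n")
      }
      entry = sevenZ.getNextEntry
    }
    lines.iterator
  } finally {
    sevenZ.close()
  }
}

// Drop-in replacement for the placeholder read method.
def read(stream: PortableDataStream): Iterator[String] =
  readLines(stream.toArray())
```

Because stream.toArray() loads the entire archive into memory, this only suits archives that are small relative to executor memory; larger files would need to be copied to local disk first.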
> If you are planning to read the same files more than once, it would be
> worth first uncompressing them and converting them into a format Spark
> supports natively. Then Spark can scale much better.
>
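
The one-time conversion suggested above could look roughly like the following sketch. The object name SevenZToParquet, the method convertToParquet, and the choice of Parquet as the target format are assumptions for illustration; read is the placeholder from the original message and would be replaced by real 7z decoding.

```scala
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object SevenZToParquet {
  // Placeholder from the original message: returns only the file path.
  def read(stream: PortableDataStream): Iterator[String] =
    Seq(stream.getPath()).iterator

  // Reads every file matching inputGlob once and stores the rows as Parquet.
  def convertToParquet(spark: SparkSession, inputGlob: String, outputDir: String): Unit = {
    import spark.implicits._
    spark.sparkContext
      .binaryFiles(inputGlob)
      .flatMap(file => read(file._2))
      .toDF("value")
      .write.mode("overwrite")
      .parquet(outputDir)
  }
}
```

Later jobs can then call spark.read.parquet(outputDir) and get splittable input instead of decoding the archives again.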
> Regards,
> Enrico
>
>
> On 13.01.20 at 13:31, HARSH TAKKAR wrote:
>
> Hi,
>
>
> Is it possible to read a 7z-compressed file in Spark?
>
>
> Kind Regards
> Harsh Takkar
>
>
>
