spark-user mailing list archives

From mytramesh <turlapati.ram...@gmail.com>
Subject Re: Implementing .zip file codec
Date Thu, 09 Aug 2018 16:36:34 GMT
Spark doesn't support reading zip files directly, because zip is not a
splittable format that can be distributed across tasks.

Read the archive with the java.util.zip.ZipInputStream API and build an RDD
from it (4 GB limit):

import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.input.PortableDataStream

val zipPath = "s3://.... ABC.zip"

// binaryFiles yields one (path, stream) pair per file,
// so each zip archive is read by a single task.
val rdd = sc.binaryFiles(zipPath).flatMap { case (_, stream: PortableDataStream) =>
  val zipStream = new ZipInputStream(stream.open())
  // Position the stream at the first entry; only that entry is read here.
  zipStream.getNextEntry()
  // Decode the entry as ISO-8859-1 text and emit one record per line.
  Source.fromInputStream(zipStream, "ISO-8859-1").getLines()
}
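
The snippet above reads only the first entry of each archive. If the zip holds several files, something like the following might work. It is a minimal sketch, not tested on a cluster: ZipLines, allLines and makeZip are names made up for this example, and makeZip exists only to build a small archive in memory for trying it out.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.io.Source

// Hypothetical helper object; the names are illustrative only.
object ZipLines {
  // Lazily yields the lines of every file entry in the zip stream, in archive order.
  def allLines(in: InputStream, charset: String = "ISO-8859-1"): Iterator[String] = {
    val zip = new ZipInputStream(in)
    Iterator
      .continually(zip.getNextEntry())        // returns null once the archive is exhausted
      .takeWhile(_ != null)
      .filter(entry => !entry.isDirectory)    // directory entries carry no data
      // Reads on the zip stream stop at the end of the current entry,
      // so each Source sees exactly one file.
      .flatMap(_ => Source.fromInputStream(zip, charset).getLines())
  }

  // Demo-only helper: builds a small zip archive in memory.
  def makeZip(files: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ZipOutputStream(bytes)
    files.foreach { case (name, content) =>
      out.putNextEntry(new ZipEntry(name))
      out.write(content.getBytes(StandardCharsets.ISO_8859_1))
      out.closeEntry()
    }
    out.close()
    bytes.toByteArray
  }
}
```

In the Spark job, the body of the flatMap would then become ZipLines.allLines(stream.open()), where stream is the PortableDataStream. Each archive is still processed by a single task, so this does not make the read parallel within one zip file.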


If the zip file is larger than 4 GB, use ZipArchiveInputStream from Apache
Commons Compress instead:

import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream
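
For archives over the 4 GB limit mentioned above, the reading loop could look roughly like this. It is a sketch that assumes the commons-compress jar is on the driver and executor classpath; LargeZipLines and allLines are made-up names, and I have not run it against a multi-gigabyte file.

```scala
import java.io.InputStream
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream
import scala.io.Source

// Hypothetical helper object; the names are illustrative only.
object LargeZipLines {
  // Lazily yields the lines of every file entry, using Commons Compress
  // in place of java.util.zip.ZipInputStream.
  def allLines(in: InputStream, charset: String = "ISO-8859-1"): Iterator[String] = {
    val zip = new ZipArchiveInputStream(in)
    Iterator
      .continually(zip.getNextEntry())        // returns null once the archive is exhausted
      .takeWhile(_ != null)
      .filter(entry => !entry.isDirectory)    // skip directory entries
      // Reads stop at the end of the current entry, so each Source sees one file.
      .flatMap(_ => Source.fromInputStream(zip, charset).getLines())
  }
}
```

Inside sc.binaryFiles(zipPath).flatMap, it would be called as LargeZipLines.allLines(stream.open()), with stream being the PortableDataStream of each file.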



