spark-user mailing list archives

From "Lalwani, Jayesh" <Jayesh.Lalw...@capitalone.com>
Subject Re: Converting binary files
Date Wed, 03 Jan 2018 02:48:28 GMT
You can repartition your dataframe into 1 partition, and all the data will land in a single partition.
However, doing this is perilous because you will end up with all your data on one node, and
if you have too much data you will run out of memory. In fact, any time you are thinking about
putting data in a single file, you should ask yourself, “Does this data fit into memory?”
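
As a minimal sketch of that single-file write (the SparkSession setup and both paths here are hypothetical, not from this thread):

    import org.apache.spark.sql.SparkSession

    // Sketch only: collapse everything into one partition before writing.
    // The app name and the input/output paths are placeholders.
    val spark = SparkSession.builder().appName("single-file-write").getOrCreate()
    val df = spark.read.parquet("hdfs:///input/data")

    df.repartition(1)                 // one partition => one output part file
      .write
      .mode("overwrite")
      .csv("hdfs:///output/single")   // still a directory, containing a single part-* file

The caveat above still applies: that one partition, and therefore the whole dataset, has to fit on a single executor.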

The reason Spark is geared towards reading and writing data in a partitioned manner is that,
fundamentally, partitioning data is how you scale your applications. Partitioned data
allows Spark (or really any application that is designed to scale on a cluster) to read data
in parallel, process it, and write it back out without any bottlenecking. Humans prefer all their data
in a single file/table, because humans have a limited ability to keep track of a multitude
of files. Grid-enabled software hates single files, simply because there is no good way for
two nodes to read a large file without some sort of bottleneck.
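
For contrast, a sketch of the partitioned pattern Spark is built for, reusing the hypothetical df from the sketch above; the partition count and output path are made up:

    // Each of the 200 partitions is written by its own task in parallel,
    // so no single node has to hold or serialize the whole dataset.
    df.repartition(200)
      .write
      .mode("overwrite")
      .parquet("hdfs:///output/partitioned")   // one part-* file per partition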

Imagine a data processing pipeline that starts with some sort of ingestion and transformation
at one end and feeds into several analytical processes. Usually there are humans at the
end who are looking at the results of the analytics. These humans love to get their analytics
in a dashboard that gives them a high-level view of the data. However, all the data processing
systems that go from input to analytics prefer their data cut up into bite-sized chunks.

From: Christopher Piggott <cpiggott@gmail.com>
Date: Saturday, December 30, 2017 at 3:45 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Converting binary files

I have been searching for examples, but not finding exactly what I need.

I am looking for the paradigm for using spark 2.2 to convert a bunch of binary files into
a bunch of different binary files.  I'm starting with:

   val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input")

then convert them:

   val converted = files.map { case (filename, content) => filename -> convert(content) }

but I don't really want to save by 'partition'; I want to save each file under its original
name but in a different directory, e.g. "converted/*".

I'm not quite sure how I'm supposed to do this within the framework of what's available to
me in SparkContext.  Do I need to do it myself using the HDFS API?
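
A hedged sketch of that HDFS-API route, assuming the converted RDD from the snippet above holds (filename, Array[Byte]) pairs (i.e. convert returns the new bytes) and that the output directory lives on the same cluster as the input:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch only: each task writes its converted bytes under converted/,
    // keeping the original base name. Paths here are illustrative.
    converted.foreach { case (filename, bytes) =>
      val target = new Path("hdfs://1.2.3.4/converted/" + new Path(filename).getName)
      val fs = FileSystem.get(target.toUri, new Configuration())
      val out = fs.create(target, true)   // overwrite an existing file
      try out.write(bytes)
      finally out.close()
    }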

It would seem like this would be a pretty normal thing to do.  Imagine for instance I were
saying take a bunch of binary files and compress them, and save the compressed output to a
different directory.  I feel like I'm missing something fundamental here.

--C


