spark-user mailing list archives

From Andrew Ehrlich
Subject Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)
Date Mon, 01 Aug 2016 00:18:29 GMT
You could write each image to a different directory instead of a different file. That can be
done by filtering the RDD into one RDD per image and then saving each one. That might not
be what you're after, though, in terms of space and speed efficiency. Another way would be
to save the multiple outputs into one Parquet (or text) file. There might be information
about each image that you can partition on (probably a timestamp) to make lookups faster.
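
A rough sketch of both options follows, in case it helps. The function names, the
"-success" naming helper, and the "name"/"length" column names are made up for
illustration; `images` is assumed to be the (path, bytes) RDD returned by
sc.binaryFiles, and `spark` an existing SparkSession.

```python
import posixpath


def output_name(path):
    """Map '/myimages/one.jpg' to 'one-success' (hypothetical naming helper)."""
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    return stem + "-success"


def save_one_dir_per_image(images, out_root):
    """Option 1: filter the pair RDD once per key and save each result.

    Each saveAsTextFile call creates a directory (with part files inside),
    not a single file, and runs its own Spark job, so this gets slow and
    produces many small HDFS files when there are many images.
    """
    for path in images.keys().collect():
        (images
         .filter(lambda kv, wanted=path: kv[0] == wanted)
         .map(lambda kv: len(kv[1]))  # placeholder for the real processing
         .saveAsTextFile(posixpath.join(out_root, output_name(path))))


def save_lengths_as_parquet(spark, images, out_path):
    """Option 2: keep every (name, length) row in one Parquet dataset."""
    rows = images.map(lambda kv: (output_name(kv[0]), len(kv[1])))
    df = spark.createDataFrame(rows, ["name", "length"])
    df.write.mode("overwrite").parquet(out_path)
```

With the second option, a lookup for one image becomes a filter on the "name"
column after reading the dataset back with spark.read.parquet. If the rows carried
a column worth partitioning on, df.write.partitionBy could be added before
.parquet(...), but whether that pays off depends on the data.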

> On Jul 30, 2016, at 8:01 PM, Bhaarat Sharma wrote:
> I am just trying to do this as a proof of concept. The actual content of the files will be quite big.
> I'm having a problem using foreach or something similar on an RDD.
> sc.binaryFiles("/root/sift_images_test/*.jpg")
> returns
> ("filename1", bytes)
> ("filename2", bytes)
> I'm wondering if there is a way to do processing on each of these (the processing in this case is just getting the byte length, but it will be something else in the real world) and then write the contents to separate HDFS files.
> If this doesn't make sense, would it make more sense to have all contents in a single HDFS file?
> On Sat, Jul 30, 2016 at 10:19 PM, ayan guha wrote:
> This sounds like a bad idea, given that HDFS does not work well with small files.
> On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma wrote:
> I am reading a bunch of files in PySpark using binaryFiles. Then I want to get the number of bytes for each file and write this number to an HDFS file with the corresponding name.

> Example:
> if directory /myimages has one.jpg, two.jpg, and three.jpg then I want three files one-success.jpg, two-success.jpg, and three-success.jpg in HDFS with a number in each. The number will specify the length in bytes.
> Here is what I've done thus far:
> from pyspark import SparkContext
> import numpy as np
> sc = SparkContext("local", "test")
> def bytes_length(rawdata):
>         length = len(np.asarray(bytearray(rawdata),dtype=np.uint8))
>         return length
> images = sc.binaryFiles("/root/sift_images_test/*.jpg")
> images.map(lambda (filename, contents): bytes_length(contents)).saveAsTextFile("hdfs://localhost:9000/tmp/somfile")
> However, doing this creates a single file in HDFS:
> $ hadoop fs -cat /tmp/somfile/part-00000
> 113212
> 144926
> 178923
> Instead I want /tmp/somfile in HDFS to have three files:
> one-success.txt with value 113212
> two-success.txt with value 144926
> three-success.txt with value 178923
> Is it possible to achieve what I'm after? I don't want to write files to the local file system and then put them in HDFS. Instead, I want to use the saveAsTextFile method on the RDD directly.
> -- 
> Best Regards,
> Ayan Guha
