From Bhaarat Sharma
Subject How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)
Date Sat, 30 Jul 2016 22:57:54 GMT
I am reading a bunch of files in PySpark using binaryFiles. Then I want to
get the number of bytes for each file and write this number to an HDFS file
with the corresponding name.


For example, if the directory /myimages has one.jpg, two.jpg, and three.jpg,
then I want three files in HDFS: one-success.txt, two-success.txt, and
three-success.txt. Each should contain a single number, the length of the
corresponding image in bytes.
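
To make the naming concrete, here is a minimal sketch of the mapping from an input path to the desired output name. The helper name success_name is hypothetical, and the sketch assumes the "-success.txt" suffix used in the expected output further down.

```python
import posixpath

def success_name(input_path):
    """Map an input path such as '/myimages/one.jpg' to 'one-success.txt'."""
    base = posixpath.basename(input_path)   # e.g. 'one.jpg'
    stem, _ext = posixpath.splitext(base)   # e.g. ('one', '.jpg')
    return stem + "-success.txt"
```

Because it only looks at the last path component, the same helper should work for the full URIs that binaryFiles returns as keys.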

Here is what I've done thus far:

from pyspark import SparkContext
import numpy as np

sc = SparkContext("local", "test")

def bytes_length(rawdata):
        length = len(np.asarray(bytearray(rawdata),dtype=np.uint8))
        return length

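images = sc.binaryFiles("/root/sift_images_test/*.jpg")
# each record is a (filename, contents) pair
images.map(lambda kv: bytes_length(kv[1])).saveAsTextFile("/tmp/somfile")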

However, doing this creates a single file in HDFS:

$ hadoop fs -cat /tmp/somfile/part-00000


Instead, I want /tmp/somfile in HDFS to contain three files:

one-success.txt with value 113212
two-success.txt with value 144926
three-success.txt with value 178923

Is it possible to achieve what I'm after? I don't want to write the files
to the local file system and then put them in HDFS. Instead, I want to use
the saveAsTextFile method on the RDD directly.
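
For context, saveAsTextFile writes one part-NNNNN file per partition, so by itself it does not give each record its own file name. Below is a minimal sketch of one possible workaround that avoids the local file system. It is not a saveAsTextFile-based solution: it pipes each value to "hadoop fs -put - <dst>", which reads from stdin. It assumes the hadoop CLI is on the PATH of every executor, the names put_command and write_length are hypothetical, and it has not been tested on a cluster.

```python
import posixpath
import subprocess

def put_command(out_dir, input_path):
    """Build the `hadoop fs -put` command that reads stdin and writes <stem>-success.txt."""
    stem = posixpath.splitext(posixpath.basename(input_path))[0]
    dst = posixpath.join(out_dir, stem + "-success.txt")
    return ["hadoop", "fs", "-put", "-", dst]

def write_length(record, out_dir):
    """Write the byte length of one (path, contents) record to its own HDFS file."""
    path, contents = record
    proc = subprocess.Popen(put_command(out_dir, path), stdin=subprocess.PIPE)
    proc.communicate(str(len(contents)).encode("ascii"))
    if proc.returncode != 0:
        raise RuntimeError("hadoop fs -put failed for " + path)

# Possible usage on the RDD from binaryFiles (runs on the executors):
# images.foreach(lambda rec: write_length(rec, out_dir))  # out_dir: target HDFS directory
```

Note that -put fails if the destination already exists, and launching one hadoop process per record is slow for large inputs, so this is only a starting point.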

