spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)
Date Sun, 31 Jul 2016 02:19:44 GMT
This sounds like a bad idea, given that HDFS does not work well with small files.
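
One way around that, as a rough sketch only: keep each file name next to its byte count and write all the records into a single output directory, instead of creating one tiny HDFS file per image. The helper names size_record and write_sizes are made up for illustration; binaryFiles and saveAsTextFile are the same calls used in the snippet quoted below, and sc is assumed to be an existing SparkContext.

```python
import posixpath


def size_record(path, contents):
    """Return 'basename<TAB>byte count' for one (path, contents) pair."""
    # posixpath handles both plain paths and hdfs:// style URIs,
    # since both use "/" as the separator.
    return "{}\t{}".format(posixpath.basename(path), len(contents))


def write_sizes(sc, src_glob, out_dir):
    """Write one text line per input file under out_dir.

    sc is assumed to be an existing SparkContext (hypothetical wiring).
    """
    images = sc.binaryFiles(src_glob)
    # binaryFiles yields (path, contents) pairs; index the tuple so the
    # lambda is valid on both Python 2 and Python 3.
    images.map(lambda kv: size_record(kv[0], kv[1])).saveAsTextFile(out_dir)
```

Each line of the part files then carries the name and the size together, so nothing is lost by not having a separate file per image.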

On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma <bhaarat.s@gmail.com> wrote:

> I am reading a bunch of files in PySpark using binaryFiles. Then I want to
> get the number of bytes for each file and write this number to an HDFS file
> with the corresponding name.
>
> Example:
>
> if directory /myimages has one.jpg, two.jpg, and three.jpg then I want
> three files one-success.txt, two-success.txt, and three-success.txt in HDFS
> with a number in each. The number will specify the file's length in bytes.
>
> Here is what I've done thus far:
>
> from pyspark import SparkContext
> import numpy as np
>
> sc = SparkContext("local", "test")
>
> def bytes_length(rawdata):
>         length = len(np.asarray(bytearray(rawdata),dtype=np.uint8))
>         return length
>
> images = sc.binaryFiles("/root/sift_images_test/*.jpg")
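> # binaryFiles yields (filename, contents) pairs; index the tuple, since
> # tuple-unpacking lambdas are not valid in Python 3
> images.map(lambda kv: bytes_length(kv[1])).saveAsTextFile("hdfs://localhost:9000/tmp/somfile")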
>
>
> However, doing this creates a single file in HDFS:
>
> $ hadoop fs -cat /tmp/somfile/part-00000
>
> 113212
> 144926
> 178923
>
> Instead I want /tmp/somfile in HDFS to have three files:
>
> one-success.txt with value 113212
> two-success.txt with value 144926
> three-success.txt with value 178923
>
> Is it possible to achieve what I'm after? I don't want to write files to the local file system
> and then put them in HDFS. Instead, I want to use the saveAsTextFile method on the RDD directly.
>
>
>


-- 
Best Regards,
Ayan Guha
