spark-user mailing list archives

From Sean Owen <>
Subject Re: Anyway to make RDD preserve input directories structures?
Date Fri, 16 Jan 2015 03:57:58 GMT
Maybe you are saying you already do this, but it's perfectly possible
to process as many RDDs as you like in parallel on the driver. That
may let your current approach use as much parallelism as you like. I'm
not sure if that's what you are describing with "submit multi
applications", but you do not need separate Spark applications, just
something like a local parallel Scala collection invoking the RDD
operations.
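
For instance, a plain thread pool on the driver can fan out one job per
file within a single application (submitting jobs to a SparkContext from
multiple threads is safe). A minimal sketch, with the Spark calls
themselves stubbed out and the input-to-output path mapping purely
illustrative:

```python
# Sketch: run one per-file job at a time in parallel from a single driver.
# The Spark pipeline is replaced by a stub so the fan-out shape is visible;
# parse_line and the output naming are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor

def process_one(path):
    # In a real PySpark driver this body would be something like:
    #   sc.textFile(path).map(parse_line).saveAsTextFile(output_path_for(path))
    # Here we just return the would-be output path to show the fan-out.
    return path.replace("log-collections", "parsed-logs")

paths = [
    "s3://log-collections/sys1/20141212/nginx.gz",
    "s3://log-collections/sys1/20141213/nginx-part-1.gz",
]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_one, paths))
```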

Yes, these operations do not and should not 'append' data to existing files.

Of course this has the downside of all the overhead of processing
every file individually rather than as one big job. You may not be
able to design this differently, but ideally I suggest you encode this
information in the data itself rather than in the directory structure.
I know the directory layout sometimes matters for downstream tools,
though, as it is part of how partitioning is defined, for example.
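
For example, the date could be carried inside each record rather than
only in the path; a sketch of that idea (the tab-separated record layout
here is an assumption for illustration, not something from the thread):

```python
import re

def tag_with_date(path, line):
    """Prepend the date taken from the input path to each parsed record,
    so downstream jobs can filter by date without relying on directories.
    (Tab-separated output is an assumed, illustrative format.)"""
    m = re.search(r"/(\d{8})/", path)
    date = m.group(1) if m else "unknown"
    return date + "\t" + line

record = tag_with_date("s3://log-collections/sys1/20141213/nginx-part-1.gz",
                       "GET /index.html 200")
```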

On Fri, Jan 16, 2015 at 2:15 AM, 逸君曹 <> wrote:
> say there's some logs:
> s3://log-collections/sys1/20141212/nginx.gz
> s3://log-collections/sys1/20141213/nginx-part-1.gz
> s3://log-collections/sys1/20141213/nginx-part-2.gz
> I have a function that parse the logs for later analysis.
> I want to parse all the files. So I do this:
> logs = sc.textFile('s3://log-collections/sys1/')
> BUT, this will destroy the date-separated naming schema, resulting in:
> s3://parsed-logs/part-0000
> s3://parsed-logs/part-0001
> ...
> And the worst part is that when I get a new day's logs, it seems
> rdd.saveAsTextFile can't just append the new day's logs to the existing
> output. So I create an RDD for every single file, parse it, and save it
> under the name I want, like this:
> one = sc.textFile("s3://log-collections/sys1/20141213/nginx-part-1.gz")
> which results in:
> s3://parsed-logs/20141212/01/part-0000
> s3://parsed-logs/20141213/01/part-0000
> s3://parsed-logs/20141213/01/part-0001
> s3://parsed-logs/20141213/02/part-0000
> s3://parsed-logs/20141213/02/part-0001
> s3://parsed-logs/20141213/02/part-0002
> And when a new day's logs come, I just process that day's logs and put
> them in the proper directory (or "key").
> THE PROBLEM is that this way I have to create a separate RDD for every
> single file, which can't take advantage of Spark's automatic parallel
> processing. [I'm trying to submit multiple applications for each batch
> of files.]
> Or maybe I'd better use Hadoop Streaming for this?
> Any suggestions?
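
The per-file output scheme described in the question can be sketched as a
pure path mapping (the part-number convention is inferred from the example
listing and is an assumption):

```python
import re

def output_prefix(path):
    """Map an input log path to a dated output prefix, e.g.
    .../20141213/nginx-part-2.gz -> s3://parsed-logs/20141213/02
    Single-file days (no -part-N suffix) are assumed to get part 01."""
    m = re.search(r"/(\d{8})/[^/]*?(?:-part-(\d+))?\.gz$", path)
    date, part = m.group(1), m.group(2) or "1"
    return "s3://parsed-logs/%s/%02d" % (date, int(part))
```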
