Say there are some logs stored on S3, organized into per-date directories (e.g. s3://log-collections/sys1/20141213/nginx-part-1.gz). I have a function that parses these logs for later analysis.
I want to parse all the files. So I do this:
logs = sc.textFile('s3://log-collections/sys1/')
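For context, the rest of that job looks roughly like this; parse_line and the output bucket below are just placeholders for my real parsing function and destination:

def parse_line(line):
    # placeholder for my real parsing logic
    return line.strip()

parsed = logs.map(parse_line)
parsed.saveAsTextFile('s3://log-collections/parsed/sys1/')   # output: one flat set of part-XXXXX files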
BUT, this destroys the per-date naming scheme: saveAsTextFile writes everything into one flat set of part files, so the output is no longer split by day.
And the worse part is that when a new day's logs arrive, rdd.saveAsTextFile can't just append them to the existing output directory.
So instead I create an RDD for every single file, parse it, and save it under the name I want, like this:
one = sc.textFile("s3://log-collections/sys1/20141213/nginx-part-1.gz")
And when a new day's logs come in, I just process that day's files and write them to the proper directory (or "key").
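For one file, that amounts to something like this sketch (parse_line and the parsed output bucket are placeholders):

one = sc.textFile("s3://log-collections/sys1/20141213/nginx-part-1.gz")
parsed = one.map(parse_line)
# save under the matching date so the per-day layout is preserved
parsed.saveAsTextFile("s3://log-collections/parsed/sys1/20141213/nginx-part-1/")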
THE PROBLEM is that this way I have to create a separate RDD for every single file, which can't take advantage of Spark's automatic parallel processing. [Right now I'm working around it by submitting a separate application for each batch of files.]
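To make the issue concrete, I end up with a driver-side loop like this sketch (the file list is just illustrative; in practice I'd list the S3 keys first):

files = [
    "s3://log-collections/sys1/20141213/nginx-part-1.gz",
    "s3://log-collections/sys1/20141213/nginx-part-2.gz",
]
for path in files:
    rdd = sc.textFile(path)                                           # one RDD per file
    out = path.replace("/sys1/", "/parsed/sys1/").replace(".gz", "")
    rdd.map(parse_line).saveAsTextFile(out)                           # each save runs as its own job, one after another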
Or would I be better off using Hadoop Streaming for this?