spark-user mailing list archives

From Su She <suhsheka...@gmail.com>
Subject Why are there different "parts" in my CSV?
Date Fri, 13 Feb 2015 06:29:33 GMT
Hello Everyone,

I am writing simple word counts to hdfs using
messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
String.class, (Class) TextOutputFormat.class);

1) However, every 2 seconds I get a new *directory* that is named like a
csv file. So I'll have test.csv, which is a directory containing two files
called part-00000 and part-00001 (something like that). This obviously
makes it very hard for me to read the data stored in the csv files. I am
wondering if there is a better way to store the JavaPairRecieverDStream
and JavaPairDStream?

2) I know there is a copy/merge Hadoop API for merging files... can this be
done from Java? I am not sure how that API would work with Spark Streaming,
which is constantly creating new files.
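
To make the merge in question 2 concrete, here is a minimal sketch of the
concatenation step in plain Java. It assumes one batch directory (such as
test.csv) has already been copied to the local filesystem; PartMerger and
mergeParts are hypothetical names, not part of Spark or Hadoop. On HDFS,
Hadoop 2.x's FileUtil.copyMerge performs the analogous concatenation.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not a Spark or Hadoop class.
final class PartMerger {

    // Concatenates every "part-*" file in batchDir, in name order, into target.
    // Marker files such as _SUCCESS do not match the glob and are skipped.
    // Returns the number of part files that were merged.
    static int mergeParts(Path batchDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(batchDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // part-00000, part-00001, ...

        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : parts) {
                Files.copy(p, out); // append this part's bytes to the merged file
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PartMerger <batch directory> <merged output file>
        int merged = mergeParts(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Merged " + merged + " part files");
    }
}
```

Because the streaming job writes a new directory every batch interval, a
merge like this would have to run once per completed batch directory, and
the target file should live outside the directory being merged.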

Thanks a lot for the help!
