At the end of a set of computations I have a JavaRDD<String>. I want a single file in which each string is written on its own line, in order. The data is small enough that handling the output on a single processor is acceptable, but it may be large enough that using collect to build a list in memory is not.
The saveAsTextFile command creates multiple files with names like part-00000, part-00001, and so on. This was bad behavior in Hadoop for final output, and it is just as bad in Spark.
A more general issue is whether it is possible to convert a JavaRDD into an iterator or iterable over the entire data set without using collect or holding all of the data in memory.
This would be very useful in the many problems where it is desirable to parallelize the intermediate steps but handle the final result in a single process.
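To make the single-process step concrete, here is a minimal sketch of the kind of writer I have in mind. It assumes an Iterator<String> over the data is available. As far as I can tell, JavaRDD offers toLocalIterator(), which is documented to pull one partition at a time to the driver, so it would need about as much memory as the largest partition. Whether that fits this use case is an assumption on my part. The class name SingleFileWriter and the method writeLines are hypothetical, and the sketch uses only the standard library so that it runs without a cluster.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper: streams an iterator of strings into one file.
final class SingleFileWriter {

    // Writes each element as one line, in iteration order, and returns
    // the number of lines written. The method never builds a list of
    // the elements; it handles one string at a time.
    static long writeLines(Iterator<String> lines, Path out) throws IOException {
        long count = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            while (lines.hasNext()) {
                writer.write(lines.next());
                writer.newLine();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data. With Spark, the assumption is that this iterator
        // would come from rdd.toLocalIterator() instead.
        Iterator<String> lines = Arrays.asList("first", "second", "third").iterator();
        Path out = Files.createTempFile("single-output", ".txt");
        long written = writeLines(lines, out);
        System.out.println(written + " lines written to " + out);
    }
}
```

The ordering in the file is simply the order in which the iterator yields elements, so the result is only as ordered as the iterator that feeds it.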