spark-user mailing list archives

From ruben
Subject Re: use case reading files split per id
Date Wed, 16 Nov 2016 17:28:18 GMT
Yes, that binary files function looks interesting. Thanks for the tip.

Some follow-up questions:

- I always wonder what people mean when they talk about 'small' files and
'large' files. Is there a rule of thumb for when each term applies? Are small
files those that fit completely in memory on a node, and large files those
that do not?

- If it works similarly to wholeTextFiles, it will give me tuples like this:
(/base/id1/file1, contentA)
(/base/id1/file2, contentB)
(/base/id2/file1, contentC)
(/base/id2/file2, contentD)

I want to end up with tuples like:
(id1, parsedContentA ++ parsedContentB ++ ...)
(id2, parsedContentC ++ parsedContentD ++ ...)

Would reduceByKey be the best function to accomplish this?
Would using DataFrames give me any benefit here?
This will shuffle the parsedContent values, which are
List[(Timestamp, RecordData)], right? I guess that is not really
something that can be avoided.
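
To make the question concrete, here is a minimal sketch of the pipeline described above, assuming the id is the parent directory name in each path. The `Record` type, the `parse` function (one "timestamp,payload" pair per line), the `idFromPath` helper, and the `/base/*/*` glob are all hypothetical stand-ins for illustration, not the real format:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MergePerId {
  // Hypothetical stand-in for (Timestamp, RecordData).
  type Record = (Long, String)

  // Assumes paths look like .../<id>/<file>, so the id is the
  // second-to-last path segment.
  def idFromPath(path: String): String = {
    val parts = path.split('/').filter(_.nonEmpty)
    parts(parts.length - 2)
  }

  // Hypothetical parser: one "timestamp,payload" record per line.
  // Throws a MatchError on a line without a comma.
  def parse(content: String): List[Record] =
    content.split('\n').toList.filter(_.trim.nonEmpty).map { line =>
      val Array(ts, data) = line.split(",", 2)
      (ts.trim.toLong, data)
    }

  // (path, content) pairs -> (id, all parsed records for that id).
  def mergePerId(files: RDD[(String, String)]): RDD[(String, List[Record])] =
    files
      .map { case (path, content) => (idFromPath(path), parse(content)) }
      .reduceByKey(_ ++ _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-per-id")
      .master("local[*]")
      .getOrCreate()
    // The glob is an assumption based on the example paths above.
    val files = spark.sparkContext.wholeTextFiles("/base/*/*")
    mergePerId(files).take(5).foreach(println)
    spark.stop()
  }
}
```

Since concatenating lists does not shrink the data, the map-side combine in reduceByKey probably saves little here. The shuffle volume should be roughly the size of the parsed records either way.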
