spark-user mailing list archives

From Fengyun RAO <>
Subject Is it possible to read file head in each partition?
Date Wed, 30 Jul 2014 04:02:51 GMT
Hi, all

We are migrating from MapReduce to Spark and have run into a problem.

Our input files are IIS logs, each starting with a file head. It's easy to
get the file head if we process only one file, e.g.

val lines = sc.textFile("hdfs://*/u_ex14073011.log")
val head = lines.take(4)

Then we can write our map method using this head.
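
To make "using this head" concrete, here is a rough sketch of the kind of
helper we have in mind. The name parserFor is hypothetical, and it assumes
the head contains the usual IIS "#Fields:" line with space-separated field
names and that data lines are space-separated as well:

```scala
// Hypothetical helper: builds a line parser from the head of one IIS log.
// Assumption: the head includes a "#Fields: ..." line naming the columns.
def parserFor(head: Seq[String]): String => Map[String, String] = {
  val fields: Seq[String] = head
    .find(_.startsWith("#Fields:"))
    .map(_.stripPrefix("#Fields:").trim.split(" ").toSeq)
    .getOrElse(Seq.empty)
  // Pair each field name with the value in the same position of the line.
  line => fields.zip(line.split(" ")).toMap
}
```

With the RDD above, this could be applied as
lines.filter(!_.startsWith("#")).map(parserFor(head)), which only works
because every line comes from the same file and shares the same head.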

However, if we input multiple files, each of which may have a different
file head, how can we get the file head for each partition?

It seems we have two options:

1. Still use textFile() to get lines.

Since each partition may have a different "head", we would have to write a
mapPartitionsWithContext method. However, we can't find a way to get the
"head" for each partition.

In our former mapreduce program, we could simply use

Path path = ((FileSplit) context.getInputSplit()).getPath();

but there seems to be no equivalent in Spark, since HadoopPartition, which
wraps the InputSplit inside HadoopRDD, is a private class.

2. Use wholeTextFiles() to get the whole contents of each file.

It's easy to get the file head for each file, but according to the
documentation, this API is better suited to small files.
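
For reference, a minimal sketch of what option 2 might look like, using
SparkContext.wholeTextFiles, which returns one (path, content) pair per
file. The helper names splitHead and filesWithHead are hypothetical, and
the head size of 4 is just the value from take(4) above. Each file is held
in memory as a single string, which is exactly the concern with large logs:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: splits one file's content into (head, body) lines.
// Assumption: the head is the first headSize lines of the file.
def splitHead(content: String, headSize: Int = 4): (Seq[String], Seq[String]) =
  content.split("\r?\n").toSeq.splitAt(headSize)

// Hypothetical helper: one element per file, as (path, head, body).
// The whole file is materialized in memory before it is split.
def filesWithHead(sc: SparkContext,
                  path: String): RDD[(String, Seq[String], Seq[String])] =
  sc.wholeTextFiles(path).map { case (file, content) =>
    val (head, body) = splitHead(content)
    (file, head, body)
  }
```

This keeps each head next to the lines it describes, at the cost of losing
the line-level partitioning that textFile() gives.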

*Any suggestions on how to process these files with heads?*
