spark-user mailing list archives

From Vinti Maheshwari <vinti.u...@gmail.com>
Subject Spark Streaming application designing question
Date Mon, 01 Feb 2016 23:32:56 GMT
Hi,

I am new to Spark. I want to set up Spark Streaming to extract key-value
pairs from files in the format below:

file: info1

Note: Each info file contains around 1000 of these records, and our system
generates info files continuously, so I want to aggregate the results with
Spark Streaming.

Can files of this kind be given as input to a Spark cluster? I am interested
only in the "SF" and "DA" delimiters: "SF" corresponds to the source file, and
"DA" corresponds to the (line number, count) pair.

Since this input is not line-oriented, is it a good idea to use these files
directly as Spark input, or do I need an intermediate stage that cleans them
and generates new files with each record's information on a single line
instead of in blocks?
Or can this be achieved in Spark itself?

What would be the right approach?
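
To make the "do it in Spark itself" option concrete, here is a minimal sketch of flattening block-style records into one row per "DA" entry. The sample file did not survive in this message, so the layout is an assumption: each record starts with `SF:<path>`, holds `DA:<line>,<count>` lines, and ends with `end_of_record`. The function name `parse_info_records` is hypothetical, and the code is plain Python so that it runs without a cluster.

```python
def parse_info_records(text):
    """Flatten block records into (source_file, line_number, count) tuples.

    Assumed layout (not confirmed by the original post):
        SF:<path>
        DA:<line>,<count>
        end_of_record
    """
    rows = []
    source_file = None
    for raw in text.splitlines():
        entry = raw.strip()
        if entry.startswith("SF:"):
            # A new record begins; remember which source file it describes.
            source_file = entry[3:]
        elif entry.startswith("DA:") and source_file is not None:
            fields = entry[3:].split(",")
            if len(fields) < 2:
                continue  # malformed DA entry, skip it
            try:
                rows.append((source_file, int(fields[0]), int(fields[1])))
            except ValueError:
                continue  # non-numeric line number or count, skip it
        elif entry == "end_of_record":
            source_file = None
    return rows


sample = (
    "SF:/src/a.c\n"
    "DA:178,3\n"
    "DA:179,0\n"
    "end_of_record\n"
    "SF:/src/b.c\n"
    "DA:12,1\n"
    "end_of_record\n"
)

for row in parse_info_records(sample):
    print(row)
```

In PySpark, a function like this could be passed to `flatMap` over whole-file contents, for example the (path, content) pairs that `SparkContext.wholeTextFiles` returns, which would avoid a separate cleaning stage. Whether that fits a continuously arriving stream of files is a separate question that this sketch does not settle.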



*What do I want to achieve?*
I want line-level information: the line as the key and the info files as the
values. Because the system generates info files continuously, the aggregation
should run through Spark Streaming.

The final output I want looks like this:
line 178 -> (info1, info2, info7.................)
line 2908 -> (info3, info90........................)
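
The grouping behind that output can be sketched in plain Python as follows. This is only an illustration under assumptions: "DA" entries are taken to look like `DA:<line>,<count>`, the sample contents are made up, and the names `da_line_numbers` and `group_info_files_by_line` are hypothetical.

```python
from collections import defaultdict


def da_line_numbers(text):
    """Yield the line number of each 'DA:<line>,<count>' entry (assumed layout)."""
    for raw in text.splitlines():
        entry = raw.strip()
        if entry.startswith("DA:"):
            number = entry[3:].split(",")[0]
            if number.isdigit():
                yield int(number)


def group_info_files_by_line(info_files):
    """Map each line number to the sorted names of the info files that list it."""
    grouped = defaultdict(set)
    for name, text in info_files:
        for number in da_line_numbers(text):
            grouped[number].add(name)
    return {number: sorted(names) for number, names in sorted(grouped.items())}


# Made-up contents for three info files, only to exercise the grouping.
files = [
    ("info1", "SF:/src/a.c\nDA:178,3\nend_of_record\n"),
    ("info2", "SF:/src/a.c\nDA:178,1\nDA:2908,0\nend_of_record\n"),
    ("info3", "SF:/src/a.c\nDA:2908,5\nend_of_record\n"),
]

for number, names in group_info_files_by_line(files).items():
    print(f"line {number} -> ({', '.join(names)})")
```

The sketch keys on the line number alone, as in the desired output. If one info file covers several source files, keying on (source file, line number) would keep lines from different files apart. In Spark Streaming, a running merge of this kind across batches is the sort of per-key state that `updateStateByKey` on a DStream is meant to maintain.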

Do let me know if my explanation is not clear.


Thanks & Regards,
Vinti
