spark-user mailing list archives

From "Kappaganthu, Sivaram (ES)" <Sivaram.Kappagan...@ADP.com>
Subject RE: Spark Streaming-- for each new file in HDFS
Date Wed, 05 Oct 2016 09:14:12 GMT
Hi Franke,

Thanks for your reply. I am trying this approach, proceeding as follows.

Let the third-party application 1) dump the original file in one directory and 2) write the .uploaded
marker file in another directory.
I am writing logic to listen to the directory that contains the .uploaded files.

Here I need to map the file names across both directories. Could you please suggest
how to get the filename in streaming?

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[*]", "test")
val ssc = new StreamingContext(sc, Seconds(4))
val dStream = ssc.textFileStream(pathOfDirToStream)
dStream.foreachRDD { eventsRdd => /* How to get the file name? */ }
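
For reference, one workaround that circulates for exactly this question: use fileStream instead of textFileStream. Each batch RDD that fileStream produces is a union of one RDD per newly detected file, and Spark names each of those per-file RDDs after the path it was read from. This leans on an implementation detail (the RDD names), not a public API, so treat it as a sketch rather than a guaranteed contract:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// fileStream exposes the per-file RDDs that textFileStream hides behind a map().
val keyed = ssc.fileStream[LongWritable, Text, TextInputFormat](pathOfDirToStream)

keyed.foreachRDD { rdd =>
  // The batch RDD is a union with one dependency per newly detected file,
  // and Spark names every per-file RDD after the path it was read from.
  val fileNames = rdd.dependencies.map(_.rdd.name)
  fileNames.foreach(println)
  val lines = rdd.map(_._2.toString) // the file contents, if also needed
}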


From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: Thursday, September 15, 2016 11:02 PM
To: Kappaganthu, Sivaram (ES)
Cc: user@spark.apache.org
Subject: Re: Spark Streaming-- for each new file in HDFS

Hi,
I recommend that the third-party application put an empty file with the same filename as
the original file, but with the extension ".uploaded". This is an indicator that the file has
been fully (!) written to the filesystem. Otherwise you risk reading only parts of the file.
Then you can have a file system listener for this .uploaded file.
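
Concretely, the handover amounts to something like this (the paths and file name here are made up for illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Uploader side: only after /landing/data/report.csv has been fully written
// does it create the empty marker file.
fs.create(new Path("/landing/markers/report.csv.uploaded")).close()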

Spark Streaming or Kafka is not needed/suitable if the server is a file server. You can
use Oozie (maybe with a simple custom action) to poll for .uploaded files and transmit them.
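
A minimal sketch of that polling step, using the Hadoop FileSystem API directly; the directory names are hypothetical, and a real job would archive or delete each marker after processing so a file is not picked up twice:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object UploadedFilePoller {
  // Hypothetical layout; substitute the real landing directories.
  val markerDir = new Path("/landing/markers") // where the .uploaded markers appear
  val dataDir   = new Path("/landing/data")    // where the original files are dumped

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Map each marker back to its data file by stripping the ".uploaded" suffix.
    val readyFiles = fs.listStatus(markerDir)
      .map(_.getPath.getName)
      .filter(_.endsWith(".uploaded"))
      .map(name => new Path(dataDir, name.stripSuffix(".uploaded")))

    // Placeholder handler; a real implementation would hand these paths to the
    // processing job and then remove the marker.
    readyFiles.foreach(p => println(s"ready for processing: $p"))
  }
}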

On 15 Sep 2016, at 19:00, Kappaganthu, Sivaram (ES) <Sivaram.Kappaganthu@ADP.com>
wrote:
Hello,

I am a newbie to Spark and I have the below requirement.

Problem statement: A third-party application is dumping files continuously on a server. Typically
the count is about 100 files per hour, and each file is less than 50 MB in size. My application
has to process those files.

Here,
1) Is it possible for Spark Streaming to trigger a job when a file is placed, instead of triggering
a job at a fixed batch interval?
2) If it is not possible with Spark Streaming, can we control this with Kafka/Flume?

Thanks,
Sivaram
