spark-user mailing list archives

From Alonso Isidoro Roman <alons...@gmail.com>
Subject Re: Spark Streaming-- for each new file in HDFS
Date Wed, 05 Oct 2016 09:56:34 GMT
Why isn't Flume an option here?

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2016-10-05 11:14 GMT+02:00 Kappaganthu, Sivaram (ES) <
Sivaram.Kappaganthu@adp.com>:

> Hi Franke,
>
> Thanks for your reply. I am trying this, as follows.
>
> The third-party application 1) dumps the original file in one directory
> and 2) writes a .upload file in another directory.
>
> I am writing logic to listen to the directory that contains the .upload
> files.
>
> Here I need to map the file names between the two directories. Could you
> please suggest how to get the file name in streaming?
>
> val sc = new SparkContext("local[*]", "test")
> val ssc = new StreamingContext(sc, Seconds(4))
> val dStream = ssc.textFileStream(pathOfDirToStream)
> dStream.foreachRDD { eventsRdd => /* How to get the file name */ }
>
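
On the file-name question quoted above, here is one possible sketch, not a definitive answer. It assumes the Spark 1.x/2.x behaviour where `fileStream` builds each batch as a union of one `NewHadoopRDD` per new file, and it uses the developer API `mapPartitionsWithInputSplit` to read the path from each input split. The object name, the `originalFor` helper, the two command-line arguments, and the "<original name>.upload" marker naming are all assumptions made for illustration.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.{NewHadoopRDD, UnionRDD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileNameStream {
  // Assumption: a marker is named "<original name>.upload".
  val MarkerSuffix = ".upload"

  /** Map a marker path to the matching file in dataDir (plain string logic). */
  def originalFor(markerPath: String, dataDir: String): String = {
    val name = markerPath.substring(markerPath.lastIndexOf('/') + 1)
    dataDir.stripSuffix("/") + "/" + name.stripSuffix(MarkerSuffix)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: the marker directory and the data directory.
    val Array(markerDir, dataDir) = args
    val conf = new SparkConf().setMaster("local[*]").setAppName("test")
    val ssc = new StreamingContext(conf, Seconds(4))

    // fileStream (unlike textFileStream) keeps the per-file Hadoop RDDs
    // directly underneath the batch RDD.
    val markers = ssc.fileStream[LongWritable, Text, TextInputFormat](markerDir)

    // Emit one path per input split; the records themselves are ignored.
    // The cast relies on Spark internals and may break in other versions.
    val markerPaths = markers.transform { rdd =>
      val perFile = rdd.dependencies.map { dep =>
        dep.rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
          .mapPartitionsWithInputSplit {
            (split: InputSplit, _: Iterator[(LongWritable, Text)]) =>
              Iterator(split.asInstanceOf[FileSplit].getPath.toString)
          }
      }
      new UnionRDD(rdd.context, perFile)
    }

    markerPaths.foreachRDD { rdd =>
      // A file with several splits appears several times, so de-duplicate.
      rdd.collect().distinct.foreach { marker =>
        println(s"$marker -> ${originalFor(marker, dataDir)}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Whether a zero-byte marker file produces an input split at all depends on the Hadoop input format and Spark settings, so this would need checking against the actual markers before relying on it.
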
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com]
> *Sent:* Thursday, September 15, 2016 11:02 PM
> *To:* Kappaganthu, Sivaram (ES)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Streaming-- for each new file in HDFS
>
> Hi,
>
> I recommend that the third-party application put an empty file with the
> same filename as the original file, but the extension ".uploaded". This is
> an indicator that the file has been fully (!) written to the fs. Otherwise
> you risk reading only parts of the file.
>
> Then, you can have a file system listener for this .uploaded file.
>
> Spark Streaming or Kafka are not needed/suitable if the server is a file
> server. You can use Oozie (maybe with a simple custom action) to poll for
> .uploaded files and transmit them.
>
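
The marker-file polling described above could be sketched roughly as follows. This is only an illustration for a plain file server using `java.nio.file`, not an Oozie action. It assumes the marker is the original file name with ".uploaded" appended and that it sits in the same directory as the data file; the object and method names are hypothetical.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object MarkerPoller {
  // Assumption: "data.csv" is announced by an empty "data.csv.uploaded".
  val MarkerSuffix = ".uploaded"

  /** Data files in `dir` whose ".uploaded" marker is present. */
  def completedFiles(dir: Path): Seq[Path] = {
    val stream = Files.newDirectoryStream(dir, "*" + MarkerSuffix)
    try {
      stream.asScala.toList
        .map(m => dir.resolve(m.getFileName.toString.stripSuffix(MarkerSuffix)))
        .filter(p => Files.isRegularFile(p)) // skip markers without a data file
        .sortBy(_.toString)
    } finally stream.close()
  }

  /** Completed files that have not been handled in an earlier poll. */
  def newlyCompleted(dir: Path, seen: Set[Path]): Seq[Path] =
    completedFiles(dir).filterNot(seen.contains)
}
```

A caller would invoke `newlyCompleted` on a timer, process or transmit the returned files, and add them to `seen` afterwards.
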
>
> On 15 Sep 2016, at 19:00, Kappaganthu, Sivaram (ES) <
> Sivaram.Kappaganthu@ADP.com> wrote:
>
> Hello,
>
> I am a newbie to Spark and I have the requirement below.
>
> Problem statement: A third-party application is continuously dumping
> files on a server. Typically there are about 100 files per hour, and each
> file is smaller than 50 MB. My application has to process those files.
>
> Here:
>
> 1) Is it possible for Spark Streaming to trigger a job after a file is
> placed, instead of triggering a job at a fixed batch interval?
>
> 2) If it is not possible with Spark Streaming, can we control this with
> Kafka/Flume?
>
> Thanks,
>
> Sivaram
>
