spark-issues mailing list archives

From Micael Capitão (JIRA) <>
Subject [jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
Date Wed, 03 Dec 2014 09:25:12 GMT


Micael Capitão commented on SPARK-3553:

I confirm the weird behaviour when running on HDFS too.
I have a Spark Streaming app with a fileStream on the directory "hdfs:///user/altaia/cdrs/stream".
It runs on YARN and uses checkpointing. For now the application only reads the files
and prints the number of lines read.

Having initially these files:
[1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz
[2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

When I start the application, they are processed. When I add a new file [7] by renaming it
to end with .gz, it is processed too.
[7] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz

But right after [7], Spark Streaming reprocesses some of the initially present files:
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz

It does not repeat anything else on the next batches. When I then add yet another file, it is
not detected, and the stream stays that way.
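For context, file-based streams in Spark 1.x discover input by modification time: each batch selects files whose mod time falls inside a look-back window and that have not been seen yet. The sketch below models that selection in plain Scala to show how reprocessing like the above can arise; `FileInfo` and `findNewFiles` are illustrative names, not Spark's actual API, and this is a model of the failure mode, not a diagnosis.

```scala
// Illustrative model (not Spark's API) of mod-time based file discovery.
case class FileInfo(path: String, modTime: Long)

// A file is picked up when its modification time falls inside the current
// batch's look-back window and it has not been recorded as seen.
def findNewFiles(files: Seq[FileInfo],
                 currentTime: Long,
                 rememberWindow: Long,
                 alreadySeen: Set[String]): Seq[FileInfo] =
  files.filter { f =>
    f.modTime > currentTime - rememberWindow &&
    f.modTime <= currentTime &&
    !alreadySeen.contains(f.path)
  }

// If the "already seen" record loses entries (e.g. across a checkpoint
// restore) while those files' mod times still sit inside the window, the
// files are selected and processed a second time:
val files = Seq(FileInfo("a.gz", 120L), FileInfo("b.gz", 150L))
assert(findNewFiles(files, 200L, 100L, Set.empty).map(_.path) == Seq("a.gz", "b.gz"))
assert(findNewFiles(files, 210L, 100L, Set("b.gz")).map(_.path) == Seq("a.gz"))
```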

> Spark Streaming app streams files that have already been streamed in an endless loop
> ------------------------------------------------------------------------------------
>                 Key: SPARK-3553
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.0.1
>         Environment: Ec2 cluster - YARN
>            Reporter: Ezequiel Bella
>              Labels: S3, Streaming, YARN
> We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data
> nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM.
> The app streams from a directory in S3 which is constantly being written to; this is the
> line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
> The purpose of using fileStream instead of textFileStream is to customize the way that
> Spark handles existing files when the process starts. We want to process only the new files
> added after the process launches and skip the existing ones. We configured a batch
> duration of 10 seconds.
> The process goes fine while we add a small number of files to S3, say 4 or 5. We can
> see in the streaming UI how the stages execute successfully on the executors, one
> for each file that is processed. But when we add a larger number of files, we face
> strange behavior: the application starts streaming files that have already been streamed.

> For example, I add 20 files to S3. The files are processed in 3 batches: the first batch
> processes 7 files, the second 8, and the third 5. No more files are added to S3 at this
> point, but Spark starts repeating these batches endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb
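Since both reports rely on renaming a file to its final .gz name to mark it complete, the `(f: Path) => true` filter passed to fileStream could be tightened so the stream only ever sees finished files. A minimal sketch of such a predicate (the helper name `isReadyFile` is ours, not Spark's; only the fileStream signature in the comment above is from the report):

```scala
// Hypothetical readiness predicate: only files that have already been
// renamed to end in ".gz" are considered complete and eligible for the
// stream; anything else (e.g. a half-written .tmp file) is skipped.
def isReadyFile(pathName: String): Boolean = pathName.endsWith(".gz")

// It would be plugged into fileStream roughly like this (not compiled here,
// since it needs a live StreamingContext and Hadoop classes):
//   ssc.fileStream[LongWritable, Text, TextInputFormat](
//     dir, (f: Path) => isReadyFile(f.getName), newFilesOnly = true)
```

This does not by itself prevent the reprocessing described above, but it keeps in-progress files out of the picture when debugging it.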

This message was sent by Atlassian JIRA
