spark-user mailing list archives

From Chris Fregly <ch...@fregly.com>
Subject Re: Multiple Kinesis Streams in a single Streaming job
Date Fri, 15 May 2015 04:01:37 GMT
another option (not really recommended, but worth mentioning) would be to put the dynamodb
checkpoint table for one stream in a different region from the other stream's table - or even
in a different region from the kinesis stream itself.

this isn't available right now, but will be in Spark 1.4.
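
for reference, a minimal sketch of how this could look with the Spark 1.4 KinesisUtils.createStream overload, which (as far as i know) takes an explicit kinesisAppName (the KCL application name, and so the DynamoDB table name) plus a regionName that the KCL uses for DynamoDB. the application names "test-erich-test" and "test-dev-loader", the helper `receiver` and the object name are made up for illustration; the stream names and endpoint come from the code quoted below. this has not been run against a live account.

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object TwoKinesisStreams {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("test")
    val ssc = new StreamingContext(conf, Seconds(1))
    val endpoint = "kinesis.us-east-1.amazonaws.com"

    // hypothetical helper: one receiver per stream, each with its own KCL
    // application name, so each stream checkpoints into its own DynamoDB table
    def receiver(kinesisAppName: String, streamName: String, dynamoRegion: String) =
      KinesisUtils.createStream(
        ssc, kinesisAppName, streamName, endpoint, dynamoRegion,
        InitialPositionInStream.TRIM_HORIZON, Seconds(1), StorageLevel.MEMORY_ONLY)

    // application names are assumptions; the region argument is the knob the
    // "separate dynamodb region" option would change
    val raw = receiver("test-erich-test", "erich-test", "us-east-1")
    val loader = receiver("test-dev-loader", "dev-loader", "us-east-1")

    // both receivers yield DStream[Array[Byte]], so they can be union'd
    raw.union(loader).map(bytes => new String(bytes, "UTF-8")).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

giving each stream its own application name is the more direct way to keep the checkpoint tables apart; varying only the region would also separate them, but spreads the tables across regions.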

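to make the diagnosis quoted below concrete, here is a toy model (plain Scala, no AWS calls) of a checkpoint store that has one table per application name and is keyed only by shard id. `LeaseTables` and the sequence-number strings are hypothetical stand-ins, not KCL classes; the only point is that two streams sharing an application name and a shard id end up reading each other's checkpoints.

```scala
import scala.collection.mutable

object CheckpointCollision {
  // toy stand-in for the checkpoint store: one table per application name,
  // rows keyed by shard id only (no stream name anywhere in the key)
  final class LeaseTables {
    private val tables = mutable.Map.empty[String, mutable.Map[String, String]]

    def checkpoint(appName: String, shardId: String, seqNo: String): Unit =
      tables.getOrElseUpdate(appName, mutable.Map.empty[String, String]).update(shardId, seqNo)

    def lastCheckpoint(appName: String, shardId: String): Option[String] =
      tables.get(appName).flatMap(_.get(shardId))
  }

  def main(args: Array[String]): Unit = {
    val shard = "shardId-000000000002"
    val store = new LeaseTables

    // behaviour described in the thread: both receivers share the app name "test"
    store.checkpoint("test", shard, "seq-from-erich-test")
    // the dev-loader receiver looks up the same shard id and finds erich-test's position
    println(store.lastCheckpoint("test", shard))

    // with a distinct (hypothetical) application name there is nothing to collide with
    println(store.lastCheckpoint("test-dev-loader", shard))
  }
}
```
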
> On May 14, 2015, at 6:47 PM, Erich Ess <erich@simplerelevance.com> wrote:
> 
> Hi Tathagata,
> 
> I think that's exactly what's happening.
> 
> The error message is: "com.amazonaws.services.kinesis.model.InvalidArgumentException:
> StartingSequenceNumber 49550673839151225431779125105915140284622031848663416866
> used in GetShardIterator on shard shardId-000000000002 in stream erich-test
> under account xxxxxxx is invalid because it did not come from this stream".
> 
> I looked at the DynamoDB table: each job has a single table, and that table does
> not contain any stream identification information, only shard checkpointing data.
> I think the error occurs because, when the job tries to read from stream B, it uses
> the checkpointing data for stream A and errors out.  So it appears, at first glance,
> that you currently can't read from multiple Kinesis streams in a single job.  I
> haven't tried this, but it might work if I force each stream to have different
> shard IDs so there is no ambiguity in the DynamoDB table; however, that's clearly
> not a feasible production solution.
> 
> Thanks,
> -Erich
> 
>> On Thu, May 14, 2015 at 8:34 PM, Tathagata Das <tdas@databricks.com> wrote:
>> A possible problem is that the Kinesis stream in 1.3 uses the SparkContext app
>> name as the Kinesis Application Name, which the Kinesis Client Library uses to
>> save checkpoints in DynamoDB. Since both Kinesis DStreams are using the same
>> Kinesis application name (as they are in the same StreamingContext / SparkContext
>> / Spark app name), KCL may be writing the checkpoint information of both Kinesis
>> streams into the same DynamoDB table, with one overwriting the other. Either way,
>> this is going to be fixed in Spark 1.4.
>> 
>> On Thu, May 14, 2015 at 4:10 PM, Chris Fregly <chris@fregly.com> wrote:
>>> have you tried to union the 2 streams per the KinesisWordCountASL example where
>>> 2 streams (against the same Kinesis stream in this case) are created and union'd?
>>> 
>>> it should work the same way - including union() of streams from totally different
>>> source types (kafka, kinesis, flume).
>>> 
>>> 
>>> 
>>>> On Thu, May 14, 2015 at 2:07 PM, Tathagata Das <tdas@databricks.com> wrote:
>>>> What is the error you are seeing?
>>>> 
>>>> TD
>>>> 
>>>>> On Thu, May 14, 2015 at 9:00 AM, Erich Ess <erich@simplerelevance.com> wrote:
>>>>> Hi,
>>>>> 
>>>>> Is it possible to set up streams from multiple Kinesis streams and process
>>>>> them in a single job?  From what I have read, this should be possible;
>>>>> however, the Kinesis layer errors out whenever I try to receive from more
>>>>> than a single Kinesis stream.
>>>>> 
>>>>> Here is the code.  Currently, I am focused on just getting receivers set up
>>>>> and working for the two Kinesis streams; as such, this code just attempts to
>>>>> print out the contents of both streams:
>>>>> 
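>>>>> import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
>>>>> import org.apache.spark.SparkConf
>>>>> import org.apache.spark.storage.StorageLevel
>>>>> import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}
>>>>> import org.apache.spark.streaming.kinesis.KinesisUtils
>>>>> import org.json4s.NoTypeHints
>>>>> import org.json4s.jackson.Serialization // or org.json4s.native.Serialization
>>>>> 
>>>>> // json4s formats (not used by the two print calls below)
>>>>> implicit val formats = Serialization.formats(NoTypeHints)
>>>>> 
>>>>> val conf = new SparkConf().setMaster("local[*]").setAppName("test")
>>>>> val ssc = new StreamingContext(conf, Seconds(1))
>>>>> 
>>>>> // receiver for the first Kinesis stream
>>>>> val rawStream = KinesisUtils.createStream(
>>>>>   ssc,
>>>>>   "erich-test",
>>>>>   "kinesis.us-east-1.amazonaws.com",
>>>>>   Duration(1000),
>>>>>   InitialPositionInStream.TRIM_HORIZON,
>>>>>   StorageLevel.MEMORY_ONLY)
>>>>> rawStream.map(msg => new String(msg)).print()
>>>>> 
>>>>> // receiver for the second Kinesis stream
>>>>> val loaderStream = KinesisUtils.createStream(
>>>>>   ssc,
>>>>>   "dev-loader",
>>>>>   "kinesis.us-east-1.amazonaws.com",
>>>>>   Duration(1000),
>>>>>   InitialPositionInStream.TRIM_HORIZON,
>>>>>   StorageLevel.MEMORY_ONLY)
>>>>> loaderStream.map(msg => new String(msg)).print()
>>>>> 
>>>>> ssc.start()
>>>>> ssc.awaitTermination() // keep the driver alive so the receivers keep running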
>>>>> 
>>>>> Thanks,
>>>>> -Erich
