spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nikolay Zhebet <phpap...@gmail.com>
Subject Re: multiple spark streaming contexts
Date Mon, 01 Aug 2016 09:28:20 GMT
You always can save data in hdfs where you need, and you can controll
paralelizm in your app by configuring --driver-cores and --driver-memory.This
approach can maintain Spark master and it can controll your failure issues,
data locality and etc. But if you want to controll it by self with
"Executors.newFixedThreadPool(threadNum)" or other ways, i think you can
catch problems with yarn/mesos job recovery and failure mechanizm.
I wish you good luck in your struggle of parallelism )) This is an
interesting question!)

2016-08-01 10:41 GMT+03:00 Sumit Khanna <sumit.khanna@askme.in>:

> Hey Nikolay,
>
> I know the approach, but this pretty much doesnt fit the bill for my
> usecase wherein each topic needs to be logged / persisted as a separate
> hdfs location.
>
> I am looking for something where a streaming context pertains to a topic
> and that topic only, and was wondering if I could have them all in parallel
> in one app / jar run.
>
> Thanks,
>
> On Mon, Aug 1, 2016 at 1:08 PM, Nikolay Zhebet <phpapple@gmail.com> wrote:
>
>> Hi, If you want read several kafka topics in spark-streaming job, you can
>> set names of topics splited by coma and after that you can read all
>> messages from all topics in one flow:
>>
>> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>>
>> val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc,
kafkaParams, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
>>
>>
>> After that you can use ".filter" function for splitting your topics and iterate messages
separately.
>>
>> val orders_paid = lines.filter(x => { x("table_name") == "kismia.orders_paid"})
>>
>> orders_paid.foreachRDD( rdd => { ....
>>
>>
>> Or you can you you if..else construction for splitting your messages by
>> names in foreachRDD:
>>
>> lines.foreachRDD((recrdd, time: Time) => {
>>
>>    recrdd.foreachPartition(part => {
>>
>>       part.foreach(item_row => {
>>
>>          if (item_row("table_name") == "kismia.orders_paid") { ...} else if (...)
{...}
>>
>> ....
>>
>>
>> 2016-08-01 9:39 GMT+03:00 Sumit Khanna <sumit.khanna@askme.in>:
>>
>>> Any ideas guys? What are the best practices for multiple streams to be
>>> processed?
>>> I could trace a few Stack overflow comments wherein they better
>>> recommend a jar separate for each stream / use case. But that isn't pretty
>>> much what I want, as in it's better if one / multiple spark streaming
>>> contexts can all be handled well within a single jar.
>>>
>>> Guys please reply,
>>>
>>> Awaiting,
>>>
>>> Thanks,
>>> Sumit
>>>
>>> On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna <sumit.khanna@askme.in>
>>> wrote:
>>>
>>>> Any ideas on this one guys ?
>>>>
>>>> I can do a sample run but can't be sure of imminent problems if any?
>>>> How can I ensure different batchDuration etc etc in here, per
>>>> StreamingContext.
>>>>
>>>> Thanks,
>>>>
>>>> On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna <sumit.khanna@askme.in>
>>>> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Was wondering if I could create multiple spark stream contexts in my
>>>>> application (e.g instantiating a worker actor per topic and it has its
own
>>>>> streaming context its own batch duration everything).
>>>>>
>>>>> What are the caveats if any?
>>>>> What are the best practices?
>>>>>
>>>>> Have googled half heartedly on the same but the air isn't pretty much
>>>>> demystified yet. I could skim through something like
>>>>>
>>>>>
>>>>> http://stackoverflow.com/questions/29612726/how-do-you-setup-multiple-spark-streaming-jobs-with-different-batch-durations
>>>>>
>>>>>
>>>>> http://stackoverflow.com/questions/37006565/multiple-spark-streaming-contexts-on-one-worker
>>>>>
>>>>> Thanks in Advance!
>>>>> Sumit
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message