spark-user mailing list archives

From Alex Kozlov <ale...@gmail.com>
Subject Re: Processing millions of messages in milliseconds -- Architecture guide required
Date Tue, 19 Apr 2016 07:47:31 GMT
This is too big a topic.  For starters, how much latency can you tolerate
between obtaining the data and having it available for analysis?  If this
is < 5 minutes, you probably need a streaming solution.  How quickly do the
"micro batches of seconds" need to be available for analysis?  Can the
problem be easily partitioned, and how flexible are you in the # of machines
for your solution?  Are you OK with fat tails in availability?
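
To make the partitioning question concrete, here is a minimal sketch of
key-based hash partitioning.  The key names and the partition count are
made up for illustration; they are not from the original question.  If
every message carries a key like this and processing never needs to look
across keys, the problem partitions easily and you can scale by adding
machines.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical message keys; in practice this might be a device or user id.
keys = [f"device-{i}" for i in range(10_000)]
num_partitions = 8  # assumed number of workers

# Count how many keys land on each partition to eyeball the balance.
counts = [0] * num_partitions
for k in keys:
    counts[partition_for(k, num_partitions)] += 1

print(counts)
```

The same key always maps to the same partition, so per-key state can stay
local to one worker.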

Another question: how big is an individual message in bytes?  XML/JSON are
extremely inefficient, and with "10 mils of messages" you might hit other
bottlenecks, such as the network, unless you convert them into a more
compact, machine-oriented format such as Protobuf, Avro or Thrift.
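
A rough way to see the size difference, using only the Python standard
library.  Note the assumptions: the record and its field names are
hypothetical, `struct` is only a stand-in for a real binary format
(it is not Protobuf, Avro or Thrift), and the rate of 10 million messages
per second is one possible reading of "10 mils of messages".

```python
import json
import struct

# Hypothetical sensor reading; field names and types are assumptions.
reading = {"sensor_id": 123456, "timestamp_ms": 1461051651000, "value": 21.5}

# Text encoding: field names and punctuation travel with every message.
json_bytes = json.dumps(reading).encode("utf-8")

# Fixed binary layout: unsigned 32-bit id, unsigned 64-bit timestamp,
# 64-bit float. "<" selects little-endian with no padding.
binary_bytes = struct.pack(
    "<IQd", reading["sensor_id"], reading["timestamp_ms"], reading["value"]
)

def gbit_per_second(message_bytes: int, messages_per_second: int) -> float:
    """Payload bandwidth in Gbit/s, ignoring protocol overhead."""
    return message_bytes * 8 * messages_per_second / 1e9

rate = 10_000_000  # assumed: 10 million messages per second

print("JSON bytes:  ", len(json_bytes), gbit_per_second(len(json_bytes), rate))
print("binary bytes:", len(binary_bytes), gbit_per_second(len(binary_bytes), rate))
```

Real schema-based formats add some framing of their own, so treat the
numbers as an order-of-magnitude estimate only.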

From the top, look at Kafka, Flume, Storm.

To "serve the historical  data in milliseconds (may be upto 30 days of
data)" you'll need to cache data in memory.  The question, again, is how
often the data change.  You might look into Lambda architectures.
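
A toy sketch of the Lambda idea, with everything held in memory: a batch
view precomputed over historical data, a real-time view for whatever
arrived since the last batch run, and a query that merges the two.  The
sensor names and counts are invented for illustration, and a real system
would put distributed stores behind each layer.

```python
from collections import Counter

# Batch layer: a view precomputed over historical data (e.g. a nightly job).
batch_view = Counter({"sensor-a": 1000, "sensor-b": 250})

# Speed layer: counts for messages that arrived after the last batch run.
realtime_view = Counter()

def on_message(sensor: str) -> None:
    """Update the real-time view as each new message arrives."""
    realtime_view[sensor] += 1

def query(sensor: str) -> int:
    """Serving layer: merge the batch and real-time views at read time."""
    return batch_view[sensor] + realtime_view[sensor]

for s in ["sensor-a", "sensor-a", "sensor-c"]:
    on_message(s)

print(query("sensor-a"), query("sensor-b"), query("sensor-c"))
```

When the next batch run absorbs the recent messages, the real-time view
is reset, which is why the rate of change of the data matters so much.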

On Mon, Apr 18, 2016 at 10:21 PM, Prashant Sharma <scrapcodes@gmail.com>
wrote:

> Hello Deepak,
>
> It is not clear what you want to do. Are you talking about spark streaming
> ? It is possible to process historical data in Spark batch mode too. You
> can add a timestamp field in xml/json. Spark documentation is at
> spark.apache.org. Spark has good inbuilt features to process json and
> xml[1] messages.
>
> Thanks,
> Prashant Sharma
>
> 1. https://github.com/databricks/spark-xml
>
> On Tue, Apr 19, 2016 at 10:31 AM, Deepak Sharma <deepakmca05@gmail.com>
> wrote:
>
>> Hi all,
>> I am looking for an architecture to ingest 10 mils of messages in the
>> micro batches of seconds.
>> If anyone has worked on similar kind of architecture  , can you please
>> point me to any documentation around the same like what should be the
>> architecture , which all components/big data ecosystem tools should i
>> consider etc.
>> The messages has to be in xml/json format , a preprocessor engine or
>> message enhancer and then finally a processor.
>> I thought about using data cache as well for serving the data
>> The data cache should have the capability to serve the historical  data
>> in milliseconds (may be upto 30 days of data)
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>>
>>
--
Alex Kozlov
alexvk@gmail.com
