spark-user mailing list archives

From Jayant Shekhar <>
Subject Re: Is Spark streaming suitable for our architecture?
Date Thu, 23 Oct 2014 16:34:24 GMT
Hi Albert,

Since your latency requirements are around 1-2m, Spark Streaming should be a
good solution. You may also want to check whether streaming and processing the
data in Flume, writing the results out to HDFS, would suffice.
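If you go the Spark Streaming route, a rough spark-shell style sketch of the
skeleton could look like the one below (untested); the socket source, the HDFS
path and the 60s batch interval are only placeholders for your own setup:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Batch interval picked to roughly match the 1-2m latency target.
    val ssc = new StreamingContext(sc, Seconds(60))

    // Placeholder source: one serialized document per line. In practice this
    // would be Kafka, Flume or whatever the crawlers feed into.
    val docs = ssc.socketTextStream("crawler-host", 9999)

    // Each batch is written out as a new set of files under this prefix.
    docs.saveAsTextFiles("hdfs:///data/docs/raw")

    ssc.start()
    ssc.awaitTermination()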

> crawling, keyword extraction, language detection, indexation
> On each step we add additional data to the document, for example at the
> language detection step we begin with a document without a language, and we
> output the document with a new language field

Can all the computations for a document be done in a single map function?
Creating fewer intermediate objects should help improve performance.
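Something like the sketch below is what I have in mind (untested, spark-shell
style; detectLanguage and extractKeywords are just stand-ins for your real
enrichment logic). One map chains all the per-document steps, and persist() is
only worth calling if the same enriched stream feeds more than one output:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Stand-ins for your real enrichment logic.
    def detectLanguage(doc: Map[String, String]): Map[String, String] =
      doc + ("language" -> "en")
    def extractKeywords(doc: Map[String, String]): Map[String, String] =
      doc + ("keywords" -> "spark,streaming")

    val ssc = new StreamingContext(sc, Seconds(60))
    val raw = ssc.socketTextStream("crawler-host", 9999)  // placeholder source

    // All per-document steps chained inside a single map, so no intermediate
    // DStreams are materialized between the workflow stages.
    val enriched = raw
      .map(line => Map("id" -> line))
      .map(doc => extractKeywords(detectLanguage(doc)))

    // persist() only pays off when the same stream is consumed more than once.
    enriched.persist()
    enriched.saveAsTextFiles("hdfs:///data/docs/enriched")
    enriched.count().print()

    ssc.start()
    ssc.awaitTermination()

If everything happens inside one map you usually do not need persist() at all;
it only helps when you branch the same stream into several outputs.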


On Thu, Oct 23, 2014 at 4:56 AM, Albert Vila <> wrote:

> Hi Jayant,
> On 23 October 2014 11:14, Jayant Shekhar <> wrote:
>> Hi Albert,
>> Have a couple of questions:
>>    - You mentioned near real-time. What exactly is your SLA for
>>    processing each document?
> The lower the better :). Right now it's between 30s and 5m, but I would
> like to have something stable around 1-2m if possible, taking into account
> that the system should be able to scale to 50M - 100M documents.
>>    - Which crawler are you using, and are you looking to bring Hadoop
>>    into your overall workflow? You might want to read up on how network
>>    traffic is minimized/managed on the Hadoop cluster - as you had run into
>>    network issues with your current architecture.
> Everything is developed by us. The network issues were not related to the
> crawler itself; they were related to the documents we were moving around
> the system to be processed at each workflow stage. And yes, we are
> currently researching whether we can introduce Spark Streaming to be able to
> scale and execute all workflow stages, and use HDFS/Cassandra to store the
> data.
> Should we use the DStream persist function (if we treat every document as
> an RDD) in order to reuse the same data, or is it better to create new
> DStreams? At each step we add additional data to the document, for example
> at the language detection step we begin with a document without a language,
> and we output the document with a new language field.
> Thanks
>> Thanks!
>> On Thu, Oct 23, 2014 at 12:07 AM, Albert Vila <>
>> wrote:
>>> Hi
>>> I'm evaluating Spark Streaming to see if it fits to scale our current
>>> architecture.
>>> We are currently downloading and processing 6M documents per day from
>>> online and social media. We have a different workflow for each type of
>>> document, but some of the steps are keyword extraction, language detection,
>>> clustering, classification, indexation, .... We are using Gearman to
>>> dispatch the jobs to workers, and we have some queues in a database.
>>> Everything is in near real time.
>>> I'm wondering if we could integrate Spark Streaming into the current
>>> workflow and if it's feasible. One of our main discussions is whether we
>>> have to go to a fully distributed architecture or a semi-distributed one.
>>> I mean, distribute everything or process some steps on the same machine
>>> (crawling, keyword extraction, language detection, indexation). We don't
>>> know which one scales better; each one has pros and cons.
>>> Right now we have a semi-distributed one, as we had network problems given
>>> the amount of data we were moving around. So now, all documents crawled on
>>> server X are later dispatched through Gearman to the same server. What we
>>> dispatch through Gearman is only the document id, and the document data
>>> remains on the crawling server in Memcached, so the network traffic is
>>> kept to a minimum.
>>> Is it feasible to remove all the database queues and Gearman and move to
>>> Spark Streaming? We are evaluating adding Kafka to the system too.
>>> Is anyone using Spark Streaming for a system like ours?
>>> Should we worry about the network traffic, or is it something Spark can
>>> manage without problems? Every document is around 50k (300GB a day +/-).
>>> If we wanted to isolate some steps to be processed on the same machine(s)
>>> (or give them priority), is that something we could do with Spark?
>>> Any help or comment will be appreciated. And if someone has had a similar
>>> problem, any knowledge about the architecture approach will be more than
>>> welcome.
>>> Thanks
