spark-user mailing list archives

From Albert Vila <>
Subject Re: Is Spark streaming suitable for our architecture?
Date Thu, 23 Oct 2014 11:56:07 GMT
Hi Jayant,

On 23 October 2014 11:14, Jayant Shekhar <> wrote:

> Hi Albert,
> Have a couple of questions:
>    - You mentioned near real-time. What exactly is your SLA for
>    processing each document?
The lower the better :). Right now it's between 30 s and 5 min, but I would
like something stable around 1-2 min if possible, taking into account that
the system should be able to scale to 50M - 100M documents.

>    - Which crawler are you using and are you looking to bring in Hadoop
>    into your overall workflow. You might want to read up on how network
>    traffic is minimized/managed on the Hadoop cluster - as you had run into
>    network issues with your current architecture.
Everything is developed in-house. The network issues were not related to the
crawler itself; they were related to the documents we were moving around
the system to be processed at each workflow stage. And yes, we are
currently researching whether we can introduce Spark Streaming to scale
and execute all workflow stages, and use HDFS/Cassandra to store the

Should we use the DStream persist function (if we treat every document as an
RDD) in order to reuse the same data, or is it better to create new
DStreams? At each step we add data to the document. For example, in
language extraction we begin with a document without a language and
output the document with a new language field.
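
A minimal sketch of the enrichment pattern described above, in Python. It
assumes each document is a plain dict and that every workflow stage is a pure
function returning an enriched copy. The function names, the toy language
detector and the keyword rule are all hypothetical placeholders, not part of
our system or of Spark. Note that a DStream is a sequence of RDDs, each
holding all documents received in one batch interval, so a document would be
an element of an RDD rather than an RDD of its own. Transformations such as
map() always return a new DStream; caching only pays off when more than one
downstream stage reads the same intermediate stream.

```python
# Hypothetical sketch: each stage takes a document dict and returns an enriched copy.

def detect_language(text):
    """Toy placeholder detector (assumption); a real system would call its own library."""
    spanish_markers = {"el", "la", "los", "que", "de"}
    words = set(text.lower().split())
    return "es" if len(words & spanish_markers) >= 2 else "en"


def add_language(doc):
    """Return a copy of the document with a new 'language' field."""
    enriched = dict(doc)  # copy, so the input document is left untouched
    enriched["language"] = detect_language(doc["text"])
    return enriched


def add_keywords(doc):
    """Return a copy with a 'keywords' field (toy rule: words longer than 6 letters)."""
    enriched = dict(doc)
    words = [w.strip(".,").lower() for w in doc["text"].split()]
    enriched["keywords"] = sorted({w for w in words if len(w) > 6})
    return enriched


def build_pipeline(docs):
    """Chain the stages on any stream exposing map() and cache(),
    for example a PySpark DStream whose elements are document dicts."""
    with_language = docs.map(add_language)
    # Cache only because two consumers read this stream: the caller
    # (first return value) and the keyword stage below.
    with_language.cache()
    with_keywords = with_language.map(add_keywords)
    return with_language, with_keywords
```

Because the stage functions are plain Python, they can be checked without a
cluster, e.g. add_language({"id": 1, "text": "el gato que duerme"}) yields a
copy of the document carrying a "language" field.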


> Thanks!
> On Thu, Oct 23, 2014 at 12:07 AM, Albert Vila <>
> wrote:
>> Hi
>> I'm evaluating Spark Streaming to see if it fits to scale our current
>> architecture.
>> We are currently downloading and processing 6M documents per day from
>> online and social media. We have a different workflow for each type of
>> document, but some of the steps are keyword extraction, language detection,
>> clustering, classification, indexation, .... We are using Gearman to
>> dispatch the job to workers and we have some queues on a database.
>> Everything is in near real time.
>> I'm wondering if we could integrate Spark Streaming into the current
>> workflow and whether it's feasible. One of our main discussions is whether
>> we have to go to a fully distributed architecture or a semi-distributed one.
>> I mean, distribute everything or process some steps on the same machine
>> (crawling, keyword extraction, language detection, indexation). We don't
>> know which one scales better; each one has pros and cons.
>> We now have a semi-distributed one, as we had network problems given the
>> amount of data we were moving around. So now, all documents
>> crawled on server X are later dispatched through Gearman to the same
>> server. What we dispatch on Gearman is only the document id, and the
>> document data remains on the crawling server in Memcached, so the network
>> traffic is kept to a minimum.
>> Is it feasible to remove all database queues and Gearman and move to Spark
>> Streaming? We are also evaluating adding Kafka to the system.
>> Is anyone using Spark streaming for a system like ours?
>> Should we worry about the network traffic, or is it something Spark can
>> manage without problems? Every document is around 50k (300 GB a day +/-).
>> If we wanted to isolate some steps to be processed on the same machine/s
>> (or give them priority), is that something we could do with Spark?
>> Any help or comment will be appreciated. And if someone has had a similar
>> problem and has knowledge about the architecture approach, it will be more
>> than welcome.
>> Thanks
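
As a rough check on the volume figures quoted above (6M documents per day at
around 50k each), the arithmetic can be sketched as follows. It assumes "50k"
means 50,000 bytes and that load is spread evenly over the day; real traffic
will have peaks, and every workflow stage that ships the full document to
another machine adds to the total.

```python
# Back-of-the-envelope throughput from the figures in the original message.
DOCS_PER_DAY = 6_000_000
AVG_DOC_BYTES = 50_000      # "around 50k", read as decimal kilobytes (assumption)
SECONDS_PER_DAY = 86_400

bytes_per_day = DOCS_PER_DAY * AVG_DOC_BYTES           # total daily volume
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY       # average arrival rate
megabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 1e6
megabits_per_second = megabytes_per_second * 8

print(f"{bytes_per_day / 1e9:.0f} GB/day")             # 300 GB/day
print(f"{docs_per_second:.1f} docs/s")                 # 69.4 docs/s
print(f"{megabytes_per_second:.2f} MB/s")              # 3.47 MB/s
print(f"{megabits_per_second:.1f} Mbit/s")             # 27.8 Mbit/s
```

This is only the average for a single pass over the data, so it is a lower
bound on the traffic a fully distributed design would generate.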
