storm-user mailing list archives

From "Kalogeropoulos, Andreas" <Andreas.Kalogeropou...@emc.com>
Subject RE: Using Storm to parse emails and creates batches
Date Tue, 01 Dec 2015 08:15:54 GMT
Hello Stephen,

I think you got it correctly. Thanks a lot for the idea.
If you have seen limitations, please send the disclaimers ☺. For example, how did you handle
persistence of this collection? If the third bolt fails while populating the collection
(before the size or time limit has been reached), we lose everything it holds, so I need a status
loopback of what was really output. Right?

Of course, if you can send me the code of your third bolt (especially the collection handling),
I’ll be grateful.
In any case, thanks a lot for your help. Even without the code, you have given me excellent
advice, and now I can start building something.

Kind Regards,
Andréas Kalogéropoulos

From: Stephen Powis [mailto:spowis@salesforce.com]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

From what I understand from your description, you want bolt 3 to collect results from multiple
tuples and build a single xml for them.  We've done this by essentially doing the following:

Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the collection and check
the size of the collection.  Once the size of the collection exceeds some number, we then
process all of the tuples in one go, and then ACK all of them after the processing completes.

Building on that, we've implemented an additional constraint on time.  If the collection size
> N OR if we've waited more than X seconds, process the batch.  This way your output won't
stall out if your topology has a lull in data being ingested.
And then lastly, there's a corner case where say 10 tuples come in and get held by our collection
but then no other tuples come in for a long period of time.  If no tuples enter, that means
the size and timeout checks are never executed and your bolt will hold onto those tuples for
a long time (potentially causing timeouts).  To handle this, we made use of tick tuples.
Tick tuples essentially allow you to send a special tuple to your bolt every Y seconds.
We use that to make sure the time constraint is checked on a regular basis (for example,
by sending a tick tuple every 1, 5, or 10 seconds).
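
For illustration, the size / time / tick logic described above might be sketched roughly as follows. This is not the code from the thread: the class BatchBuffer and its methods add, tick and flushIfDue are hypothetical names, the sketch is deliberately independent of the Storm API, and it flushes when the size reaches N (>=) rather than strictly exceeding it. Timestamps are passed in explicitly so the behaviour is easy to check.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Size-or-time batching buffer, sketched independently of Storm.
 * All names here are hypothetical; they are not taken from the thread.
 */
class BatchBuffer<T> {
    private final int maxSize;        // flush once this many items are held ("N")
    private final long maxAgeMillis;  // flush once the oldest item is this old ("X")
    private final List<T> pending = new ArrayList<>();
    private long oldestMillis = -1;   // arrival time of the first pending item

    BatchBuffer(int maxSize, long maxAgeMillis) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Call for every data tuple. Returns the batch to process, or an empty list. */
    List<T> add(T item, long nowMillis) {
        if (pending.isEmpty()) {
            oldestMillis = nowMillis;
        }
        pending.add(item);
        return flushIfDue(nowMillis);
    }

    /** Call for every tick tuple, so the time check runs even when no data arrives. */
    List<T> tick(long nowMillis) {
        return flushIfDue(nowMillis);
    }

    /** Hands back and clears the pending items if either limit has been reached. */
    private List<T> flushIfDue(long nowMillis) {
        boolean full = pending.size() >= maxSize;
        boolean stale = !pending.isEmpty() && nowMillis - oldestMillis >= maxAgeMillis;
        if (!full && !stale) {
            return new ArrayList<>();
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        oldestMillis = -1;
        return batch;
    }
}
```

In a bolt, execute() would presumably call add(...) for ordinary tuples and tick(...) for tick tuples, process whatever batch comes back, and only then ack each tuple in it. The tick interval is set per component through Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, and X should stay well below the topology's message timeout so that held tuples are not failed and replayed while they wait.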

On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <Andreas.Kalogeropoulos@emc.com>
wrote:
Hello,

I want to use Storm to do three things:

1. Parse email data (from / to / cc / subject) from an incoming SMTP source
2. Add additional information (based on sender email)
3. Create an XML based on this data, to inject into another solution

The only issue: I want steps 1 and 2 to be as fast as possible, so I want to create as many
bolts/tasks as possible,
but I want the XML to be as big as possible, so it has to gather the output of multiple
bolts.

Following this logic, if I have 100 mails per second in the original input, I would want steps 1
and 2 to each work on the smallest possible number of emails, so they run faster.
But I still want to end up with an XML that represents 10,000+ emails.

I can’t think of a topology that addresses this.
Can someone give me some pointers to the best way to handle this?


Kind Regards,
Andréas Kalogéropoulos

