storm-user mailing list archives

From "Kalogeropoulos, Andreas" <Andreas.Kalogeropou...@emc.com>
Subject RE: Using Storm to parse emails and creates batches
Date Tue, 01 Dec 2015 08:09:53 GMT
Hello Nick,

I think you are right. It is probably the state that I am not taking into consideration in
my logic.
And it is probably only in the last step.

The first is just “extract”, so as you say, I need a “filtering” bolt to take out only
what I need.
The second is probably going to read from a Cassandra database and add elements to my tuples,
based on the keys coming from my tuples.

It is the third one that needs state, because I want to wait for multiple outputs coming
from the previous bolt before it does its work. If the second step works with a list of 10 tuples,
I would need to wait for (for example) 100 of them to create an XML with 1000 items (100 x 10 tuples).
Hence it is indeed the “in a logical window of operation (either purely time-based or tuple-based)”
part that I need to implement. Thanks for the pointers.
If you have any “watch outs” on best practices, limitations, or example code, they are more
than welcome, but at least you got me going in the direction I need.
Thanks.


Kind Regards,
Andréas Kalogéropoulos

From: Nick R. Katsipoulakis [mailto:nick.katsip@gmail.com]
Sent: Monday, November 30, 2015 5:55 PM
To: user@storm.apache.org
Subject: Re: Using Storm to parse emails and creates batches

Hello Andrea,

Please check my inline answers below. However, I think it is not the topology that is puzzling
you (since you have already defined the workflow in steps), but rather the semantics of the data
involved. To be more precise, you seem to need some state maintained on different bolts. You have
to define how often the state is updated, where it is stored, whether it is window-based or
historically accumulated, etc. Also, if you manage to have your operators work in a stateless
way (applying a function to each input tuple), then the challenging part would be to mitigate
any I/O (i.e. contacting external storage) and the processing cost. I hope that you will find
my email useful.


On Mon, Nov 30, 2015 at 11:42 AM, Kalogeropoulos, Andreas <Andreas.Kalogeropoulos@emc.com>
wrote:
Hello,

I want to use Storm to do three things :

1. Parse email data (from / to / cc / subject) from an incoming SMTP source
For this part, you have to consider the semantics of your processing. For instance, does the
processing involve any state maintenance? If not, it is simply a "filtering" bolt, so you
can be really flexible about its performance. In fact, you can start with an initial parallelism
hint (the number of threads executing the filtering mechanism) and then scale up or down according
to the actual performance at runtime (the capacity reached by those bolts).

2. Add additional information (based on sender email)
This part looks like it is going to perform I/O in order to get more information (right?).
If yes, you need to consider different engineering approaches to retrieving this data.
If not, and you get the additional information from the actual mail, then again you can apply
the same idea as in Step 1.
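
One common way to limit the I/O cost of such a lookup is to cache results per sender inside
the bolt. The following is only a minimal Java sketch under assumptions: `SenderInfoCache` and
its `loader` function are hypothetical names standing in for whatever query is made against the
external store, and the code uses no Storm or Cassandra API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: small LRU cache in front of an external lookup. */
final class SenderInfoCache {
    // Assumption: stands in for the real query against the external store.
    private final Function<String, String> loader;
    private final Map<String, String> cache;

    SenderInfoCache(int capacity, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    /** Returns cached info for the sender, calling the loader only on a miss. */
    String lookup(String sender) {
        String info = cache.get(sender);
        if (info == null) {
            info = loader.apply(sender);
            cache.put(sender, info);
        }
        return info;
    }
}
```

The sketch is not thread-safe and never expires entries, so it assumes a single calling thread
and data that can tolerate being stale for as long as an entry stays cached.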

3. Create an XML based on this data, to inject in another solution
This part is tricky because it is not clear to me whether those XMLs contain aggregated information,
or whether they are built separately based on the input that each bolt receives. In the former case,
you will need to engineer your desired aggregate operations based on your application semantics.
In the latter, each bolt can produce its XML based on the input it received in a logical window
of operation (either purely time-based or tuple-based).
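
The logical window described above can be sketched in plain Java as a buffer that releases a
batch either when a tuple count is reached or when the oldest buffered item gets too old. This is
an illustration under assumptions, not Storm code: `WindowBatcher`, `add`, and `tick` are
hypothetical names, and timestamps are passed in by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: buffers items and releases them as one batch. */
final class WindowBatcher<T> {
    private final int maxItems;       // tuple-based limit
    private final long maxAgeMillis;  // time-based limit
    private final List<T> buffer = new ArrayList<>();
    private long windowStart = -1;    // arrival time of the first buffered item

    WindowBatcher(int maxItems, long maxAgeMillis) {
        this.maxItems = maxItems;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Adds one item; returns a batch once the count limit is reached. */
    Optional<List<T>> add(T item, long nowMillis) {
        if (buffer.isEmpty()) {
            windowStart = nowMillis;
        }
        buffer.add(item);
        return buffer.size() >= maxItems ? Optional.of(drain()) : Optional.empty();
    }

    /** Call periodically; flushes a partial batch once the window is too old. */
    Optional<List<T>> tick(long nowMillis) {
        boolean expired = !buffer.isEmpty() && nowMillis - windowStart >= maxAgeMillis;
        return expired ? Optional.of(drain()) : Optional.empty();
    }

    private List<T> drain() {
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        return batch;
    }
}
```

In a bolt, `add` would presumably be called for every incoming tuple and `tick` from a periodic
timer such as Storm's tick tuples, with the XML written whenever a batch is returned. The time
limit keeps a partially filled batch from waiting indefinitely when input slows down.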

The only issue: I want steps 1 and 2 to be as fast as possible, so I would create as many
bolts/tasks as possible.
But I want the XML to be as big as possible, so it has to gather information from the output of
multiple bolts.

With this logic, if I have 100 mails per second in the original input, I would want steps 1
and 2 to each work on the smallest possible number of emails so that they run faster.
But I still want to be able to have an XML that represents 10 000+ emails at the end.

I can’t think of a topology to address this.
Can someone give me some pointers on the best way to handle this?


Kind Regards,
Andréas Kalogéropoulos




--
Nick R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh