storm-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephen Powis <spo...@salesforce.com>
Subject Re: Using Storm to parse emails and creates batches
Date Tue, 01 Dec 2015 11:45:14 GMT
Yep, sounds like you got it......you'd want to use field grouping and group
on a field that contains the hash.  Then every tuple that has that field
with the identical hashes would get sent to the same bolt instance.

On Tue, Dec 1, 2015 at 8:23 PM, Kalogeropoulos, Andreas <
Andreas.Kalogeropoulos@emc.com> wrote:

> Making sure that duplicates make it in the same XML file (third bolt).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:59 AM
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> So you want to eliminate duplicates or make sure that duplicates make it
> into the same XML file (third bolt)?
>
>
>
> On Tue, Dec 1, 2015 at 7:48 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> Imagine that the spout is providing me 300 000 emails per hour.
>
> The first bolt will parse/analyze the information (from, to, cc, subject,
> object, date, has of attachments, …  , and probably will find the same hash
> for some attachments (someone forwarding an email).
>
>
>
> The last bolt will create an XML based on all this information, but if I
> can have the tuples containing the same attachment (based on hash) in the
> same XML, I can actually apply a dedup logic : having multiple lines in my
> xml pointing to the same file
>
>
>
> Does this make more sense ?
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 11:36 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> I'm not sure I follow/understand your question or what you're trying to do.
>
>
>
> On Tue, Dec 1, 2015 at 7:28 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> You are right. Sorry for making you state the obvious J.
>
>
>
> Last question : If my spout has incoming information that I want to have
> in the same last bolt (the one creating the XML) for deduplication logic,
> what is the best way to achieve this ?  My instinct says to try to work
> with Fields grouping and the correct key (probably conversation since I am
> working with emails).
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Tuesday, December 01, 2015 10:27 AM
>
>
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> If you are using Storm's guaranteed message processing
> <http://storm.apache.org/documentation/Guaranteeing-message-processing.html>
> there is no need to 'persist' the collection anywhere other than in
> memory.  IE List<Tuple> myListOfTuples = new ArrayList<Tuple>();  If the
> third bolt crashes and loses its in memory collection, after
> Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
> <http://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS>
> the tuples will timeout and be replayed thru your entire storm topology and
> your collection will be repopulated
>
>
>
> On Tue, Dec 1, 2015 at 5:15 PM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello Stephen,
>
>
>
> I think you got I correctly. Thanks a lot for the idea.
>
> If you have seen limitations, please send the disclaimers J . For
> example, how did you handle persistence of this collection ? If the third
> bolt failed while populating the collection (size and time has not been
> reached) we just lost everything, so I need to have a status loopback of
> what was really output. Right ?
>
>
>
> Of course, if you can send me the code of your third bolt (especially the
> collection handling), I’ll be grateful.
>
> In all cases, thanks a lot for your help, even without the code, you
> really give me example advice, and now I can start building something.
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
> *From:* Stephen Powis [mailto:spowis@salesforce.com]
> *Sent:* Monday, November 30, 2015 5:55 PM
> *To:* user@storm.apache.org
> *Subject:* Re: Using Storm to parse emails and creates batches
>
>
>
> From what I understand from your description, you want bolt 3 to collect
> results from multiple tuples and build a single xml for them.  We've done
> this by essentially doing the following:
>
>
>
> Bolt 3 has a collection of tuples.  As a tuple comes in, we add it to the
> collection and check the size of the collection.  Once the size of the
> collection exceeds some number, we then process all of the tuples in one
> go, and then ACK all of them after the processing completes.
>
> Building on that, we've implemented an additional constraint on time.  If
> the collection size > N OR if we've waited more than X seconds, process the
> batch.  This way your output won't stall out if your topology has a lull in
> data being ingested.
>
> And then lastly, there's a corner case where say 10 tuples come in and get
> held by our collection but then no other tuples come in for a long period
> of time.  If no tuples enter, that means the size and timeout checks are
> never executed and your bolt will hold onto those tuples for a long time
> (potentially causing timeouts).  To handle this, we made use of tick
> tuples.  Tick tuples essentially allow you to you to send a special tuple
> to your bolt every Y seconds.  We use that to trigger checking the time
> constraint is checked on a regular basis (example being send a tick tuple
> every 1, 5, or 10 seconds)
>
>
>
> On Tue, Dec 1, 2015 at 1:42 AM, Kalogeropoulos, Andreas <
> Andreas.Kalogeropoulos@emc.com> wrote:
>
> Hello,
>
>
>
> I want to use Storm to do three things :
>
> 1.       Parse emails data (from/ to / cc/ subject ) from incoming SMTP
> source
>
> 2.       Add additional information (based on sender email)
>
> 3.       Create an XML based on this data, to inject in another solution
>
>
>
> Only issue, I want step 1 (and 2) to be as fast as possible so creating
> the maximum bolts/tasks possible,
>
> But I want the XML to be as big as possible so gathering information for
> multiple output of bolts.
>
>
>
> In this logic, I fi have 100 mails per second in original input, I would
> want to have step1 and step 2 to work on the smallest number of emails to
> do it faster.
>
> But I still want to be able to have an XML that represent 10 000+ emails
> at the end.
>
>
>
> I can’t think of topology to address this.
>
> Can someone give me some pointers to the best way to handle this ?
>
>
>
>
>
> Kind Regards,
>
> *Andréas Kalogéropoulos*
>
>
>
>
>
>
>
>
>
>
>

Mime
View raw message