storm-user mailing list archives

From Vladi Feigin <vladi...@gmail.com>
Subject Re: Is Storm the right tool for me?
Date Mon, 01 Dec 2014 13:03:23 GMT
Sorry, I meant Spark.

On Mon, Dec 1, 2014 at 11:38 AM, Stadin, Benjamin <
Benjamin.Stadin@heidelberg-mobil.com> wrote:

> Thanks for your response.
> Shark doesn’t seem to be something I want / need. The custom data handler
> is performance-critical, file-based (a SQLite file) and already highly
> optimized (e.g., file sync is off). And this db is associated with a
> single user session and should not be replicated, but should rather be a
> local temporary source existing only on the executing node – otherwise
> replicating these files will become a bottleneck. But maybe this is still
> possible to configure with Shark?
>
>
> From: Vladi Feigin <vladif86@gmail.com>
> Reply-To: "user@storm.apache.org" <user@storm.apache.org>
> Date: Monday, 1 December 2014 06:16
> To: "user@storm.apache.org" <user@storm.apache.org>
> Subject: Re: Is Storm the right tool for me?
>
> Hi,
> It sounds to me like you need an offline ETL process (MR/Shark) to get the
> processed data into the db.
> Storm fits use cases where you have a continuous data stream and need
> low-latency processing.
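>
> A minimal sketch (not from the original message) of what that continuous
> model looks like in the Storm 0.9.x Java API; MySpout and MyBolt are
> hypothetical placeholders for whatever source and processing you have:
>
> import backtype.storm.Config;
> import backtype.storm.LocalCluster;
> import backtype.storm.topology.TopologyBuilder;
>
> public class MinimalTopology {
>     public static void main(String[] args) {
>         TopologyBuilder builder = new TopologyBuilder();
>         // MySpout / MyBolt are hypothetical placeholders
>         builder.setSpout("events", new MySpout());       // continuous source, e.g. a queue
>         builder.setBolt("process", new MyBolt(), 4)      // 4 parallel low-latency workers
>                .shuffleGrouping("events");
>         // Runs until killed -- the topology owns the stream; it is not a batch job.
>         new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
>     }
> }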
> On 1 Dec 2014 04:26, "Stadin, Benjamin" <
> Benjamin.Stadin@heidelberg-mobil.com> wrote:
>
>> Hi all,
>>
>> I need some advice on whether Storm is the right tool for my purpose. My
>> requirements share commonalities with "big data", workflow coordination and
>> "reactive" event-driven data processing (as in, for example, Haskell Arrows),
>> which doesn’t make it any easier to find the right tool set.
>>
>> To explain my needs it’s probably best to give an example scenario:
>>
>>    - A user uploads small files (typically 1-200 files, file size
>>    typically 2-10MB per file)
>>    - Files should be converted in parallel on available nodes. The
>>    conversion is actually done via native tools, so there is not so much big
>>    data processing required, but rather dynamic parallelization (for example,
>>    splitting the conversion step into as many conversion tasks as there are
>>    files; a sketch of this fan-out follows this list). The conversion
>>    typically takes between several minutes and a few hours.
>>    - The converted files are gathered and stored in a single database
>>    (containing geometries for rendering)
>>    - Once the db is ready, a web map server is (re-)configured and the
>>    user can make small updates to the data set via a web UI.
>>    - … Some other data processing steps which I leave away for brevity …
>>    - There will be initially only a few concurrent users, but the system
>>    shall be able to scale if needed
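>>
>> A rough sketch of that fan-out, assuming Storm 0.9.x: an upstream component
>> emits one tuple per uploaded file (the "fileUrl" field name is my
>> assumption), and a converter bolt, scaled via its parallelism hint, pulls
>> the file from shared storage and shells out to the native tool. Conversions
>> lasting hours would need topology.message.timeout.secs raised accordingly.
>>
>> import java.io.File;
>> import backtype.storm.topology.BasicOutputCollector;
>> import backtype.storm.topology.OutputFieldsDeclarer;
>> import backtype.storm.topology.base.BaseBasicBolt;
>> import backtype.storm.tuple.Fields;
>> import backtype.storm.tuple.Tuple;
>> import backtype.storm.tuple.Values;
>>
>> // One conversion per executor; builder.setBolt("convert", new ConvertBolt(), N)
>> // allows N conversions in flight across the cluster.
>> public class ConvertBolt extends BaseBasicBolt {
>>     @Override
>>     public void execute(Tuple input, BasicOutputCollector collector) {
>>         String fileUrl = input.getStringByField("fileUrl");   // one tuple per file
>>         File local = fetchToLocalTmp(fileUrl);                // pull from the shared place
>>         runNativeConverter(local);                            // wrap the native CLI tool
>>         collector.emit(new Values(fileUrl, local.getPath())); // downstream bolt gathers into the db
>>     }
>>
>>     @Override
>>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>         declarer.declare(new Fields("fileUrl", "convertedPath"));
>>     }
>>
>>     private File fetchToLocalTmp(String url) {
>>         // site-specific: e.g. an HTTP GET or NFS copy into the worker's local tmp dir
>>         throw new UnsupportedOperationException("site-specific");
>>     }
>>
>>     private void runNativeConverter(File f) {
>>         // site-specific: e.g. new ProcessBuilder("converter", f.getPath()).start().waitFor()
>>         throw new UnsupportedOperationException("site-specific");
>>     }
>> }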
>>
>> My current thoughts:
>>
>>    - I should avoid uploading files into the distributed storage during
>>    conversion; instead, each conversion filter should probably download the
>>    file it is actually converting from a shared place. Otherwise it's bad
>>    for scalability (too many redundant copies of the same temporary files
>>    if there are many concurrent users and many cluster nodes).
>>    - Apache Oozie seems an option to chain together my pipes into a
>>    workflow. But is it a good fit with Storm?
>>    - Apache Crunch seems to make it easy to dynamically parallelize
>>    tasks (Oozie itself can't do this). But I may not need Crunch after all
>>    if I have Storm, and it also doesn't seem to fit my last problem below.
>>    - The part that causes me the most headache is the user-interactive
>>    db update: I'm considering Kafka as a message bus to broker between the
>>    web UI and a custom db handler (nb, the db is a SQLite file). Here I see
>>    Storm serving my purpose better than Spark (Streaming), since it should
>>    give immediate update responsiveness, and the handler is probably best
>>    implemented as a long-running, continuing task. But does Storm allow
>>    creating such long-running tasks dynamically, so that when another (web)
>>    user starts a new task, a new long-running task is created? Also, is it
>>    possible to identify a running task, so that a long-running task can be
>>    bound to a session (db handler working on local db updates until the
>>    task is done)? (See the sketch after this list.)
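>>
>> As far as I know, Storm topologies are long-running but are not normally
>> spawned per user. A common pattern instead, sketched here under my own
>> assumptions (the storm-kafka spout of that era, a "db-updates" topic, and
>> messages carrying a session id), is a fieldsGrouping on the session id:
>> every update for one session then reaches the same bolt task, which can
>> keep that session's SQLite file local until the session ends. ParseBolt
>> and SqliteUpdateBolt are hypothetical names.
>>
>> import backtype.storm.spout.SchemeAsMultiScheme;
>> import backtype.storm.topology.TopologyBuilder;
>> import backtype.storm.tuple.Fields;
>> import storm.kafka.KafkaSpout;
>> import storm.kafka.SpoutConfig;
>> import storm.kafka.StringScheme;
>> import storm.kafka.ZkHosts;
>>
>> public class SessionUpdateTopology {
>>     public static void main(String[] args) {
>>         // Topic name, ZooKeeper address and ids below are placeholders.
>>         SpoutConfig cfg = new SpoutConfig(
>>                 new ZkHosts("zookeeper:2181"), "db-updates", "/kafka-spout", "db-updates-reader");
>>         cfg.scheme = new SchemeAsMultiScheme(new StringScheme());
>>
>>         TopologyBuilder builder = new TopologyBuilder();
>>         builder.setSpout("updates", new KafkaSpout(cfg));
>>         builder.setBolt("parse", new ParseBolt())                  // hypothetical: extracts "sessionId"
>>                .shuffleGrouping("updates");
>>         builder.setBolt("sqlite", new SqliteUpdateBolt(), 8)       // all tuples of a session hit the
>>                .fieldsGrouping("parse", new Fields("sessionId"));  // same task => its SQLite stays local
>>     }
>> }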
>>
>>
>> ~Ben
>>
>
