nifi-dev mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: Common scheduler and add-hock thread creation
Date Tue, 17 Nov 2015 02:39:10 GMT
So back in the day...

Here is the thought process behind how it works today, at a high level
and speaking in generalities.  Developers of extensions, and that
primarily means processors, begin process sessions.  In a process
session a processor can access, create, destroy zero or more flow
files and route them to relationships.  They do not dictate how often
they run or when they run.  The Flow Controller does that.  When it
decides to invoke them it does so by calling the appropriate method.
The thread given in that call is the thread they can use to operate on
that process session.  When they're done with that session they should be
well-behaved and give the thread back to the controller.  That is
it.  They have no control over threads because generally they don't
need them.
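
As a rough illustration of the model described above, here is a minimal sketch in plain Java. It does not use the NiFi API: `Component`, `Controller`, and `runOnce` are hypothetical names invented for this example, and the real Flow Controller's scheduling is certainly more involved. The only point shown is that the controller owns the threads, lends one to a component per invocation, and gets it back when the component returns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ControllerOwnedThreads {

    /** Hypothetical stand-in for a processor: it only works when invoked. */
    interface Component {
        void onTrigger();
    }

    /** Hypothetical stand-in for the controller: it owns the pool and decides when components run. */
    static final class Controller {
        private final ExecutorService pool;

        Controller(int threads) {
            this.pool = Executors.newFixedThreadPool(threads);
        }

        /** Lends a pooled thread to each component for exactly one invocation. */
        void runOnce(List<Component> components) {
            for (Component c : components) {
                pool.execute(c::onTrigger);
            }
        }

        /** Stops accepting work and waits for in-flight invocations to finish. */
        void shutdown() throws InterruptedException {
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    /** Runs three invocations on a two-thread pool and returns how many completed. */
    static int demo() throws InterruptedException {
        AtomicInteger invocations = new AtomicInteger();
        // The component does its work and returns, which hands the thread back.
        Component counter = invocations::incrementAndGet;
        Controller controller = new Controller(2);
        controller.runOnce(Arrays.asList(counter, counter, counter));
        controller.shutdown();
        return invocations.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("invocations = " + demo());
    }
}
```

Note that the component never creates, names, or sleeps a thread; it simply returns when its unit of work is done.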

Now, some processors are special and they may be written by a
developer that needs greater control of their own threading model,
like web servers for instance.  That is ok but it is also outside of
what is described above.  It is really 'in addition to' what is
described above.  The framework supported path for dealing with
FlowFiles (which is what NiFi is for) is only as above.  It is 'ok'
for these special cases but so far nothing practical has risen to the
level of it needing a framework resolution.  There have been glimmers
but nothing that has really shown to need a resolution as far as
threading goes.  We've considered having different managed thread
pools and then operators could assign a given component on the flow to
those pools.  This way they can preserve a pool for 'sources' vs
'mid-stream' vs 'delivery' processors for example.  Again, this never
reached the level of needing a framework solution.
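
The managed-pools idea mentioned above was never built, so the following is only a guess at its general shape, written in plain Java. `AssignedPools`, `Role`, and the `-worker` thread names are hypothetical and are not part of NiFi; the sketch just shows one separate, bounded pool per category, so that work assigned to one category cannot consume another category's threads.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class AssignedPools {

    /** Hypothetical categories an operator might assign a component to. */
    enum Role { SOURCE, MID_STREAM, DELIVERY }

    private final Map<Role, ExecutorService> pools = new EnumMap<>(Role.class);

    AssignedPools(int threadsPerPool) {
        for (Role role : Role.values()) {
            // Each role gets its own fixed-size pool with recognizable thread names.
            pools.put(role, Executors.newFixedThreadPool(
                    threadsPerPool, task -> new Thread(task, role + "-worker")));
        }
    }

    /** Runs the task on the pool reserved for the given role. */
    <T> Future<T> submit(Role role, Callable<T> task) {
        return pools.get(role).submit(task);
    }

    void shutdown() {
        for (ExecutorService pool : pools.values()) {
            pool.shutdown();
        }
    }

    /** Returns the name of the thread that ran a task assigned to DELIVERY. */
    static String demo() throws Exception {
        AssignedPools assigned = new AssignedPools(2);
        try {
            return assigned
                    .submit(Role.DELIVERY, () -> Thread.currentThread().getName())
                    .get(5, TimeUnit.SECONDS);
        } finally {
            assigned.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```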

There have also been cases where folks want to have processors operate
and they do not do *anything* with FlowFiles at all.  These are for
what is known as the 'NiFi-As-A-Fancy-Cron' tool pattern.  We don't
need to support this one.

Now I can definitely conceive of ways to build processors or flows
which will create difficulty in NiFi.  I am ok with that personally.

Thanks
Joe


On Mon, Nov 16, 2015 at 8:50 PM, Oleg Zhurakousky
<ozhurakousky@hortonworks.com> wrote:
> Tony, thanks for your input. At least we have some discussion going. See inline for
> the rest.
>
>> On Nov 16, 2015, at 8:22 PM, Tony Kurc <trkurc@gmail.com> wrote:
>>
>> so, I believe threads in a processor in nifi are much, much easier than
>> general threading in many other applications. There are defined boundaries
>> on when a processor is built and torn down. Pretty much any state in the
>> middle is up to the processor. you know when resources need to be stood up.
>> you know when they need to be torn down.
> Generally true, and I’d agree there is not much one can do to stop users doing what
> they want to do regardless of how damaging it may be to the rest of the system.
>>
>> Because threads have a localized scope, I'm not sure a global pool would be
>> a help. If a processor needs higher throughput or shorter latency, now, the
>> problem is generally isolated and there is a nice little cream center to
>> optimize. If you're blocked on a global pool of threads because some other
>> processor consumed all the threads in a pool, well, suddenly, your
>> performance is no longer a localized problem.
>>
> This argument is argumentative ;)
> 1. What if I’ve saturated all my cores in my localized Processor’s thread pool with
> things like while (true){}? Then it really doesn’t matter what the rest of the framework
> does, the system is hosed. So blockage in this case comes from, let’s just call it, a malicious
> processor and not the global thread pool. So, in the end it’s a bit of a general discipline
> question ;)
> 2. So in this case one of the best practices could be taken right from Brian’s book,
> which states that tasks should be as short-lived as possible. Any repeats and retries should
> be handled by rerunning/rescheduling a task instead of spinning in a loop inside the task.
> So with a global Scheduler exposed via the context, or something that each Processor, Service, etc.
> sees, we can have a shared thread pool. We can even have ControllerServices as thread pools.
> Yes, that would take some serious code review and general discipline from the developers,
> but the benefit would be proportional as well.
>
>> because the common case is "don't use threads" (not everyone is going to
>> build a complex service, contribute to the core framework or need threads
>> in their processor) I actually think code review is a good way to shake out
>> some poor decisions. because optimizing the threads in a processor for a
>> use case is a specialized task (the processor writer knows the critical
>> sections and bottlenecks), I'm not sure whether there are massive strides
>> that can be made, but I could be wrong. And we'll always have a weird edge
>> case of some library that wants to do threads its own way that we're trying
>> to integrate.
>>
>> My guess is a lot of the behavior you mention above is because at the
>> moment, performance isn't needed in that part of code and it was simpler
>> for the author. Or it's a bug!
> I would probably use the “performance isn't needed” argument, but in a hypothetical world of
> thousands of processors each creating Threads, the so-called ‘simplicity’ could manifest
> itself as a bug.
>
> I don’t want to generalize too much at the moment as it is much easier to discuss a concrete
> case (we have plenty). But I really wanted to get a discussion going on this as I am still studying
> the code base.
>
> Cheers
> Oleg
>
>>
>>
>>
>> On Mon, Nov 16, 2015 at 8:01 PM, Oleg Zhurakousky <
>> ozhurakousky@hortonworks.com> wrote:
>>
>>> Taking liberties - so let me throw out a few examples. I am sure you’d agree that
>>> Thread creation and management is expensive, hard, and error prone,
>>> hence the new java.util.concurrent and all the goodies in it.
>>> - There is a patch currently in the queue where there is a creation of new
>>> Thread() and then starting it. Is it necessary? Could we reuse the thread
>>> from the common pool?
>>> - We have many places where we have Thread.sleep(..) and in fact do sleep
>>> for a considerable amount of time. That thread lies dormant where it could
>>> actually be doing something. Is it necessary?
>>>
>>> Cheers
>>> Oleg
>>>
>>>
>>>> On Nov 16, 2015, at 7:52 PM, Tony Kurc <trkurc@gmail.com> wrote:
>>>>
>>>> the issue with a best practices guide on this subject is it will be
>>>> dominated by edge cases. The common case should be "don't produce any
>>>> threads".
>>>>
>>>> That being said, I commented on a jira somewhere about LinkedBlockingQueues
>>>> used in so many producer/consumer style processors and possibly needing a
>>>> library to have some consistency in using those queues in a
>>>> thread-safe manner.
>>>>
>>>> Also, I'm not quite sure of what you mean by taking liberties?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 16, 2015 at 7:39 PM, Oleg Zhurakousky <
>>>> ozhurakousky@hortonworks.com> wrote:
>>>>
>>>>> Guys
>>>>>
>>>>> I am noticing many modules where we have things like "new
>>>>> Thread(..).start()”, creation of new executors and schedulers,
>>>>> Thread.sleep(..), etc. I am sure many would agree that taking such
>>>>> liberties with Threads will have consequences (not IF but WHEN).
>>>>> On several threads several of us mentioned a “must read” for anyone who is
>>>>> getting into concurrent code -
>>>>> http://ptgmedia.pearsoncmg.com/images/9780321349606/samplepages/9780321349606.pdf
>>>>> and indeed we can/should definitely grab some best practices from this book.
>>>>>
>>>>> At least we can start from what’s our strategy around thread management
>>>>> for NAR developers? Basically should/should not a user create Threads,
>>>>> Executors, Schedulers etc.
>>>>>
>>>>> Cheers
>>>>> Oleg
>>>>>
>>>
>>>
>
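
The suggestion quoted above, to reschedule a task rather than spin or sleep inside it, can be sketched with the standard `ScheduledExecutorService`. This is plain Java and not NiFi code: `RescheduleNotSpin`, `runWithRetries`, and the simulated failure count are assumptions made for illustration. Each attempt is short-lived; on failure it schedules its own next attempt and returns, so the worker thread is free between attempts instead of sitting in `Thread.sleep(..)`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RescheduleNotSpin {

    /**
     * Simulates a task that fails a given number of times before succeeding.
     * Returns the total number of attempts that were made.
     */
    static int runWithRetries(int failuresBeforeSuccess, long delayMillis) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        CompletableFuture<Integer> done = new CompletableFuture<>();

        Runnable attempt = new Runnable() {
            @Override
            public void run() {
                int n = attempts.incrementAndGet();
                if (n > failuresBeforeSuccess) {
                    done.complete(n); // success: report and stop
                } else {
                    // failure: queue the next attempt and return immediately,
                    // rather than sleeping or looping on this thread
                    scheduler.schedule(this, delayMillis, TimeUnit.MILLISECONDS);
                }
            }
        };

        scheduler.execute(attempt);
        try {
            return done.get(5, TimeUnit.SECONDS);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("attempts = " + runWithRetries(2, 10));
    }
}
```

With two simulated failures, the task runs three times in total, and the single scheduler thread is idle and reusable during each delay.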
