airflow-dev mailing list archives

From Alex Guziel <alex.guz...@airbnb.com.INVALID>
Subject Re: [Discuss] Airflow sensor optimization
Date Wed, 06 Mar 2019 21:54:11 GMT
The sensor-service idea seems to open the door to making sensors
pub/sub-style where possible. For example, for Hive, you could keep an
in-memory registry of the partitions to sense for, and tail the audit log
to see when they are populated, instead of polling.
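
A minimal sketch of the registry idea above, in plain Python with no Airflow or Hive dependency. The class name `PartitionRegistry`, the callback interface, and the `ADD_PARTITION <table>/<partition>` audit-line format are all assumptions made for illustration; a real Hive audit log looks different and would need its own parser.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, List


class PartitionRegistry:
    """Hypothetical in-memory map from partition name to waiting callbacks."""

    def __init__(self) -> None:
        self._waiters: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def register(self, partition: str, callback: Callable[[str], None]) -> None:
        """Record that a sensor is waiting for `partition`."""
        self._waiters[partition].append(callback)

    def pending(self) -> List[str]:
        """Return the partitions that still have waiters, sorted by name."""
        return sorted(self._waiters)

    def consume(self, audit_lines: Iterable[str]) -> int:
        """Feed audit-log lines in; return how many callbacks were fired."""
        fired = 0
        for line in audit_lines:
            # Assumed line format: "ADD_PARTITION <table>/<partition_spec>"
            parts = line.strip().split(maxsplit=1)
            if len(parts) != 2 or parts[0] != "ADD_PARTITION":
                continue  # not a partition event; ignore
            # pop() removes the entry, so each waiter is notified only once
            for callback in self._waiters.pop(parts[1], []):
                callback(parts[1])
                fired += 1
        return fired
```

A caller would register one callback per waiting sensor and pass newly tailed log lines to `consume`; nothing is polled per partition.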

On Wed, Mar 6, 2019 at 1:51 PM Alex Guziel <alex.guziel@airbnb.com> wrote:

> Smart sensor seems like a good idea, but I wonder how much performance
> will be improved in practice. And of course, one must think about sharding
> and such.
>
> I'm not sure how helpful rescheduling sensors is, since it seems to add
> scheduler and DB load, which is already a bottleneck.
>
> On Wed, Mar 6, 2019 at 12:43 PM Yingbo Wang <ybwang@gmail.com> wrote:
>
>> I would still like to get some feedback on the batch sensor/smart sensor
>> idea after viewing the sensor rescheduling PR, since the reschedule mode
>> does not reduce the number of worker processes for sensors. The batch
>> sensor idea is proposed for this purpose and should work well with
>> reschedule mode.
>>
>> On Wed, Mar 6, 2019 at 11:30 AM Yingbo Wang <ybwang@gmail.com> wrote:
>>
>> > Wow, great work from Seelmann! Thanks, Fokko, for letting us know about
>> > it. We are super happy to have this feature.
>> >
>> > On Wed, Mar 6, 2019 at 11:24 AM Driesprong, Fokko <fokko@driesprong.frl>
>> > wrote:
>> >
>> >> Thanks for bringing this up. I've added a comment on the Wiki:
>> >>
>> >>
>> >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization
>> >>
>> >> Have you looked into the work by Seelmann? Recently he introduced the
>> >> ability to reschedule sensors. When rescheduling, the slot will be given
>> >> back to the scheduler after a poke operation. Therefore the slot won't be
>> >> occupied all the time. The details are in the PR
>> >> https://github.com/apache/airflow/pull/3596
>> >>
>> >> I would propose to make this the default behavior in Airflow 2.0.
>> >>
>> >> Cheers, Fokko
>> >>
>> >> On Wed, Mar 6, 2019 at 15:32, Yingbo Wang <ybwang@gmail.com> wrote:
>> >>
>> >> > Hi,
>> >> >
>> >> > I would like to open an AIP for Airflow sensor optimization.
>> >> >
>> >> >
>> >> > *Motivation*:
>> >> >
>> >> > Low efficiency in Airflow Sensor Implementation
>> >> >
>> >> > Sensors are a special kind of operator that will keep running until a
>> >> > certain criterion is met. Examples include a specific file landing in
>> >> > HDFS or S3, a partition appearing in Hive, or a specific time of the day.
>> >> > Sensors are derived from BaseSensorOperator and run a poke method at a
>> >> > specified poke_interval until it returns True.
>> >> >
>> >> > Sensor tasks are inefficient because, in the current design, we spawn
>> >> > a separate worker process for each partition sensor. This worker might
>> >> > last a long time, until the target partition is available. When many
>> >> > sensor tasks need to run within certain time limits, we have to
>> >> > allocate a lot of resources to have enough workers for the sensor
>> >> > tasks.
>> >> >
>> >> > *Idea:*
>> >> >
>> >> > We propose two approaches that could address this issue:
>> >> > batch-sensor and smart-sensor.
>> >> >
>> >> >
>> >> >
>> >> > Batch-sensor
>> >> >
>> >> > The basic idea of batch-sensor is to process sensor tasks in batches
>> >> > to save resources. While running, a batch-sensor takes N partition
>> >> > sensor requests as input and pokes those N partitions periodically.
>> >> > If the batch-sensor finds that the criterion of some sensor task is
>> >> > met, it updates the database for that sensor task.
>> >> >
>> >> >
>> >> > To do this, we need to create a sensor base class called ‘batchable’
>> >> > and make all sensors inherit from this base class. We also need to
>> >> > change the scheduler's behavior for batchable sensor tasks. The
>> >> > scheduler will find as many batchable sensor tasks as possible and run
>> >> > those tasks in a batch.
>> >> >
>> >> >
>> >> > Smart-sensor
>> >> >
>> >> > Smart-sensor is an improvement on top of batch-sensor.
>> >> >
>> >> > The idea of smart-sensor is that its worker process runs like a
>> >> > service. To do this, we need to persist sensor details in the Airflow
>> >> > DB. The worker process periodically queries the task-instance table to
>> >> > find sensor tasks, pokes the metastore, and updates the task-instance
>> >> > table when it detects that a certain partition or file has been
>> >> > created.
>> >> >
>> >>
>> >
>>
>
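
For readers of the archive, the batch-sensor idea quoted above can be sketched roughly as follows. This is not code from the AIP or from Airflow: `run_batch_sensor`, its parameters, and the `mark_success` callback (standing in for the database update) are hypothetical names, and each sensor request is reduced to a zero-argument poke function that returns True once its criterion is met.

```python
import time
from typing import Callable, Dict, Set


def run_batch_sensor(
    requests: Dict[str, Callable[[], bool]],
    mark_success: Callable[[str], None],
    poke_interval: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[str]:
    """Poke N sensor requests from one process.

    Returns the ids of the requests that were still unmet when the
    timeout expired (an empty set if every request succeeded).
    """
    pending = dict(requests)  # copy so the caller's dict is not mutated
    deadline = clock() + timeout
    while pending:
        # One pass pokes every outstanding request once.
        for task_id, poke in list(pending.items()):
            if poke():
                mark_success(task_id)  # stand-in for the DB update
                del pending[task_id]
        if not pending or clock() >= deadline:
            break
        sleep(poke_interval)
    return set(pending)
```

The point of the sketch is that one worker process serves all N requests, instead of one process per sensor. A smart-sensor along the lines described would differ mainly in loading `requests` from the task-instance table on each pass rather than receiving a fixed batch.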
