asterixdb-dev mailing list archives

From Xikui Wang <>
Subject Re: Micro-batch semantics for UDFs
Date Sun, 01 Mar 2020 04:41:03 GMT
Hi Torsten,

If you want to customize the UDF framework to trigger your UDF on a
batch of records, you could consider reusing the PartitionHolder that I
built for my ingestion enrichment project. It takes a number of records,
processes them, and returns the processed results. I used it to
enable hash joins on feeds and to refresh reference data per batch. That
might be helpful. You can find more information here [1].
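
For illustration only, the general "take a number of records, process them, return the results" pattern can be sketched in Python as below. This is not the PartitionHolder API; MicroBatcher, classify_batch, and the batch size are hypothetical names and values chosen for the example, and classify_batch merely stands in for a single GPU inference call.

```python
from typing import Callable, List, Sequence


class MicroBatcher:
    """Buffers records and hands them to batch_fn in groups of batch_size."""

    def __init__(self, batch_fn: Callable[[Sequence[str]], List[str]], batch_size: int):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_fn = batch_fn
        self.batch_size = batch_size
        self.buffer: List[str] = []

    def add(self, record: str) -> List[str]:
        """Add one record; return results if this filled a batch, else []."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            return self.flush()
        return []

    def flush(self) -> List[str]:
        """Process whatever is buffered (call once more at end of input)."""
        if not self.buffer:
            return []
        batch, self.buffer = self.buffer, []
        results = self.batch_fn(batch)
        if len(results) != len(batch):
            raise RuntimeError("batch_fn must return one result per record")
        return results


def classify_batch(texts: Sequence[str]) -> List[str]:
    # Hypothetical stand-in for one neural network call over the whole batch.
    return ["long" if len(t) > 10 else "short" for t in texts]


batcher = MicroBatcher(classify_batch, batch_size=3)
outputs: List[str] = []
for tweet in ["hi", "a much longer tweet", "ok", "last one here"]:
    outputs.extend(batcher.add(tweet))
outputs.extend(batcher.flush())  # drain the final, partially filled batch
```

The explicit flush at the end matters: without it, records left in a partially filled batch would never be processed.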



On Thu, Feb 27, 2020 at 2:35 PM Dmitry Lychagin wrote:

> Torsten,
> I see a couple of possible approaches here:
> 1. Make your function operate on arrays of values instead of primitive
> values.
> You'll probably need to have a GROUP BY in your query to create an array
> (using ARRAY_AGG() or GROUP AS variable).
> Then pass that array to your function which would process it and would
> also return a result array.
> Then unnest that output array to get the cardinality back.
> 2. Alternatively, you could try creating a new runtime for the ASSIGN
> operator that would pass batches of input tuples to a new kind of function
> evaluator.
> You'll need to provide replacements for
> AssignPOperator/AssignRuntimeFactory.
> You'd also need to modify InlineVariablesRule[1] so it doesn't inline
> those ASSIGNs.
> [1]
> Thanks,
> -- Dmitry
> On 2/27/20, 2:02 PM, "Torsten Bergh Moss" wrote:
>     Greetings everyone,
>     I'm experimenting a lot with UDFs that use neural network inference,
> mainly for classification of tweets. The problem is that running the UDFs in a
> one-at-a-time fashion severely under-exploits the capacity of GPU-powered
> NNs, and there is also a certain latency associated with moving data
> from the CPU to the GPU and back every time the UDF is called, resulting in
> poor performance.
>     Ideally it would be possible to use the UDF to process records in a
> micro-batch fashion, letting them accumulate until a certain batch size is
> reached (as big as my GPU's memory can handle) before passing the data
> along to the neural network to get the outputs.
>     Is there a way to accomplish this with the current UDF framework
> (either in Java or Python)? If not, where would I have to start to develop
> such a feature?
>     Best wishes,
>     Torsten Bergh Moss
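
As a rough sketch of the first approach Dmitry describes (build an array per group, pass the array to the function, then unnest the result array), the data flow can be emulated in plain Python as follows. This does not use any AsterixDB API; chunked, score_array, and apply_batched are hypothetical helpers, and fixed-size chunking merely stands in for whatever grouping the query would use.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunked(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group the input into arrays of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def score_array(texts: List[str]) -> List[int]:
    # Hypothetical array-in, array-out function: one call per array,
    # returning one result per input element, in the same order.
    return [len(t.split()) for t in texts]


def apply_batched(records: Iterable[str], size: int) -> Iterator[Tuple[str, int]]:
    for group in chunked(records, size):   # grouping step: build the array
        scores = score_array(group)        # single call on the whole array
        for record, score in zip(group, scores):
            yield record, score            # unnest: back to one row per record


rows = list(apply_batched(["a b", "c", "d e f", "g h", "i"], size=2))
```

The function has to return results in input order so that the unnest step can pair each output with its original record and restore the original cardinality.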
