asterixdb-dev mailing list archives

From Torsten Bergh Moss <torsten.b.m...@ig.ntnu.no>
Subject Re: Micro-batch semantics for UDFs
Date Sun, 08 Mar 2020 16:15:30 GMT
Thanks for the feedback, and sorry for the late response, I've been busy with technical interviews.

Xikui, the ingestion framework described in sections 5 and 6 of your paper sounds perfect
for my project. I could have an intake job receiving a stream of tweets and an insert job
pulling batches of, say, 10k tweets from the intake job, preprocessing each batch, running
it through the neural network on the GPU to get the sentiments, and then writing the tweets
with their sentiments to a dataset. Unless there are any unforeseen bottlenecks, I think I
should be able to achieve throughputs of up to 20k tweets per second with my current setup.
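
For reference, the intake side I have in mind is along the lines of the standard feed DDL
below (the type, dataset, and adapter settings are just placeholders, and I assume the
decoupled framework replaces the plain CONNECT/START step with separate intake and insert
jobs):

CREATE TYPE TweetType AS OPEN { id: int64 };
CREATE DATASET Tweets(TweetType) PRIMARY KEY id;

/* Socket feed acting as the intake job; the insert job would drain it in batches. */
CREATE FEED TweetFeed WITH {
  "adapter-name": "socket_adapter",
  "sockets": "127.0.0.1:10001",
  "address-type": "IP",
  "type-name": "TweetType",
  "format": "adm"
};

CONNECT FEED TweetFeed TO DATASET Tweets;
START FEED TweetFeed;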

Is the code related to your project available on a specific branch or in a separate repo maybe?


Also, I believe a figure might be missing, judging by the line "The decoupled ingestion
framework is shown in Figure ??" early on page 8.

Best wishes,
Torsten

________________________________________
From: Xikui Wang <xikuiw@uci.edu>
Sent: Sunday, March 1, 2020 5:41 AM
To: dev@asterixdb.apache.org
Subject: Re: Micro-batch semantics for UDFs

Hi Torsten,

In case you want to customize the UDF framework to trigger your UDF on a
batch of records, you could consider reusing the PartitionHolder that I built
for the data enrichment part of my ingestion project. It takes a number of
records, processes them, and returns the processed results. I used it to
enable hash joins on feeds and to refresh reference data per batch. That
might be helpful. You can find more information here [1].

[1] https://arxiv.org/pdf/1902.08271.pdf

Best,
Xikui

On Thu, Feb 27, 2020 at 2:35 PM Dmitry Lychagin
<dmitry.lychagin@couchbase.com.invalid> wrote:

> Torsten,
>
> I see a couple of possible approaches here:
>
> 1. Make your function operate on arrays of values instead of primitive
> values.
> You'll probably need to have a GROUP BY in your query to create an array
> (using ARRAY_AGG() or GROUP AS variable).
> Then pass that array to your function, which would process it and also
> return a result array.
> Then unnest that output array to get the cardinality back (a rough sketch
> follows below).
>
> 2. Alternatively, you could try creating a new runtime for the ASSIGN
> operator that'd pass batches of input tuples to a new kind of function
> evaluator.
> You'll need to provide replacements for
> AssignPOperator/AssignRuntimeFactory.
> Also you'd need to modify InlineVariablesRule [1] so it doesn't inline
> those ASSIGNs.
>
> [1]
> https://github.com/apache/asterixdb/blob/master/hyracks-fullstack/algebricks/algebricks-rewriter/src/main/java/org/apache/hyracks/algebricks/rewriter/rules/InlineVariablesRule.java#L144
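>
> For option 1, assuming a dataset Tweets with a text field and an external
> UDF sentlib#classify_batch that takes an array of strings and returns an
> array of sentiment labels (all hypothetical names), a rough sketch of the
> query could look like:
>
> SELECT VALUE s
> FROM (
>     /* one UDF call per group, i.e. per micro-batch */
>     SELECT sentlib#classify_batch(ARRAY_AGG(t.text)) AS sentiments
>     FROM Tweets t
>     GROUP BY t.batch_no  /* some key that forms batches of the desired size */
> ) AS batched
> UNNEST batched.sentiments AS s;
>
> Pairing each label back with its tweet is the fiddly part; you'd either
> aggregate the ids alongside the texts and line the two arrays up again
> (e.g. with a positional UNNEST), or have the UDF return complete records.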
>
> Thanks,
> -- Dmitry
>
>
> On 2/27/20, 2:02 PM, "Torsten Bergh Moss" <torsten.b.moss@ig.ntnu.no>
> wrote:
>
>     Greetings everyone,
>
>
>     I'm experimenting a lot with UDFs utilizing neural network inference,
> mainly for classification of tweets. The problem is that running the UDFs in
> a one-at-a-time fashion severely under-exploits the capacity of GPU-powered
> NNs, and there is a certain latency associated with moving data from the CPU
> to the GPU and back every time the UDF is called, which leads to poor
> performance.
>
>
>     Ideally it would be possible to use the UDF to process records in a
> micro-batch fashion, letting them accumulate until a certain batch size is
> reached (as big as my GPU's memory can handle) before passing the data
> along to the neural network to get the outputs.
>
>
>     Is there a way to accomplish this with the current UDF framework
> (either in Java or Python)? If not, where would I have to start to develop
> such a feature?
>
>
>     Best wishes,
>
>     Torsten Bergh Moss
>
>
>