[ https://issues.apache.org/jira/browse/PHOENIX5491?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Chen Feng updated PHOENIX5491:

Description:
WhereOptimizer runs very slow in parsing sql like "where A in (a1, a2, ..., a_N) and B = X"
when N is very large. In our environment, it runs > 90s for N = 140000.
This is because for the same instance of InListExpression, InListExpression.hashCode() is
calculate every time, where InListExpression.values is traversed.
In previous sql, InListExpression.hashCode() will be called N times, and InListExpression.values
has N elements. Therefore the total complexity is N^2.
Saving the hashCode of InListExpression can reduce the complexity to N. The test shows the
large sql can be finished within 5 seconds.
was:
In WhereOptimizer.pushKeyExpressionsToScan(), has a line of code: "extractNodes.addAll(nodesToExtract)"
When executing sqls like "select * from ... where A in (a1, a2, ..., a_n) and B = X", saying
A in N (N > 100,000) elements, previous code execution will slow (> 90s in our environment).
This is because in such case, extractNodes is a HashSet, nodesToExtract is a List with N
InListExpression (the N InListExpressions are the same instance), each InListExpression.values
has N elements as well.
HashSet.addAll(list<InListExpression>) will call N times of InListExpression.hashCode().
Each time, InListExpression.hashCode() will calculate hashCode for every value. Therefore,
the time complexity will be N^2.
A simple way to solve it is to remember of the hashCode of InListExpression and returns it
directly if calculated once. The query will finish in 5 seconds.
> Improve performance of InListExpression.hashCode
> 
>
> Key: PHOENIX5491
> URL: https://issues.apache.org/jira/browse/PHOENIX5491
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 5.0.0, 4.14.3
> Reporter: Chen Feng
> Assignee: Chen Feng
> Priority: Critical
> Attachments: PHOENIX5491v2.patch, PHOENIX5491.patch
>
>
> WhereOptimizer runs very slow in parsing sql like "where A in (a1, a2, ..., a_N) and
B = X" when N is very large. In our environment, it runs > 90s for N = 140000.
> This is because for the same instance of InListExpression, InListExpression.hashCode()
is calculate every time, where InListExpression.values is traversed.
> In previous sql, InListExpression.hashCode() will be called N times, and InListExpression.values
has N elements. Therefore the total complexity is N^2.
> Saving the hashCode of InListExpression can reduce the complexity to N. The test shows
the large sql can be finished within 5 seconds.

This message was sent by Atlassian Jira
(v8.3.4#803005)
