ignite-dev mailing list archives

From Andrey Mashenkov <andrey.mashen...@gmail.com>
Subject Re: Text queries/indexes (GridLuceneIndex, @QueryTextFiled)
Date Fri, 04 Oct 2019 15:05:53 GMT
Yuriy,

Just FYI we have a review checklist [1], coding guidelines [2].
To test a PR someone can use TeamCity [3] or TeamCityBot project [4].

The latter option (TCBot) makes test validation much easier, and you do not
have to deal with flaky tests yourself.
Long story short: you can trigger tests for the PR from the Bot page and then
have the Bot attach the results to the Jira ticket if you find them
acceptable.

So, the next step is to run the tests and check that everything is OK.

[1] https://cwiki.apache.org/confluence/display/IGNITE/Review+Checklist
[2] https://cwiki.apache.org/confluence/display/IGNITE/Coding+Guidelines
[3] https://ci.ignite.apache.org/
[4] https://mtcga.gridgain.com/


On Fri, Oct 4, 2019 at 3:10 PM Yuriy Shuliga <shuliga@gmail.com> wrote:

> Andrew,
>
> I have corrected the PR according to your notes. Please review.
> What are the next steps to get it merged?
>
> Y.
>
> Thu, Oct 3, 2019 at 17:47 Andrey Mashenkov <andrey.mashenkov@gmail.com>
> wrote:
>
> > Yuri,
> >
> > I'm done with the review.
> > Nothing serious found, just a trivial compatibility bug.
> >
> > On Thu, Oct 3, 2019 at 3:54 PM Yuriy Shuliga <shuliga@gmail.com> wrote:
> >
> > > Denis,
> > >
> > > Thank you for your attention to this.
> > > As of now, the https://issues.apache.org/jira/browse/IGNITE-12189 ticket
> > > is still pending review.
> > > Do we have a chance to move it forward somehow?
> > >
> > > BR,
> > > Yuriy Shuliha
> > >
> > > Mon, Sep 30, 2019 at 23:35 Denis Magda <dmagda@apache.org> wrote:
> > >
> > > > Yuriy,
> > > >
> > > > I've seen you opening a pull-request with the first changes:
> > > > https://issues.apache.org/jira/browse/IGNITE-12189
> > > >
> > > > Alex Scherbakov and Ivan are you the right guys to do the review?
> > > >
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Fri, Sep 27, 2019 at 8:48 AM Павлухин Иван <vololo100@gmail.com> wrote:
> > > >
> > > > > Yuriy,
> > > > >
> > > > > Thank you for providing details! Quite interesting.
> > > > >
> > > > > Yes, we already have support for distributed limit and for merging
> > > > > sorted subresults in SQL queries. E.g. ReduceIndexSorted and
> > > > > MergeStreamIterator are used for merging sorted streams.
> > > > >
> > > > > Could you please also clarify the score/relevance part? Is it provided
> > > > > by the Lucene engine for each query result? I am thinking about how to
> > > > > do a sorted merge properly in this case.
> > > > >
> > > > > Wed, Sep 25, 2019 at 18:56, Yuriy Shuliga <shuliga@gmail.com>:
> > > > > >
> > > > > > Ivan,
> > > > > >
> > > > > > Thank you for the interesting question!
> > > > > >
> > > > > > Text searches (or full-text searches) are mostly human-oriented, and
> > > > > > the user's point of interest is the topmost part of the response.
> > > > > > The user can then read it, evaluate it, and use the given records
> > > > > > for further purposes.
> > > > > >
> > > > > > In our case in particular, we use Ignite for operations on financial
> > > > > > data, and there is a lot of text there: asset names, financial
> > > > > > instruments, companies, etc.
> > > > > > To operate on this quickly and reliably, users are used to working
> > > > > > with text search, type-ahead completions, and suggestions.
> > > > > >
> > > > > > For these purposes we index particular string data in separate
> > > > > > caches.
> > > > > >
> > > > > > Sorting capabilities and response size limits are very important
> > > > > > there, as our API has to provide the most relevant information in a
> > > > > > view of limited size.
> > > > > >
> > > > > > Now let me comment from the Ignite/Lucene perspective.
> > > > > > Ignite queries Lucene, and Lucene returns *TopDocs.scoreDocs* already
> > > > > > sorted by *score* (relevance), so the most relevant documents are on
> > > > > > top.
> > > > > > Currently, the responses of a distributed query from different nodes
> > > > > > are merged into the final query cursor queue in arbitrary order.
> > > > > > So in fact the score order is already ruined at this point. Also,
> > > > > > Ignite requests all possible documents from Lucene, which is
> > > > > > redundant and bad for performance.
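
[Editorial sketch] The merge problem described above can be illustrated with a small k-way merge. This is only a minimal sketch, assuming each node returns its hits already sorted by descending score; `ScoreMergeSketch`, `Hit`, and `mergeTop` are hypothetical names and are not part of the Ignite or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ScoreMergeSketch {
    /** Hypothetical stand-in for one scored hit returned by a node. */
    public static final class Hit {
        public final String key;
        public final float score;

        public Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Position inside one node's response (assumed sorted by descending score). */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /** Merges per-node responses and returns at most {@code limit} hits, best score first. */
    public static List<Hit> mergeTop(List<List<Hit>> perNode, int limit) {
        // The heap is ordered so that the cursor with the highest current score is polled first.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head().score).reversed());

        for (List<Hit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        List<Hit> out = new ArrayList<>();

        while (out.size() < limit && !heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head());

            // Re-insert the cursor only if this node still has hits left.
            if (++c.pos < c.hits.size())
                heap.add(c);
        }

        return out;
    }
}
```

With a limit applied on every node as well, each per-node list never needs to be longer than `limit`, so the merge touches at most `limit` entries per node.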
> > > > > >
> > > > > > I'm implementing a *limit* parameter as part of *TextQuery* and have
> > > > > > to note that we still need to add sorting to text query processing
> > > > > > in order to get usable results.
> > > > > >
> > > > > > The *limit* parameter itself should mitigate part of the issues
> > > > > > above, but sorting by document score, at least, should definitely be
> > > > > > implemented along with the limit.
> > > > > >
> > > > > > This is a pretty short commentary; if you still have any questions,
> > > > > > please do not hesitate to ask.
> > > > > >
> > > > > > BR,
> > > > > > Yuriy Shuliha
> > > > > >
> > > > > > Thu, Sep 19, 2019 at 11:38 Павлухин Иван <vololo100@gmail.com> wrote:
> > > > > >
> > > > > > > Yuriy,
> > > > > > >
> > > > > > > Greatly appreciate your interest.
> > > > > > >
> > > > > > > Could you please elaborate a little bit on sorting? What tasks does
> > > > > > > it help to solve, and how? It would be great to provide an example.
> > > > > > >
> > > > > > > Wed, Sep 18, 2019 at 09:39, Alexei Scherbakov <alexey.scherbakoff@gmail.com>:
> > > > > > > >
> > > > > > > > Denis,
> > > > > > > >
> > > > > > > > I like the idea of throwing an exception when text queries are
> > > > > > > > enabled on persistent caches.
> > > > > > > >
> > > > > > > > Also, I'm fine with the proposed limit for unsorted searches.
> > > > > > > >
> > > > > > > > Yury, please proceed with ticket creation.
> > > > > > > >
> > > > > > > > Tue, Sep 17, 2019, 22:06 Denis Magda <dmagda@apache.org>:
> > > > > > > >
> > > > > > > > > Igniters,
> > > > > > > > >
> > > > > > > > > I see nothing wrong with Yury's proposal regarding the full-text
> > > > > > > > > search API evolution, as long as Yury is ready to push it forward.
> > > > > > > > >
> > > > > > > > > As for the in-memory-only mode, it makes total sense for in-memory
> > > > > > > > > data grid deployments where Ignite caches data of an underlying DB
> > > > > > > > > like Postgres.
> > > > > > > > > As part of the changes, I would simply throw an exception (by
> > > > > > > > > default) if one attempts to use text indices with native
> > > > > > > > > persistence enabled.
> > > > > > > > > If the person is ready to live with that limitation, then an
> > > > > > > > > explicit configuration change is needed to get around the
> > > > > > > > > exception.
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -
> > > > > > > > > Denis
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <shuliga@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello to all again,
> > > > > > > > > >
> > > > > > > > > > Thank you for the important comments and notes given below!
> > > > > > > > > >
> > > > > > > > > > Let me answer and continue the discussion.
> > > > > > > > > >
> > > > > > > > > > (I) Overall needs in Lucene indexing
> > > > > > > > > >
> > > > > > > > > > Alexei has referred to
> > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-5371, where the
> > > > > > > > > > absence of index persistence was declared an obstacle to further
> > > > > > > > > > development.
> > > > > > > > > >
> > > > > > > > > > a) This ticket is already closed as not valid.
> > > > > > > > > > b) There are definite needs (in our project as well) for
> > > > > > > > > > in-memory-only indexing of selected data.
> > > > > > > > > > We intend to use search capabilities to fetch a limited number of
> > > > > > > > > > records for type-ahead search / suggestions.
> > > > > > > > > > Not all of the data will be indexed, and there is no need for the
> > > > > > > > > > Lucene index to be persistent. I hope this is a common pattern of
> > > > > > > > > > text-search usage.
> > > > > > > > > >
> > > > > > > > > > (II) Necessary fixes in the current implementation
> > > > > > > > > >
> > > > > > > > > > a) Implementation of a correct *limit* (*offset* seems not to be
> > > > > > > > > > required in text-search tasks for now)
> > > > > > > > > > I have investigated the data flow for distributed text queries. It
> > > > > > > > > > was a simple test prefix query, like 'name'*='ene*'*
> > > > > > > > > > For now, each server node returns all response records to the
> > > > > > > > > > client node, and that may be thousands or hundreds of thousands of
> > > > > > > > > > records, even if we need only the first 10-100. Again, all the
> > > > > > > > > > results are added to the queue in GridCacheQueryFutureAdapter in
> > > > > > > > > > arbitrary order, page by page.
> > > > > > > > > > I did not find any means there to deliver a deterministic result.
> > > > > > > > > > So implementing the limit as part of the query (and
> > > > > > > > > > GridCacheQueryRequest) will not change the nature of the response,
> > > > > > > > > > but will limit the load on nodes and networking.
> > > > > > > > > >
> > > > > > > > > > Can we consider opening a ticket for this?
> > > > > > > > > >
> > > > > > > > > > (III) Further exposure of the Lucene API to Ignite
> > > > > > > > > >
> > > > > > > > > > a) Sorting
> > > > > > > > > > The solution for this could be:
> > > > > > > > > > - Make entities comparable
> > > > > > > > > > - Add a custom comparator to the entity
> > > > > > > > > > - Add annotations to mark sorted fields for Lucene indexing
> > > > > > > > > > - Use comparators when merging responses or reducing to the
> > > > > > > > > > desired limit on the client node.
> > > > > > > > > > This will require the full result set to be loaded into memory,
> > > > > > > > > > though it can be used for relatively small limits.
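
[Editorial sketch] The "reduce to the desired limit with a comparator" step from the list above can be illustrated as follows. This is only a sketch under stated assumptions, not the proposed Ignite implementation: `BoundedTopK` and `topK` are hypothetical names, and the comparator is assumed to order better entries first. It shows one way to retain only `limit` entries while streaming over the responses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class BoundedTopK {
    /**
     * Returns the first {@code limit} entries of {@code results} according to {@code order}
     * (smaller under {@code order} means better), holding at most {@code limit} entries at a time.
     */
    public static <T> List<T> topK(Iterator<T> results, int limit, Comparator<T> order) {
        if (limit <= 0)
            return new ArrayList<>();

        // Reversed order: the heap head is the worst entry retained so far.
        PriorityQueue<T> heap = new PriorityQueue<>(limit, order.reversed());

        while (results.hasNext()) {
            T next = results.next();

            if (heap.size() < limit)
                heap.add(next);
            else if (order.compare(next, heap.peek()) < 0) {
                // The new entry beats the current worst one, so it takes its place.
                heap.poll();
                heap.add(next);
            }
        }

        List<T> out = new ArrayList<>(heap);
        out.sort(order);

        return out;
    }
}
```

For example, `BoundedTopK.topK(entries.iterator(), 10, comparator)` would keep at most ten entries in memory regardless of how many records the nodes send back.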
> > > > > > > > > > BR,
> > > > > > > > > > Yuriy Shuliha
> > > > > > > > > >
> > > > > > > > > > Fri, Aug 30, 2019 at 10:37 Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Yuriy,
> > > > > > > > > > >
> > > > > > > > > > > Note that one of the major blockers for text queries is [1],
> > > > > > > > > > > which makes Lucene indexes unusable with persistence and is the
> > > > > > > > > > > main reason for the discontinuation.
> > > > > > > > > > > It should probably be addressed first to make text queries a
> > > > > > > > > > > valid product feature.
> > > > > > > > > > >
> > > > > > > > > > > Distributed sorting and advanced querying is indeed not a
> > > > > > > > > > > trivial task. Some kind of merging must be implemented on the
> > > > > > > > > > > query-originating node.
> > > > > > > > > > >
> > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371
> > > > > > > > > > >
> > > > > > > > > > > Thu, Aug 29, 2019 at 23:38, Denis Magda <dmagda@apache.org>:
> > > > > > > > > > >
> > > > > > > > > > > > Yuriy,
> > > > > > > > > > > >
> > > > > > > > > > > > If you are ready to take over the full-text search indexes,
> > > > > > > > > > > > then please go ahead. The primary reasons why the community
> > > > > > > > > > > > wants to discontinue them first (and, probably, resurrect them
> > > > > > > > > > > > later) are the limitations listed by Andrey and minimal
> > > > > > > > > > > > support from the community end.
> > > > > > > > > > > >
> > > > > > > > > > > > -
> > > > > > > > > > > > Denis
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <andrey.mashenkov@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yuriy,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unfortunately, there is a plan to discontinue TextQueries in
> > > > > > > > > > > > > Ignite [1].
> > > > > > > > > > > > > The motivation is that text indexes are not persistent, not
> > > > > > > > > > > > > transactional, and can't be used together with SQL or inside
> > > > > > > > > > > > > SQL, and there is a lack of interest from the community side.
> > > > > > > > > > > > > You are welcome to take on these issues and make TextQueries
> > > > > > > > > > > > > great.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. PageSize can't be used to limit the result set.
> > > > > > > > > > > > > Query results are returned from a data node to the client-side
> > > > > > > > > > > > > cursor page by page, and this parameter is designed to control
> > > > > > > > > > > > > the page size. The query is supposed to execute lazily on the
> > > > > > > > > > > > > server side, and the full result set is not expected to be
> > > > > > > > > > > > > loaded into memory on the server side at once, but page by
> > > > > > > > > > > > > page.
> > > > > > > > > > > > > Do you mean you found that Lucene loads the entire result set
> > > > > > > > > > > > > into memory before the first page is sent to the client?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'd think a new parameter should be added to limit the result.
> > > > > > > > > > > > > The best solution is to use query language commands for this,
> > > > > > > > > > > > > e.g. "LIMIT/OFFSET" in SQL.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This task doesn't look trivial. A query is a distributed
> > > > > > > > > > > > > operation: the same user query is executed on the data nodes,
> > > > > > > > > > > > > and then the results from all nodes have to be correctly
> > > > > > > > > > > > > merged before being returned via the client cursor.
> > > > > > > > > > > > > So, LIMIT should be applied on every node and then again in
> > > > > > > > > > > > > the merge phase.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also, this may be non-obvious, but limiting results makes no
> > > > > > > > > > > > > sense without sorting, as there is no guarantee that every
> > > > > > > > > > > > > next query run will return the same data, because of page
> > > > > > > > > > > > > reordering.
> > > > > > > > > > > > > Basically, the merge phase receives results from data nodes
> > > > > > > > > > > > > asynchronously, and messages from different nodes can't be
> > > > > > > > > > > > > ordered.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2.
> > > > > > > > > > > > > a. A "tokenize" param name (for @QueryTextField) looks more
> > > > > > > > > > > > > verbose, doesn't it?
> > > > > > > > > > > > > b, c. What about distributed queries? How will partial results
> > > > > > > > > > > > > from nodes be merged?
> > > > > > > > > > > > > Does Lucene allow configuring a comparator for data sorting?
> > > > > > > > > > > > > What comparator should Ignite choose to sort results in the
> > > > > > > > > > > > > merge phase?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 3. For now, the Lucene engine is not configurable at all. E.g.
> > > > > > > > > > > > > it is impossible to configure the Tokenizer.
> > > > > > > > > > > > > I'd think about possible ways to configure the engine first,
> > > > > > > > > > > > > and only then go further to discuss/implement complex features
> > > > > > > > > > > > > that may depend on the engine config.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <shuliga@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Dear community,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > By starting this thread I'd like to open a discussion that
> > > > > > > > > > > > > > would lead to contributions in the subject area.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ignite has indexing capabilities, backed by different
> > > > > > > > > > > > > > mechanisms, including Lucene.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (last year's release).
> > > > > > > > > > > > > > This is a widespread and mature technology that covers the
> > > > > > > > > > > > > > text search area and beyond (e.g. spatial data indexing).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My goal is to *expose more Lucene functionality to Ignite
> > > > > > > > > > > > > > indexing and query mechanisms for text data*.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's quite a simple request at the current stage. It comes
> > > > > > > > > > > > > > from our project's needs but, I believe, will be useful for
> > > > > > > > > > > > > > a lot more people.
> > > > > > > > > > > > > > Let's walk through the items and vote on or discuss Jira
> > > > > > > > > > > > > > tickets for them.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. [trivial] Use dataQuery.getPageSize() to limit search
> > > > > > > > > > > > > > response items inside GridLuceneIndex.query(). Currently it
> > > > > > > > > > > > > > calls IndexSearcher.search(query, *Integer.MAX_VALUE*), so
> > > > > > > > > > > > > > basically all scored matches will be returned, which we do
> > > > > > > > > > > > > > not need in most cases.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. [simple] Add sorting. Then a more capable search call can
> > > > > > > > > > > > > > be executed: *IndexSearcher.search(query, count, sort)*
> > > > > > > > > > > > > > Implementation steps:
> > > > > > > > > > > > > > a) Introduce a boolean *sortField* parameter in the
> > > > > > > > > > > > > > *@QueryTextField* annotation. If *true*, the field will be
> > > > > > > > > > > > > > indexed but not tokenized. Number types are preferred here.
> > > > > > > > > > > > > > b) Add a *sort* collection to the *TextQuery* constructor.
> > > > > > > > > > > > > > It should define the desired sort fields used for querying.
> > > > > > > > > > > > > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. [moderate] Build complex queries with *TextQuery*,
> > > > > > > > > > > > > > including term/query boosting.
> > > > > > > > > > > > > > *This section is for voting only, as it requires more
> > > > > > > > > > > > > > detailed work. It should be extended if the community is
> > > > > > > > > > > > > > interested in it.*
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Looking forward to your comments!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > BR,
> > > > > > > > > > > > > > Yuriy Shuliha
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Andrey V. Mashenkov
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Ivan Pavlukhin
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >
>


-- 
Best regards,
Andrey V. Mashenkov
