lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-11917) A Potential Roadmap for robust multi-analyzer TextFields w/various options for configuring docValues
Date Fri, 26 Jan 2018 22:22:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341692#comment-16341692
] 

Hoss Man commented on SOLR-11917:
---------------------------------

h2. *S2.1*: Easy Multi-Language Querying (SOLR-6492)
h3. *S2.1G*: Goal

Simplified indexing & querying of text in diff languages w/o the query clients being _required_
to know about a lot of language specific variant field names. At index time we want things
to be "easy" for clients sending documents, regardless of whether they already know the lang
of each field value in advance, or if they want solr to do language detection.
h3. *S2.1A*: Suggested Approach
{panel:title=Refresher: Summary of Solr In Action (SIA) code linked to from SOLR-6492}
*What's included & how it works...*
 * custom update processor & custom field type
 ** processor is subclass of existing lang detect update processor
 *** super class normally adds a field with languages in doc, or renames fields to include
language (ie: text => text_de)
 ** field type is subclass of TextField
 *** goes out of its way to override any Analyzer config with a custom one (details below)
 *** configured with a list of mappings from langid to other (existing) field types
 * Index Time:
 ** update processor delegates to super to detect languages but instead of (in addition to?)
super class's behavior of adding a language field to doc, or renaming the field with suffix,
the custom processor "decorates" the values with the detected language(s)...
 ** for any field where the field type is our custom type:
 *** "decorate" each of the field values with either:
 **** the langs of the whole doc
 **** the langs of the field (after re-running lang detect on all values in just that field)
 **** the langs of the individual field value (after re-running lang detect on just that field
value)
 ** other processors can then run as normal, and eventually the IndexSchema is asked to build
up the IndexableFields for this doc, and it delegates to the (custom) field type for these
"decorated" fields...
 ** field type's custom analyzer looks for these lang "decorations" on each field value
 *** for every lang found, go fetch the analyzer from the mapped field type it's configured
with
 *** create a token stream that delegates to all the other analyzers & merges the resulting
token streams
 *** all this custom delegation/merging tokenstream stuff is (optionally/wisely) wrapped in
RemoveDuplicatesTokenFilter since there can be lots of dup tokens for similar languages.
 * Query Time:
 ** the query string provided by the user can be "decorated" with a list of languages
 ** the normal plumbing of TextField analyzes the query string, delegating to the various
analyzers
 *** AFAICT: this means MultiPhraseQuery instances are frequently produced?
 * NOTE: as mentioned in Trey's LR talk for 2014, a "perk" of this solution (over using diff
fields per languages) is that mixing languages in one field value can – in theory – still
produce useful phrase queries, even if the non-correct analyzers butcher the terms in other
languages such that a single phrase produced by either language analyzer wouldn't match the
original string
 ** [https://www.youtube.com/watch?v=MQ6WtBw8T_U]
 ** BUT: it's not really clear if/how useful/important this is. _Does any one have any actual
usecases for this???_

*The Fiddly / Awkward / Problematic Bits Of All This Existing Code*
 * language "decoration" is super hackish
 ** index time:
 *** the update processor prepends them as a string
 *** not a lot of easy improvements currently possible given the current SolrInputDocument
/ UpdateProcessor / DocumentBuilder structure / code paths
 **** fixing this "THE RIGHT WAY" would probably require some pretty big changes to all this
code so SolrInputField could support arbitrary metadata (instead of just "boost" like it does
today) and passing the SolrInputFields all the way to the FieldType's createFields method
 **** the hackish way to do this might be to follow in the footsteps of atomic update with
"field value may be a map containing magic keys", but...
 ***** this would probably break Atomic Updates (unexpected keys in the Maps it thinks it
owns)
 ***** this was already a super heinous API hack and hacks this heinous should not be rewarded
by being copied
 ***** Even if we did this, i'm not certain the FieldType's createFields() would get the full
Map w/o a bunch of other changes in the middle – if we're going to have to change existing
DocumentBuilder/IndexSchema code to make this work, let's not be heinous about it.
 ** query time:
 *** user must prefix the terms _inside_ the query strings – _after_ the field name
 *** example: {{q=my_multi_lang_field:"en,es|Hello there compadre"}}
 *** fixing this in a sane way should be really straight forward...
 **** all of the "public Query getFoo(...)" methods a FieldType must implement take in the
QParser originating the query
 **** we can ask the QParser for the local/req params
 **** so syntax like "q= \{!field f=body langs='en,es'}Hello there compadre" would be easy
to support
 * the tokenstream merging slurps in the entire Reader as a String on first use, then pre-analyzes
using every analyzer and builds up an in memory LinkedList<Token>
 ** why is this needed? why can't we just cache _one_ "Token" per Analyzer? ie...
 *** each call to incrementToken calls incrementToken on any delegate analyzer where we
don't have a cached token (and says it's not done with the input)
 *** then return (and null out) whichever cached token has the lowest position
 ** also: since this is super custom code and we know the way our analyzer is getting used
is from our custom FieldType, why mess with a Reader -> String at all since we know for
certain the Reader is a StringReader
 *** ie: bypass the normal "this.getAnalyzer()" and just give the "Analyzer" the original
String ?{panel}
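The "cache _one_ Token per Analyzer" idea from the refresher above can be sketched in plain Java. This is only an illustration under stated assumptions: {{Tok}} and {{LowestPositionMerger}} are hypothetical names, the delegates are plain iterators rather than Lucene TokenStreams, and tokens carry absolute positions (a real implementation would have to convert back to position increments).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a token: term text plus absolute position. */
final class Tok {
    final String term;
    final int position;
    Tok(String term, int position) { this.term = term; this.position = position; }
}

/** Sketch only: merges position-ordered token sources, caching one token per source. */
class LowestPositionMerger implements Iterator<Tok> {
    private final List<Iterator<Tok>> delegates;
    private final Tok[] cached; // one slot per delegate; null means "needs a refill"

    LowestPositionMerger(List<Iterator<Tok>> delegates) {
        this.delegates = delegates;
        this.cached = new Tok[delegates.size()];
    }

    /** Refill every empty slot whose delegate still has input. */
    private void fill() {
        for (int i = 0; i < cached.length; i++) {
            if (cached[i] == null && delegates.get(i).hasNext()) {
                cached[i] = delegates.get(i).next();
            }
        }
    }

    @Override
    public boolean hasNext() {
        fill();
        for (Tok t : cached) {
            if (t != null) return true;
        }
        return false;
    }

    @Override
    public Tok next() {
        fill();
        int best = -1;
        for (int i = 0; i < cached.length; i++) {
            if (cached[i] != null
                    && (best < 0 || cached[i].position < cached[best].position)) {
                best = i; // ties go to the earlier delegate
            }
        }
        if (best < 0) throw new NoSuchElementException();
        Tok out = cached[best];
        cached[best] = null; // null out so the next call refills this slot
        return out;
    }

    /** Convenience for demos: drain the merged stream into a list of terms. */
    static List<String> mergeTerms(List<List<Tok>> sources) {
        List<Iterator<Tok>> its = new ArrayList<>();
        for (List<Tok> s : sources) its.add(s.iterator());
        LowestPositionMerger merger = new LowestPositionMerger(its);
        List<String> out = new ArrayList<>();
        while (merger.hasNext()) out.add(merger.next().term);
        return out;
    }
}
```

Memory use is bounded by the number of delegates rather than the number of tokens. Duplicate terms at the same position still come through, which is where something like RemoveDuplicatesTokenFilter would sit on top.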
h4. *S2.1.STRAW1*: Straw Man Proposal #1 – aka "Complex For Users"

Existing SIA Code + query time local params
 * keep most of the existing SOLR-6492 code as is
 ** all indexing code and update processor sub class stay the same
 *** including the hackish way we have to prefix-decorate the langs on field values at index
time
 *** hopefully fix the analyzer to be more efficient
 ** at query time:
 *** in the FieldType: override the "public Query getFoo(...)" methods to look at the local/req
params for a langs
 *** use those langs when using our custom (wrapper) analyzer
 *** *NOTE:* This cleaner query time API still has a hitch – see *S2.1.HITCH* below
 * This approach seems more "complex" to explain to users than the strawman #2 (*S2.1.STRAW2*)
below
 ** particularly given the dependency on the new update processor (or users adding magic field
value decoration at index time)
 ** and especially if/when they gain more experience with solr and want to understand more
what's happening under the covers and how to tweak/customize behavior.
 ** see full pro/con list below

h4. *S2.1E.STRAW1*: Hypothetical Example Usage of this *S2.1.STRAW1* Strawman...
{code:xml}
<field name="title" type="langaware" />
<field name="body" type="langaware" />

<fieldType name="langaware" class="solr.MultiLangAwareTextField"
           defaultFieldType="text_general"
           fieldMappings="en:text_english,
                          la:text_latin,
                          fr:text_french"/>
<fieldType name="text_general" ... />
<fieldType name="text_english" ... />
<fieldType name="text_french" ... />
<fieldType name="text_latin" ... />
{code}
{code:xml}
<!-- doc sent by client using new custom update processor -->
<doc>
  <field name="title">Solr In Action</field>
  <field name="body">Ipsum Lorem ... thousands of pages of text</field>
</doc>

<!-- doc sent by client that knows what lang these fields are -->
<doc>
  <field name="title">en|Solr In Action</field>
  <field name="body">la|Ipsum Lorem ... thousands of pages of text</field>
</doc>

{code}
{noformat}
# Uses the lang specific analysis the user asked for
/query?q={!lang=la}body:Lorem&fq={!field f=title lang=en}Action

# Falls back to the text_general analysis since no lang is known
/query?q=body:Lorem&fq={!field f=title}Action
{noformat}
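For reference, the "langs|value" prefix decoration used in the example docs above (and in the {{"en,es|Hello there compadre"}} query example) could be split apart roughly like this. A minimal sketch, not the SIA code: {{DecoratedValue}} is a hypothetical name, and the assumption that lang codes are 2-3 lowercase letters is mine.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: splits a "langs|text" decorated value into its parts. */
class DecoratedValue {
    // Assumption: lang codes are 2-3 lowercase letters, comma separated, followed by '|'.
    private static final Pattern PREFIX =
            Pattern.compile("([a-z]{2,3}(?:,[a-z]{2,3})*)\\|(.*)", Pattern.DOTALL);

    final List<String> langs;
    final String text;

    private DecoratedValue(List<String> langs, String text) {
        this.langs = langs;
        this.text = text;
    }

    /** Values without a recognizable prefix come back with an empty lang list. */
    static DecoratedValue parse(String raw) {
        Matcher m = PREFIX.matcher(raw);
        if (m.matches()) {
            return new DecoratedValue(Arrays.asList(m.group(1).split(",")), m.group(2));
        }
        return new DecoratedValue(Collections.<String>emptyList(), raw);
    }
}
```

An undecorated value falls through with no langs, which corresponds to the "fall back to defaultFieldType" case. Note that any real value that happens to start with something like "ab|" would be misread, which is part of why this kind of in-band decoration is hackish.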
h4. *S2.1.STRAW2*: Straw Man Proposal #2 – aka "Simple for Users"

Override only the Query parsing bits of TextField (or super duper text field)
 * continue using diff fields per language (either dynamic or explicitly) in schema
 * continue using the existing clone / lang detect update processors to copy/rename
fields (ie: title => title + title_es)
 * let the analyzer for types like "text_es" do its regular analysis and indexing into the
underlying fields like "title_es"
 * let types like "text_multilang" for fields like "title" be a new QueryLangAwareProxyTextField
that extends TextField
 ** still supports a direct analyzer configuration for its "default" behavior (ie: something
simple that is as lang agnostic as possible, aka: text_general)
 ** at index time, just does its regular indexing with its configured analyzer
 ** at query time:
 *** if the QParser's params don't indicate a language, do a normal query against the specified
field
 *** if the QParser does specify some languages:
 **** build up a list of lang specific field names using the current field name + the languages
(ie "title" + "_" + "es")
 ***** fetch the FieldTypes for each of those field names from the IndexSchema
 ***** delegate to the equivalent "public Query getFoo" for each of those FieldTypes, wrap
the results in a DisjunctionMaxQuery
 *** *NOTE:* This use of QParser params still has the same hitch as strawman #1 (*S2.1.STRAW1*)
– see *S2.1.HITCH* below
 * This approach seems simpler to explain to new users than strawman #1 (*S2.1.STRAW1*)
 ** particularly given that it can be useful (in a clean way) even w/o any (new) langid update
processors when users already know the language of the fields for each doc, but just want
simplified querying.
 ** see full pro/con list below
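
The query time steps above (field name + "_" + lang, then one clause per resolved field) might look something like the following. A rough sketch only: {{LangFieldProxySketch}} is a hypothetical name, a plain Set stands in for the IndexSchema lookup (so dynamic fields are not modeled), the query is rendered as a string rather than a real DisjunctionMaxQuery, and skipping unknown fields is my assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch only: resolves a proxy field plus requested langs to concrete field names. */
class LangFieldProxySketch {

    /**
     * Builds "field_lang" names for each requested lang.
     * Assumption: names the schema does not know are skipped, and if nothing
     * resolves (or no langs were given) we fall back to the plain field.
     */
    static List<String> targetFields(String field, List<String> langs, Set<String> knownFields) {
        List<String> out = new ArrayList<>();
        for (String lang : langs) {
            String candidate = field + "_" + lang;
            if (knownFields.contains(candidate)) {
                out.add(candidate);
            }
        }
        if (out.isEmpty()) {
            out.add(field);
        }
        return out;
    }

    /** Illustrative rendering of "one clause per field, wrapped in a disjunction". */
    static String describeQuery(List<String> fields, String text) {
        if (fields.size() == 1) {
            return fields.get(0) + ":" + text;
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(" | ");
            sb.append(fields.get(i)).append(':').append(text);
        }
        return sb.append(')').toString();
    }
}
```

In the real thing each clause would come from the delegate FieldType's own "public Query getFoo" method, so each language's query analyzer gets applied to its own field.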

h4. *S2.1E.STRAW2*: Hypothetical Examples of this *S2.1.STRAW2* Strawman #2...
{code:xml}
<field name="title" type="langaware" />
<field name="body" type="langaware" />

<fieldType name="langaware" class="solr.QueryLangAwareProxyTextField">
  <!-- no special mappings needed, just simple lang agnostic default analyzers -->
  <analyzer type="index" ... />
  <analyzer type="query" ... />
</fieldType>

<dynamicField name="*_en" type="text_english" ... />
<dynamicField name="*_fr" type="text_french" ... />
<dynamicField name="*_la" type="text_latin" ... />

<fieldType name="text_english" ... />
<fieldType name="text_french" ... />
<fieldType name="text_latin" ... />
{code}
{code:xml}
<!-- sample doc sent by client using langid update processor -->
<!-- title copied to title_en, body copied to body_la -->
<doc>
  <field name="title">Solr In Action</field>
  <field name="body">Ipsum Lorem ... thousands of pages of text</field>
</doc>

<!-- sample doc sent by client that knows what lang these fields are -->
<!-- CloneFieldUpdateProcessor or something simple like it can copy these to "title" &
"body" -->
<doc>
  <field name="title_en">Solr In Action</field>
  <field name="body_en">Ipsum Lorem ... thousands of pages of text</field>
</doc>

{code}
{noformat}
# rewrites the queries against the lang specific versions using the langs the user asked for
/query?q={!lang=la}body:Lorem&fq={!field f=title lang=en}Action

# Falls back to the default analysis (configured on 'langaware' type) since no 'lang' is specified
/query?q=body:Lorem&fq={!field f=title}Action

# user can still choose to sort on, or filter against, the existence of data in specific language fields
/query?q=body_la:Lorem&sort=title_la asc
{noformat}
h4. *S2.1.HITCH*: One Hitch @ Query Time To Both Strawmen

Currently, SolrQueryParserBase/QueryBuilder sometimes uses the "Analyzer" (IndexSchema's per
field wrapper) directly w/o delegating to ft.getFieldQuery(...).

*Best solution I can think of:*
 * SolrQueryParserBase should override createFieldQuery(Analyzer,...) in a way that it can
delegate to the FieldType
 ** must happen in such a way that the FieldType can make a callback to the low level QueryBuilder.createFieldQuery
– otherwise we'll have to copy/paste a lot of existing code.
 ** NOTE: QueryBuilder.createFieldQuery is currently protected.
 * This callback should involve a QParser (like the existing "public Query getFoo" methods
on FieldType) to access the flags/params to capture some of the QueryBuilder state / variables
passed to createFieldQuery
 ** in our special case, we ignore the specified Analyzer and pick one at query time
 * GENERAL IMPROVEMENT IDEA:
 ** maybe QParser should extend SolrQueryParserBase/QueryBuilder and automatically call some
QueryBuilder setter methods based on common local params (like "df", "f", "q.op", etc...)
 ** some existing QParsers (like LuceneQParser and ExtendedDismaxQParser) could then be refactored
to do their query parsing directly (instead of the QParser instantiating a custom subclass
of SolrQueryParser)
 ** this would potentially simplify a variety of existing QParser subclasses
 ** could also simplify some FieldType.getFoo methods that currently call "new FooQuery" –
they could instead delegate back to QParser.newFooQuery
 ** if we did this, then the callback mechanism needed for these strawmen ideas would be (mostly?)
straight forward:
 *** QParser would override QueryBuilder.createFieldQuery(Analyzer,...) to delegate to the
FieldType's getFieldQuery, passing in a nested/sub-QParser with the various method call specific
options included as state/params
 *** QParser would also expose a new public method that the FieldType could call back to that
would ultimately call super.createFieldQuery(Analyzer,...)

*Hypothetical (Broken) Alternative:*
 * we could consider eliminating the analyzers "cache" that IndexSchema uses (only helpful
for non-dynamic fields) and change getQueryAnalyzer to take in a QParser that can capture some
query/request state so that the FieldType can customize the Analyzer behavior
 * then our special field type can delegate to a completely diff FieldType
 * The Problem With This Alternative:
 ** the field *name* that QueryBuilder then uses in the underlying Query objects would still
be "wrong"
 ** This would _not_ be a problem with the callback approach discussed above, because our
new FieldType could call back to the QueryBuilder methods w/any field name + analyzer
pair it wanted.
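
To make the shape of the callback idea concrete, here is a toy model in plain Java. Everything in it is hypothetical ({{CallbackSketch}}, {{Parser}}, {{FieldTypeHook}}, the string "queries", and the toy analyzers); none of it is existing Solr or Lucene API. It only shows the control flow: the parser always delegates to the field type, and the field type calls back with whatever field name + analyzer pair it picks.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: parser delegates to the field type, which calls back into the parser. */
class CallbackSketch {

    /** Hypothetical field type hook: gets the parser so it can read params and call back. */
    interface FieldTypeHook {
        String getFieldQuery(Parser parser, String field, String text);
    }

    /** Hypothetical parser: owns the local params and the low-level query construction. */
    static final class Parser {
        final Map<String, String> localParams;

        Parser(Map<String, String> localParams) {
            this.localParams = localParams;
        }

        /** Entry point: delegate to the field type instead of using an analyzer directly. */
        String fieldQuery(FieldTypeHook type, String field, String text) {
            return type.getFieldQuery(this, field, text);
        }

        /** Callback: low-level construction for whatever field + analyzer the type picked. */
        String buildFieldQuery(String field, Function<String, String> analyzer, String text) {
            return field + ":" + analyzer.apply(text);
        }
    }

    /** Toy analyzers; real ones would produce token streams, not strings. */
    static final Function<String, String> DEFAULT = s -> s.toLowerCase(Locale.ROOT);
    static final Function<String, String> SPANISH =
            s -> s.toLowerCase(Locale.ROOT).replaceAll("s$", "");

    /** Proxy type: picks the field name and analyzer from the 'lang' param at query time. */
    static final FieldTypeHook PROXY = (parser, field, text) -> {
        String lang = parser.localParams.get("lang");
        if ("es".equals(lang)) {
            return parser.buildFieldQuery(field + "_es", SPANISH, text);
        }
        return parser.buildFieldQuery(field, DEFAULT, text);
    };
}
```

Because the field type supplies the field name as well as the analyzer, the resulting query targets "title_es" rather than "title", which is exactly what the analyzer-only alternative can't do.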

h4. *S2.1.PROCON*: Pros/Cons of the two Strawmen
 * The *S2.1.STRAW2* approach seems simpler to understand/explain to users
 ** no special/magic field types they have to declare and reference from another field
 *** ie: they declare/manage title, title_en, title_es, title_de fields – the only thing
special is that querying the "title" field can proxy to the others as well _when the query
requests it_
 *** this means this approach is also automatically compatible with people who want to explicitly
index multiple fields for each language:
 **** ie: they already have/know an "english title" and a "translated spanish title" in the
source docs, and don't need any index side (langdetect/copyfield) help – our new field type
just helps make the query side simple/easy to use.
 ** also plays nicely if we decide to do the SortableTextField described above (*S1.1*) and
want to extend it here:
 *** the user still has distinct fields for "title", "title_es", etc... and can choose to
sort on any of them
 **** even in the trivial case, where they only have one original field value per doc (which
lang detect also copied to title_XX), they probably always want to sort on the general "title"
field – "keep simple stuff simple"
 ** can be implemented completely independently/orthogonally from all the ideas discussed
here
 ** the downsides of *S2.1.STRAW2* are:
 *** doesn't give us the "mixed languages in a single field value" phrase query benefit (which
seemed out of scope? do we have usecases like this we care about?)
 *** doesn't "save space" like single field approach when multi languages produce same tokens
 **** Although i'm not convinced that's fundamentally true – or even beneficial: since any
space savings from diff languages producing the same underlying "term" text may be offset
by potential false positives in phrase matches (since we're assuming that even if multiple
(guessed) languages may be specified at query time, the query string is expected to be in
a single language and searching across those multiple languages should be done independently)

 * the *S2.1.STRAW1* approach seems like it would be harder to explain to novice users
 ** ie: the special configuration of referring to (otherwise seemingly unused) fieldtypes from
special fieldtypes
 *** we've had this in the past with some things like ExternalFileField & CurrencyField
and it's always confusing
 ** the special prefix decoration of langs at indexing time also means this approach either
*requires* users learn about & use the new update processor (ie: the features are locked
together), or requires some explanation of how clients must decorate the field values
 ** which also means that this approach would also not play very nicely with people who have
pre-translated field values at index time
 *** we could potentially offer a "prepend lang code update processor" to make it easy to
massage their data for them
 **** ex: {{title_es:"Hola Juan", title_en:"Hello John"}} ==> {{title:["es|Hola Juan",
"en|Hello John"]}}
 ** if we extend the SortableTextField (*S1.1*) idea...
 *** only the trivial usecases (each logical field is only in one language) play nicely with
sorting
 *** if a user starts with multiple different "translated" fields – and has to consolidate
them as multiple field values in a single field (with our hypothetical "prepend lang code
update processor") then they don't really have any way to "sort on the spanish title" with
this approach
 **** unless of course they *also* redundantly index every lang variant as its own field
– but then most of the benefits of this approach are out the window (ie: there are no fieldname/configuration/disk
savings as compared to the other strawman)
 ** The key upside i can think of for *S2.1.STRAW1*:
 *** If we first focus on *S2.2* (see below), then the schema syntax could potentially be
simplified to remove the "lang -> some other fieldType name" mapping and instead use lots
of nested analyzers named after each language
 *** this might still be a bit confusing however if people want diff index/query(/multiTerm)
analyzers for each language ... would have to use some sort of rigid naming convention?
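
For illustration, the hypothetical "prepend lang code update processor" mentioned in the list above boils down to a transformation like this one. A sketch under assumptions: {{PrependLangSketch}} is a made-up name, plain Maps stand in for the input document, and the set of recognized lang suffixes is passed in explicitly so that fields like "user_id" are not mistaken for a language variant.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch only: folds "field_lang" values into one multi-valued, lang-prefixed field. */
class PrependLangSketch {

    /**
     * For every "name_xx" field whose suffix is a configured lang, emits "xx|value"
     * under "name". All other fields pass through unchanged.
     */
    static Map<String, List<String>> consolidate(Map<String, String> doc, Set<String> langs) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String name = e.getKey();
            String value = e.getValue();
            int underscore = name.lastIndexOf('_');
            String suffix = underscore > 0 ? name.substring(underscore + 1) : "";
            if (langs.contains(suffix)) {
                name = name.substring(0, underscore);
                value = suffix + "|" + value;
            }
            out.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return out;
    }
}
```

Once the values are folded together like this, there is no longer a distinct "title_es" to sort on, which is the sorting downside described above.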

{panel}
*NOTE:* If either strawman is implemented, we should strongly consider including an additional
option/subclass of this new "*LangAwareTextField" to automatically use the langid plugin code
at query time to try and "guess" the lang if it isn't specified in a 'lang' local/request
param
 * at least for language-detect (latest version), there are special models built just for
short inputs
 * we could potentially make the code use the guessed lang at query time only if above some
configured confidence:
 ** or: if explicit 'lang' param, use only that lang – but if the language is guessed, query
using both the field/analyzer for that specific lang as well as the 'default' field/analyzer{panel}
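
The decision rule in the note above (explicit lang wins; a guessed lang is only trusted above a threshold, and then queried alongside the default) could be sketched as follows. Assumptions: {{QueryLangGuessSketch}} is a hypothetical name, the guess and its confidence are supplied by the caller rather than by any real langid library, and the "field_lang" naming follows the *S2.1.STRAW2* examples.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch only: chooses which fields to query from explicit vs guessed language info. */
class QueryLangGuessSketch {

    /**
     * @param explicitLang  value of the 'lang' param, or null if the client sent none
     * @param guessedLang   detector output for the query string, or null if no guess
     * @param confidence    detector confidence for guessedLang
     * @param minConfidence configured threshold below which the guess is ignored
     */
    static List<String> fieldsToQuery(String field, String explicitLang,
                                      String guessedLang, double confidence,
                                      double minConfidence) {
        if (explicitLang != null) {
            // The client told us the language: use only that one.
            return Collections.singletonList(field + "_" + explicitLang);
        }
        if (guessedLang != null && confidence >= minConfidence) {
            // A guess may be wrong, so also query the default field.
            return Arrays.asList(field + "_" + guessedLang, field);
        }
        return Collections.singletonList(field);
    }
}
```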
 

 

> A Potential Roadmap for robust multi-analyzer TextFields w/various options for configuring
docValues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11917
>                 URL: https://issues.apache.org/jira/browse/SOLR-11917
>             Project: Solr
>          Issue Type: Wish
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>
> A while back, I was tasked at my day job to brainstorm & design some "smarter field
types" in Solr. In particular to think about:
>  # How to simplify some of the "special things" people have to know about Solr behavior
when creating their schemas
>  # How to reduce the number of situations where users have to copy/clone one "logical
field" into multiple "schema fields" in order to meet diff use cases
> The main result of this thought exercise is a handful of usecases/goals that people
seem to have - many of which are already tracked in existing jiras - along with a high level
design/roadmap of potential solutions for these goals that can be implemented incrementally
to leverage some common changes (and what those changes might look like).
> My intention is to use this jira as a place to share these ideas for broader community
discussion, and as a central linkage point for the related jiras. (details to follow in a
very looooooong comment)
> ----
> NOTE: I am not (at this point) personally committing to following through on implementing
every aspect of these ideas :)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

