lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-11917) A Potential Roadmap for robust multi-analyzer TextFields w/various options for configuring docValues
Date Fri, 26 Jan 2018 22:21:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341687#comment-16341687
] 

Hoss Man commented on SOLR-11917:
---------------------------------

h1. Some Concrete Thoughts On *S*olutions

*NOTE:* While there is a one-to-one correspondence in the naming/numbering of the *U*secases
listed above and the proposed *S*olutions listed below, I have ordered the *S*olutions in
the way that I think makes the most sense from an "explaining how to achieve things" standpoint.
----
h2. *S1.1*: A 'SortableTextField' that builds docValues using the original text input
h3. *S1.1G*: Goal

A new SortableTextField subclass would be added that would functionally work the same as TextField
except:
 * {{docValues="true|false"}} could be configured, with the default being "true"
 * The docValues would contain (a prefix of) the original input values (just like StrField)
for sorting (or faceting)
 ** By default, to protect users from excessively large docValues, only the first 1024 characters of
each field value would be used – but this could be overridden with configuration.
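
To illustrate the prefix rule, here is a minimal sketch in plain Java (no Solr or Lucene classes). The class name, method name and {{maxChars}} parameter are hypothetical and not taken from any patch, and it assumes "characters" means Java {{char}} units:

```java
/** Hypothetical helper, not part of Solr: illustrates the proposed docValues prefix limit. */
final class DocValuesPrefixSketch {

    /** Default limit named in the proposal. */
    static final int DEFAULT_MAX_CHARS = 1024;

    /**
     * Returns the text that would be used for docValues: the input itself when it is
     * short enough, otherwise only its first maxChars chars.
     * Assumption: the limit counts Java chars (UTF-16 units), so a surrogate pair
     * sitting exactly on the boundary could be split; a real implementation might differ.
     */
    static String docValuesPrefix(String original, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        return original.length() <= maxChars ? original : original.substring(0, maxChars);
    }

    public static void main(String[] args) {
        // A short title is used unchanged as the sort value.
        System.out.println(docValuesPrefix("Solr In Action", DEFAULT_MAX_CHARS));
        // A 2000 char title contributes only its first 1024 chars.
        System.out.println(docValuesPrefix("x".repeat(2000), DEFAULT_MAX_CHARS).length());
    }
}
```

The indexed terms are unaffected by this limit; only the value used for sorting or faceting is shortened.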

h3. *S1.1E*: Example Usage

Consider the following sample configuration:
{code:java}
<field name="title" type="text_sortable"
       indexed="true" docValues="true" stored="true" multiValued="false"/>
<fieldType name="text_sortable" class="solr.SortableTextField">
  <analyzer type="index">
   ...
  </analyzer>
  <analyzer type="query">
   ...
  </analyzer>
</fieldType>
{code}
Given a document with a title of "Solr In Action"

Users could:
 * Search for individual (indexed) terms in the "title" field: {{q=title:solr}}
 * Sort documents by title ( {{sort=title asc}} ) such that this document's sort value would
be "Solr In Action"

If another document had a "title" value that was longer than 1024 chars, then the docValues
would be built using only the first 1024 characters of the value (unless the user modified
the configuration)

NOTE: This would be functionally equivalent to the following existing configuration - including
the on disk index segments - except that the on disk DocValues would refer directly to the
"title" field, reducing the total number of "field infos" in the index (which has a small
impact on segment housekeeping and merge times) and end users would not need to sort on an
alternate "title_string" field name - the original "title" field name would always be used
directly.
{code:java}
<field name="title" type="text"
       indexed="true" docValues="false" stored="true" multiValued="false"/>
<field name="title_string" type="string"
       indexed="false" docValues="true" stored="false" multiValued="false"/>
<copyField source="title" dest="title_string" maxChars="1024" />
{code}
h3. *S1.1A*: Suggested Approach (SOLR-11916)

While experimenting with a quick POC for this idea, I actually wound up building a {{SortableTextField}}
that is feature complete. See patch in SOLR-11916.

NOTE: If/when *S1.3A* is implemented, this SortableTextField could be refactored to be syntactic
sugar for TextField w/ some added defaults – see below.
----
h2. *S1.2*: A 'TermDocValuesTextField' that builds docValues using the post-analysis terms
h3. *S1.2G*: Goal

A new TermDocValuesTextField subclass would be added that would functionally work the same
as TextField except:
 * {{docValues="true|false"}} could be configured, with the default being "true"
 * Instances of fields using this type would support faceting (or sorting), using DocValues
built from the terms produced by the "index" analyzer
 ** NOTE: Sorting on this type of field would only make sense in some special circumstances
depending on the analyzer used (e.g. KeywordTokenizer)

h3. *S1.2E*: Example Usage

Consider the following sample configuration
{code:java}
<field name="keywords" type="text_facet"
       indexed="true" docValues="true" stored="true" multiValued="true"/>
<fieldType name="text_facet" class="solr.TermDocValuesTextField">
  <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory" rule="unicode"/>
   ...
  </analyzer>
</fieldType>

<field name="author" type="text_lc_sort"
       indexed="true" docValues="true" stored="true" multiValued="false"/>
<fieldType name="text_lc_sort" class="solr.TermDocValuesTextField">
  <analyzer>
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
Given a document with an author of "Grainger, Trey" and a keywords value of "book lucene
solr"

Users could:
 * Search for individual (indexed) terms in the "keywords" field: {{q=keywords:book}}
 * Facet on the keywords field ( {{facet.field=keywords}} ) such that if this were the only document
in the index, the facet counts would be "book=1, lucene=1, solr=1"
 * Sort documents by author ( {{sort=author asc}} ) such that this document's sort value would be
"grainger, trey"

NOTE: This should be functionally equivalent to users faceting on a "keywords" TextField (or
sorting on an "author" TextField using KeywordTokenizer) today, except that the facet/sort
values would come from DocValues (written at indexing time), and not the FieldCache (built
on the fly at query time and held solely in RAM).
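
To make the expected results concrete, here is a rough sketch in plain Java (no Solr or Lucene classes) of what faceting and sorting would observe for the example document. The class and method names are hypothetical, and lowercasing the whole value is only a stand-in for KeywordTokenizer plus LowerCaseFilter:

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical illustration only: models the facet and sort values, not the real code path. */
final class TermDocValuesUsageSketch {

    /** Counts, for each term, how many documents contain it (at most one count per document). */
    static Map<String, Integer> facetCounts(List<Set<String>> termsPerDoc) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by term for stable output
        for (Set<String> terms : termsPerDoc) {
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Stand-in for KeywordTokenizer + LowerCaseFilter: the whole value as one lowercased term. */
    static String keywordLowercase(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The single example document: keywords "book lucene solr".
        List<Set<String>> docs = List.of(Set.of("book", "lucene", "solr"));
        System.out.println(facetCounts(docs));                  // {book=1, lucene=1, solr=1}
        System.out.println(keywordLowercase("Grainger, Trey")); // grainger, trey
    }
}
```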
h3. *S1.2A*: Suggested Approach
 * Add a new TermDocValuesTextField subclass of TextField
 * if docValues="true":
 ** Augment the configured "index" analyzer to record each resulting token from the stream
in a Set
 ** When indexing, pre-analyze/buffer the token stream and use the recorded Set of tokens
to build additional SortedSetDocValuesField instances in the underlying indexed document
 * OPTIMIZATION?: We may be able to avoid the pre-analysis/buffering of the TokenStream and
instead hook into the low level indexing code with a callback to generate the SortedSetDocValuesField
instances on the fly as the DocumentsWriter reads from the (original) TokenStream ... needs
experimentation/refactoring once we have some tests.
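
The "record each token in a Set" step above could look roughly like the following plain Java sketch. It is only an approximation under stated assumptions: whitespace splitting stands in for the configured "index" analyzer, the class and method names are hypothetical, and no Lucene classes are used, so the final step of creating one SortedSetDocValuesField per recorded term is only indicated in a comment:

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of the pre-analysis step; whitespace splitting stands in for the analyzer. */
final class TokenRecorderSketch {

    /**
     * Consumes the (simulated) token stream for one field value and records each
     * distinct token. Duplicates collapse because a Set is used.
     */
    static SortedSet<String> recordTokens(String fieldValue) {
        SortedSet<String> recorded = new TreeSet<>();
        for (String token : fieldValue.trim().split("\\s+")) {
            if (!token.isEmpty()) { // split() yields one empty string for blank input
                recorded.add(token);
            }
        }
        return recorded;
    }

    public static void main(String[] args) {
        // "book" occurs twice in the input but is recorded once.
        SortedSet<String> terms = recordTokens("book lucene solr book");
        for (String term : terms) {
            // In the proposal, each recorded term would become one
            // SortedSetDocValuesField instance added to the indexed document.
            System.out.println("docValues term: " + term);
        }
    }
}
```

Buffering like this costs an extra pass over the tokens, which is the overhead the optimization idea above is meant to avoid.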

NOTE: If/when *S1.3A* is implemented, this TermDocValuesTextField could be refactored to be
syntactic sugar for TextField w/ some added defaults – see below.

> A Potential Roadmap for robust multi-analyzer TextFields w/various options for configuring
docValues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11917
>                 URL: https://issues.apache.org/jira/browse/SOLR-11917
>             Project: Solr
>          Issue Type: Wish
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>
> A while back, I was tasked at my day job to brainstorm & design some "smarter field
types" in Solr. In particular to think about:
>  # How to simplify some of the "special things" people have to know about Solr behavior
when creating their schemas
>  # How to reduce the number of situations where users have to copy/clone one "logical
field" into multiple "schema fields" in order to meet different use cases
> The main result of this thought exercise is a handful of usecases/goals that people
seem to have - many of which are already tracked in existing jiras - along with a high level
design/roadmap of potential solutions for these goals that can be implemented incrementally
to leverage some common changes (and what those changes might look like).
> My intention is to use this jira as a place to share these ideas for broader community
discussion, and as a central linkage point for the related jiras. (details to follow in a
very looooooong comment)
> ----
> NOTE: I am not (at this point) personally committing to following through on implementing
every aspect of these ideas :)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

