lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-9526) data_driven configs defaults to "strings" for unmapped fields, makes most fields containing "textual content" unsearchable, breaks tutorial examples
Date Wed, 05 Jul 2017 17:36:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075116#comment-16075116 ]

Steve Rowe edited comment on SOLR-9526 at 7/5/17 5:36 PM:
----------------------------------------------------------

Attaching patch brought up to date with master (in particular, collapsing of {{data_driven_schema_configs}}
and {{basic_configs}} into {{_default}}) - note that your original patch only modified {{solrconfig.xml}}
on one of these and {{managed_schema}} on the other - I assume you had/have local changes
that didn't make it into the patch [~janhoy]?  I made a couple of other changes; details below.

{quote}
See new NOCOMMIT comments. I was using the ManagedIndexSchema method
{code}
public ManagedIndexSchema addCopyFields(String source, Collection<String> destinations, int maxChars)
{code}
which does not have a {{persist=true/false}} argument, so calling it leaves the schema not
persisted. Then I could not find a way to explicitly persist it since method
{{boolean persistManagedSchema(boolean createOnly)}}
was not public. In this patch I've made it public and done a hacky instanceof check in AddSchemaFieldsUpdateProcessorFactory
{code}
if (newSchema instanceof ManagedIndexSchema) {
  // NOCOMMIT: Hack to avoid persisting schema once after addFields and then once after each copyField
  ((ManagedIndexSchema)newSchema).persistManagedSchema(false);
}
{code}
Steve Rowe, you wrote the {{addCopyFields()}} method a while ago, is there a cleaner way to
make sure schema is persisted after adding a copyField?
{quote}

The design of {{ManagedIndexSchema}}'s API was in support of the Schema REST API, where each
resource was modifiable one at a time; "bulk" modifications weren't possible.  In the new
bulk schema API, though, the ordinary case involves multiple modifications; in this case,
it is counter-productive to persist in the middle of a set of operations.

SOLR-6476 (introducing schema "bulk" mode) added the option to *not* persist the schema after
an operation; previously every operation was automatically persisted.  This was added as an
option because at the time, bulk and REST modes co-existed.   SOLR-7682 added the ability
to specify maxChars for copyField directives, and I intentionally left off the {{persist}}
option of the new {{addCopyFields()}} method, because there was (intentionally) no way to
invoke this capability via the (now deprecated) schema REST API, and the bulk schema API didn't
need the {{persist}} option.

Long story short: I think making {{persistManagedSchema()}} public is a natural consequence
of the bulk schema API (and in support of bulk operations from other sources, e.g. this issue).
 It's just that nobody had gotten around to it yet.  
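
To illustrate why persisting once per batch is preferable, here is a minimal sketch. {{ToySchema}} and all of its methods are hypothetical stand-ins invented for this example, not Solr's {{ManagedIndexSchema}}; the only thing borrowed from the real API is the shape of {{addCopyFields()}} quoted above, which takes no {{persist}} argument. "Persisting" here only increments a counter.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class BulkPersistSketch {

  /** Hypothetical stand-in for a managed schema; this is NOT Solr's ManagedIndexSchema. */
  static final class ToySchema {
    private final List<String> entries;
    private final int[] writes; // shared counter: how many times the schema was "written"

    ToySchema() {
      this(new ArrayList<>(), new int[1]);
    }

    private ToySchema(List<String> entries, int[] writes) {
      this.entries = entries;
      this.writes = writes;
    }

    /** Returns a new schema with the field added, optionally persisting right away. */
    ToySchema addField(String name, boolean persist) {
      List<String> copy = new ArrayList<>(entries);
      copy.add("field:" + name);
      ToySchema next = new ToySchema(copy, writes);
      if (persist) {
        next.persist();
      }
      return next;
    }

    /** No persist flag here, so this method never persists on its own. */
    ToySchema addCopyFields(String source, Collection<String> destinations, int maxChars) {
      List<String> copy = new ArrayList<>(entries);
      for (String dest : destinations) {
        copy.add("copyField:" + source + "->" + dest + " maxChars=" + maxChars);
      }
      return new ToySchema(copy, writes);
    }

    /** Stands in for writing the schema out; here it only counts the call. */
    boolean persist() {
      writes[0]++;
      return true;
    }

    int writeCount() {
      return writes[0];
    }
  }

  /** Persist after every operation: one write per change. */
  static int writesWhenPersistingEachStep() {
    ToySchema s = new ToySchema();
    s = s.addField("title", true);
    s = s.addCopyFields("title", Collections.singletonList("title_str"), 256);
    s.persist(); // explicit, because addCopyFields has no persist flag
    return s.writeCount();
  }

  /** Apply the whole batch first, then persist exactly once. */
  static int writesWhenPersistingOnce() {
    ToySchema s = new ToySchema();
    s = s.addField("title", false);
    s = s.addCopyFields("title", Collections.singletonList("title_str"), 256);
    s.persist();
    return s.writeCount();
  }

  public static void main(String[] args) {
    System.out.println("writes, persisting each step: " + writesWhenPersistingEachStep());
    System.out.println("writes, persisting once:      " + writesWhenPersistingOnce());
  }
}
```

The second variant only works if the caller can trigger the final persist explicitly, which is the role a public {{persistManagedSchema()}} would play.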

In {{AddSchemaFieldsUpdateProcessorFactory.processAdd()}} in my patch I removed the {{instanceof
ManagedIndexSchema}} check wrapping the call to {{persistManagedSchema()}}, as well as the
{{NOCOMMIT}} comments, since the check {{if ( ! cmd.getReq().getSchema().isMutable())}} at the beginning
of the method already ensures that we're dealing with a {{ManagedIndexSchema}}.

I also removed the following {{typeMapping}} that was added in your patch from URP chains
{{add-fields-no-run-processor}} and {{parse-and-add-fields}} in {{solrconfig-add-schema-fields-update-processor-chains.xml}}
- I'm assuming this is a vestige from an earlier concept of removing {{<defaultTypeMapping>}},
since these chains have {{<str name="defaultFieldType">text</str>}}?  {{AddSchemaFieldsUpdateProcessorFactoryTest}}
passes with my change:

{code:xml}
<lst name="typeMapping">
  <str name="valueClass">java.lang.String</str>
  <str name="fieldType">text</str>
</lst>
{code}


> data_driven configs defaults to "strings" for unmapped fields, makes most fields containing "textual content" unsearchable, breaks tutorial examples
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9526
>                 URL: https://issues.apache.org/jira/browse/SOLR-9526
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors
>            Reporter: Hoss Man
>            Assignee: Jan Høydahl
>              Labels: dynamic-schema
>             Fix For: 7.0
>
>         Attachments: SOLR-9526.patch, SOLR-9526.patch, SOLR-9526.patch, SOLR-9526.patch, SOLR-9526.patch
>
>
> James Pritchett pointed out on the solr-user list that this sample query from the quick start tutorial matched no docs (even though the tutorial text says "The above request returns only one document")...
> http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=name:foundation
> The root problem seems to be that the add-unknown-fields-to-the-schema chain in data_driven_schema_configs is configured with...
> {code}
> <str name="defaultFieldType">strings</str>
> {code}
> ...and the "strings" type uses StrField and is not tokenized.
> ----
> Original thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201609.mbox/%3CCAC-n2zRPsspfnK43AGeCspchc5b-0FF25xLfnzogYuVyg2dWbw@mail.gmail.com%3E
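
To make the root cause in the issue description concrete: an untokenized string field matches only when the whole stored value equals the query term, whereas a tokenized text field matches individual words. The sketch below is a rough illustration only. It uses no Solr or Lucene classes, the helper names are hypothetical, the sample value is made up, and real analyzers do considerably more than lowercase and split on whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TokenizationSketch {

  /** Untokenized behavior: the entire stored value is the only matchable term. */
  static boolean matchesAsString(String stored, String term) {
    return stored.equals(term);
  }

  /** Very rough stand-in for text analysis: lowercase, then split on whitespace. */
  static List<String> tokenize(String value) {
    return Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+"));
  }

  /** Tokenized behavior: any single token can match the query term. */
  static boolean matchesAsText(String stored, String term) {
    return tokenize(stored).contains(term.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    String name = "The Foundation Trilogy"; // made-up sample value
    System.out.println("string match: " + matchesAsString(name, "foundation")); // false
    System.out.println("text match:   " + matchesAsText(name, "foundation"));   // true
  }
}
```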



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

