lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: DataImportHandler with a managed-schema only import id and version
Date Wed, 10 Aug 2016 09:02:02 GMT
Seems you might be right, according to the source:
https://github.com/apache/lucene-solr/blob/master/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java#L662

Sometimes, the magic (and schemaless is rather magical) fails when
combined with older assumptions (and DIH is kind of legacy).

You can still declare dynamic fields and use a prefix/suffix to map to
the types. That would work just fine and avoid guessing.
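
For instance, a minimal sketch of the dynamic-field route, assuming the schema still carries the stock rules *_i (int) and *_t (text_general) from the data_driven configset. The suffixed column names are made up for illustration, and only a few of the original fields are shown:

```xml
<!-- data-config.xml (sketch): column names carry a type suffix so that each
     one matches a dynamicField rule instead of relying on field guessing.
     Assumed rules in the schema (check your own managed-schema):
       *_i => int, *_t => text_general
     The suffixed names below are illustrative, not from the original config. -->
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="post"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/posts/row/"
            url="${dataimporter.request.dataurl}"
            transformer="HTMLStripTransformer">
      <!-- id is already defined in the schema, so it keeps its name -->
      <field column="id"           xpath="/posts/row/@Id" />
      <field column="postTypeId_i" xpath="/posts/row/@PostTypeId" />
      <field column="viewCount_i"  xpath="/posts/row/@ViewCount" />
      <field column="title_t"      xpath="/posts/row/@Title" />
      <field column="body_t"       xpath="/posts/row/@Body" stripHTML="true" />
    </entity>
  </document>
</dataConfig>
```

With names like these, each column resolves to a dynamic field, so DIH should find it in the schema rather than drop it.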

Or you could use the Schema API to predefine the fields in the schema.

Or use the POST method with the XSLT preprocessor (yes, Solr has that too
somewhere).
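
A rough sketch of the XSLT route, assuming the same /posts/row input as in the DIH config. The stylesheet name posts-to-solr.xsl is hypothetical. As far as I recall, it would live under conf/xslt/ and be selected with a tr=posts-to-solr.xsl parameter on a POST to /update, in which case the schemaless chain configured for /update/** should apply. Only a few attributes are mapped here:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- conf/xslt/posts-to-solr.xsl (hypothetical name): turns <posts><row .../></posts>
     into Solr's <add><doc><field/></doc></add> update format. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes" />

  <!-- One <add> wrapper for the whole file -->
  <xsl:template match="/posts">
    <add>
      <xsl:apply-templates select="row" />
    </add>
  </xsl:template>

  <!-- One <doc> per row; selected attributes become fields -->
  <xsl:template match="row">
    <doc>
      <field name="id"><xsl:value-of select="@Id" /></field>
      <!-- Optional attributes are emitted only when present -->
      <xsl:if test="@Title">
        <field name="title"><xsl:value-of select="@Title" /></field>
      </xsl:if>
      <xsl:if test="@Body">
        <field name="body"><xsl:value-of select="@Body" /></field>
      </xsl:if>
      <xsl:if test="@ViewCount">
        <field name="viewCount"><xsl:value-of select="@ViewCount" /></field>
      </xsl:if>
    </doc>
  </xsl:template>
</xsl:stylesheet>
```

Unlike the DIH config, this sketch does not strip HTML from Body; that would have to be handled elsewhere.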

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 10 August 2016 at 18:42, Pierre Caserta <pierre.caserta@gmail.com> wrote:
> I am rebuilding a new docker image with each change on the config file so solr starts fresh every time.
>
>   <requestHandler name="/dataimport" initParams="myInitParams" class="solr.DataImportHandler">
>       <lst name="defaults">
>         <str name="update.chain">add-unknown-fields-to-the-schema</str>
>         <str name="config">solr-data-config.xml</str>
>       </lst>
>   </requestHandler>
>
> I am still getting documents like this:
>
> "response":{"numFound":8,"start":0,"docs":[
>       {
>         "id":"38822",
>         "_version_":1542264667720646656},
>       {
>
> If I add the Body field using the Schema section of the Admin UI, this field gets indexed during the dataimport.
> It seems that solr.DataImportHandler does not allow the add-unknown-fields-to-the-schema update.chain.
>
> Pierre
>
>> On 10 Aug 2016, at 18:33, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>>
>> Ok, to reduce the magic, you can just stick the "update.chain" parameter
>> inside the defaults of the dataimport handler directly.
>>
>> You can also pass it just as a URL parameter. That's what the 'defaults'
>> section means.
>>
>> And, just to be paranoid, you did reload the core after each of those
>> changes to test it? These are not picked up automatically.
>>
>> Regards,
>>    Alex.
>> ----
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>> On 10 August 2016 at 18:28, Pierre Caserta <pierre.caserta@gmail.com> wrote:
>>> It did not work,
>>> I tried many things and ended up trying this:
>>>
>>>  <requestHandler name="/dataimport" initParams="myInitParams" class="solr.DataImportHandler">
>>>      <lst name="defaults">
>>>        <str name="config">solr-data-config.xml</str>
>>>      </lst>
>>>  </requestHandler>
>>>  <initParams name="myInitParams" path="/update/**,/dataimport">
>>>    <lst name="defaults">
>>>      <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>>    </lst>
>>>  </initParams>
>>>
>>> Regards,
>>> Pierre
>>>
>>>> On 10 Aug 2016, at 18:08, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>>>>
>>>> Your initParams section does not apply to /dataimport handler as
>>>> defined. Try modifying it to say:
>>>> path="/update/**,/dataimport"
>>>>
>>>> Hopefully, that's all it takes.
>>>>
>>>> Managed schema is enabled by default, but schemaless mode is the next
>>>> layer on top. With managed schema, you can use the API to add your
>>>> fields (or new Admin UI in the Schema screen). With schemaless mode,
>>>> it tries to guess the field type as it adds it automatically.
>>>>
>>>>
>>>> Regards,
>>>>   Alex.
>>>>
>>>> ----
>>>> Newsletter and resources for Solr beginners and intermediates:
>>>> http://www.solr-start.com/
>>>>
>>>>
>>>> On 10 August 2016 at 18:04, Pierre Caserta <pierre.caserta@gmail.com> wrote:
>>>>> Hi Alex,
>>>>> thanks for your answer.
>>>>>
>>>>> Yes my solrconfig.xml contains the add-unknown-fields-to-the-schema.
>>>>>
>>>>> <initParams path="/update/**">
>>>>>   <lst name="defaults">
>>>>>     <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>>>>   </lst>
>>>>> </initParams>
>>>>>
>>>>> I created my core using this command:
>>>>>
>>>>> curl 'http://192.168.99.100:8999/solr/admin/cores?action=CREATE&name=solrexchange&instanceDir=/opt/solr/server/solr/solrexchange&configSet=data_driven_schema_configs_custom'
>>>>>
>>>>> I am using the example configset data_driven_schema_configs and I simply added:
>>>>>
>>>>> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
>>>>> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>>>>>     <lst name="defaults">
>>>>>       <str name="config">data-config.xml</str>
>>>>>     </lst>
>>>>> </requestHandler>
>>>>>
>>>>> I thought the schemaless mode was enabled by default, but I also tried adding this config and I get the same result.
>>>>>
>>>>> <schemaFactory class="ManagedIndexSchemaFactory">
>>>>>   <bool name="mutable">true</bool>
>>>>>   <str name="managedSchemaResourceName">managed-schema</str>
>>>>> </schemaFactory>
>>>>>
>>>>> How can I update my schemaless URP chain and add the parameter to call it to DIH?
>>>>>
>>>>>
>>>>>> On 10 Aug 2016, at 17:43, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
>>>>>>
>>>>>> Do you have the actual fields defined? If not, then I am guessing that
>>>>>> your 'post' test was against a different collection that had
>>>>>> schemaless mode enabled and your DIH one is against one where
>>>>>> schemaless mode is not enabled (look for
>>>>>> 'add-unknown-fields-to-the-schema' in the solrconfig.xml to confirm).
>>>>>> Solr examples for DIH do not have schemaless mode enabled.
>>>>>>
>>>>>> I _believe_ you can copy the schemaless URP chain and add the
>>>>>> parameter that calls it to the DIH handler, and it _should_ work. But I am not
>>>>>> betting on it without testing, as DIH also has some magic code to
>>>>>> ignore fields not defined in the schema, because it is designed to
>>>>>> extract only the relevant fields from the database even with a
>>>>>> 'select *' statement.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Alex.
>>>>>> ----
>>>>>> Newsletter and resources for Solr beginners and intermediates:
>>>>>> http://www.solr-start.com/
>>>>>>
>>>>>>
>>>>>> On 10 August 2016 at 17:12, Pierre Caserta <pierre.caserta@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>> It seems that using the DataImportHandler with an XPathEntityProcessor config
>>>>>>> and a managed-schema setup only imports the id and version fields.
>>>>>>>
>>>>>>> data-config.xml
>>>>>>>
>>>>>>> <dataConfig>
>>>>>>>  <dataSource type="FileDataSource" encoding="UTF-8" />
>>>>>>>  <document>
>>>>>>>      <entity name="post"
>>>>>>>          processor="XPathEntityProcessor"
>>>>>>>          stream="true"
>>>>>>>          forEach="/posts/row/"
>>>>>>>          url="${dataimporter.request.dataurl}"
>>>>>>>          transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
>>>>>>>          <field column="id" xpath="/posts/row/@Id" />
>>>>>>>          <field column="postTypeId" xpath="/posts/row/@PostTypeId" />
>>>>>>>          <field column="acceptedAnswerId" xpath="/posts/row/@AcceptedAnswerId" />
>>>>>>>          <field column="creationDate" xpath="/posts/row/@CreationDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>          <field column="postScore" xpath="/posts/row/@Score" />
>>>>>>>          <field column="viewCount" xpath="/posts/row/@ViewCount" />
>>>>>>>          <field column="body" xpath="/posts/row/@Body" stripHTML="true" />
>>>>>>>          <field column="ownerUserId" xpath="/posts/row/@OwnerUserId" />
>>>>>>>          <field column="lastEditorUserId" xpath="/posts/row/@LastEditorUserId" />
>>>>>>>          <field column="lastEditorDisplayName" xpath="/posts/row/@LastEditorDisplayName" />
>>>>>>>          <field column="lastActivityDate" xpath="/posts/row/@LastActivityDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>          <field column="title" xpath="/posts/row/@Title" />
>>>>>>>          <field column="trimmedTags" xpath="/posts/row/@Tags" regex="&lt;(.*)&gt;" />
>>>>>>>          <field column="tags" sourceColName="trimmedTags" splitBy="&gt;&lt;" />
>>>>>>>          <field column="answerCount" xpath="/posts/row/@AnswerCount" />
>>>>>>>          <field column="commentCount" xpath="/posts/row/@CommentCount" />
>>>>>>>          <field column="favoriteCount" xpath="/posts/row/@FavoriteCount" />
>>>>>>>          <field column="communityOwnedDate" xpath="/posts/row/@CommunityOwnedDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>      </entity>
>>>>>>>  </document>
>>>>>>> </dataConfig>
>>>>>>>
>>>>>>>
>>>>>>> http://192.168.99.100:8999/solr/solrexchange/select?indent=on&q=*:*&wt=json
>>>>>>> {
>>>>>>> "responseHeader":{
>>>>>>>  "status":0,
>>>>>>>  "QTime":0,
>>>>>>>  "params":{
>>>>>>>    "q":"*:*",
>>>>>>>    "indent":"on",
>>>>>>>    "wt":"json",
>>>>>>>    "_":"1470811193595"}},
>>>>>>> "response":{"numFound":8,"start":0,"docs":[
>>>>>>>    {
>>>>>>>      "id":"38822",
>>>>>>>      "_version_":1542258196375142400},
>>>>>>>    {
>>>>>>>      "id":"38836",
>>>>>>>      "_version_":1542258196387725312},
>>>>>>>    {
>>>>>>>      "id":"63896",
>>>>>>>      "_version_":1542258196388773888},
>>>>>>>    {
>>>>>>>      "id":"65406",
>>>>>>>      "_version_":1542258196391919616},
>>>>>>>    {
>>>>>>>      "id":"1357173",
>>>>>>>      "_version_":1542258196391919617},
>>>>>>>    {
>>>>>>>      "id":"5339763",
>>>>>>>      "_version_":1542258196392968192},
>>>>>>>    {
>>>>>>>      "id":"9932722",
>>>>>>>      "_version_":1542258196392968193},
>>>>>>>    {
>>>>>>>      "id":"9217299",
>>>>>>>      "_version_":1542258196392968194}]
>>>>>>> }}
>>>>>>>
>>>>>>> data_search.xml (8 rows)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> the url I am hitting (with custom dataurl parameter)
>>>>>>>
>>>>>>> curl 'http://192.168.99.100:8999/solr/solrexchange/dataimport?command=full-import&commit=true&dataurl=/code/solr/data/search/dih/data_search.xml'
>>>>>>>
>>>>>>> I changed my data to use <add> <doc> <field> and used the bin/post tool, and
>>>>>>> this is working as expected.
>>>>>>> Now I am interested in making it work with the DataImportHandler.
>>>>>>> How can I use the DataImportHandler to import my documents?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Pierre Caserta
>>>>>>>
>>>>>>>
>>>>>
>>>
>
