manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: PostgreSQL version to support MCF v2.10
Date Tue, 04 Sep 2018 18:33:35 GMT
Let's make sure we're talking about the same thing.

Here is the output connector method that receives the ID (as the
documentURI parameter):

  public int addOrReplaceDocumentWithException(String documentURI,
    VersionContext pipelineDescription, RepositoryDocument document,
    String authorityNameString, IOutputAddActivity activities)
    throws ManifoldCFException, ServiceInterruption, IOException;

ManifoldCF doesn't say anywhere that this ID is case insensitive.  If you
make it case insensitive in an output connector, this will potentially
break a lot of things, for example incremental indexing (which organizes
the last indexed version by document ID).
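
As a toy illustration of that risk (not ManifoldCF code), here is a sketch in which two plain maps stand in for the framework's record of indexed documents and for the search index. The class and field names are made up for the example, and the assumption is that the output connector lower-cases IDs while the framework keeps the exact ID.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model only: neither map corresponds to a real ManifoldCF data structure.
final class CaseFoldingDemo {
  // What the framework believes is indexed: exact document ID -> last indexed version.
  final Map<String, String> frameworkRecord = new HashMap<>();
  // What the index actually holds when the output connector lower-cases IDs.
  final Map<String, String> index = new HashMap<>();

  void addOrReplace(String documentId, String version) {
    frameworkRecord.put(documentId, version);
    index.put(documentId.toLowerCase(Locale.ROOT), version);
  }

  void remove(String documentId) {
    frameworkRecord.remove(documentId);
    index.remove(documentId.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    CaseFoldingDemo demo = new CaseFoldingDemo();
    // Two distinct documents on a case-sensitive server.
    demo.addOrReplace("http://host/Readme", "v1");
    demo.addOrReplace("http://host/README", "v7");
    // Removing one of them also removes the other's only index entry.
    demo.remove("http://host/Readme");
    System.out.println("framework record: " + demo.frameworkRecord.keySet());
    System.out.println("index entries: " + demo.index.size());
  }
}
```

In this model the framework still lists http://host/README as indexed at v7, yet the index is empty, so the two sides no longer agree.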

I therefore highly recommend that any "sloppiness" in this parameter be
addressed in the Repository Connector that constructs it.  If the connector
is crawling a repository that treats URLs as case insensitive, then
it should map these IDs to lower case.  If not, then it shouldn't.
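
A minimal sketch of that recommendation, assuming a hypothetical helper that a repository connector would call while building its document identifiers. DocumentIdNormalizer and the caseInsensitiveRepository flag are illustrative names, not part of the ManifoldCF API.

```java
import java.util.Locale;

// Hypothetical helper, not part of ManifoldCF: a repository connector could
// call this when it constructs the document ID it hands to the framework.
final class DocumentIdNormalizer {
  private DocumentIdNormalizer() {}

  // caseInsensitiveRepository would come from connector configuration
  // (for example a checkbox), and should be true only when the crawled
  // repository itself ignores case in its URLs.
  static String normalize(String documentId, boolean caseInsensitiveRepository) {
    if (!caseInsensitiveRepository) {
      // Case matters to this repository, so the ID is passed through untouched.
      return documentId;
    }
    // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
    return documentId.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String a = normalize("http://host/Docs/Report.PDF", true);
    String b = normalize("http://HOST/docs/report.pdf", true);
    System.out.println(a.equals(b));
  }
}
```

Because the ID is already folded before the framework sees it, the framework's bookkeeping and every output connector work from the same identifier.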

Karl


On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net> wrote:

> Hi Karl.
> The issue is that the ES Output Connector uses the uri to create the _id.
> When used with IIS which allows case variation in the URI, it creates
> multiple documents. Clients on Windows IIS are rarely cognizant of that
> issue as IIS is so lax in policing that OTB.
> Currently, every case variation in URI results in a new doc in the index.
> This is only in the ES output connector.
> I can add an optional checkbox to determine that particular action if
> that would help?
> Regards,
> Steph
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896    steph@remcam.net   http://remcam.net
> Skype: svanschalkwyk
> <http://linkedin.com/in/vanschalkwyk>
>
> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Thanks for the update.
>> Lower-casing the ID would be fine except there are some connectors that
>> care about case.  The web connector is one such because it's up to the web
>> service to decide if case matters, so the web connector does not view URLs
>> with case differences as being the same.  Other connectors will likely
>> care as well. So I don't think lower-casing the document ID is a smart
>> thing to do.
>>
>> You could add this bit of configuration to the web connector, if that's
>> what you are using, or to whatever other connector constructs the ID.
>>
>> Karl
>>
>>
>>
>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net>
>> wrote:
>>
>>> Thanks Karl.
>>>
>>> I'll look into that.
>>>
>>> Another note:
>>> Regarding the ES connector - I have made three additions to it and should
>>> probably diff them for inclusion after approval:
>>> 1. Lowercased the _id (the doc URI).
>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy sources,
>>> particularly IIS...)
>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does not
>>> allow access to _id in the schema anymore, so no copy_field etc. from _id).
>>> Hence "url".
>>>
>>> Regards,
>>> Steph
>>>
>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>> need to upgrade it.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <steph@remcam.net>
>>>> wrote:
>>>>
>>>>> Olivier
>>>>> By all means.
>>>>> The only issue I have seen (totally unrelated) is with Jetty, which
>>>>> has to be restarted about once a week. Still trying to find the issue.
>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres10 may
>>>>> be a bit slower. I have no empirical evidence at the moment as I'm still
>>>>> delivering the project to UAT. Will keep you posted.
>>>>> Regards,
>>>>> Steph
>>>>>
>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>> olivier.tavard@francelabs.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for the
>>>>>> late answer). I will test it soon.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> Olivier TAVARD
>>>>>>
>>>>>>
>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> These are the rpm installs:
>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>
>>>>>> postgresql_version: 10
>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>> postgresql_daemon: postgresql-10.service
>>>>>> postgresql_packages:
>>>>>> - postgresql10-libs
>>>>>> - postgresql10
>>>>>> - postgresql10-server
>>>>>> - postgresql10-contrib
>>>>>> # - postgresql10-devel
>>>>>>
>>>>>> postgresql_hba_entries:
>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>
>>>>>> postgresql_global_config_options:
>>>>>> - option: unix_socket_directories
>>>>>>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>
>>>>>> - option: standard_conforming_strings
>>>>>>   value: 'on'
>>>>>>
>>>>>> - option: shared_buffers
>>>>>>   value: '1024MB'
>>>>>>
>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>> # checkpoint_segments=300
>>>>>> - option: max_wal_size
>>>>>>   value: '14400MB'
>>>>>>
>>>>>> - option: min_wal_size
>>>>>>   value: '80MB'
>>>>>>
>>>>>> - option: maintenance_work_mem
>>>>>>   value: '2MB'
>>>>>>
>>>>>> - option: listen_addresses
>>>>>>   value: '*'
>>>>>>
>>>>>> - option: max_connections
>>>>>>   value: '400'
>>>>>>
>>>>>> - option: checkpoint_timeout
>>>>>>   value: '900'
>>>>>>
>>>>>> - option: datestyle
>>>>>>   value: "iso, mdy"
>>>>>>
>>>>>> - option: autovacuum
>>>>>>   value: 'off'
>>>>>>
>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy
>>>>>> # vacuum every night)
>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>   cron:
>>>>>>     name: lazy_vacuum
>>>>>>     hour: 8
>>>>>>     minute: 0
>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>> - name: add postgresql cron full vacuum
>>>>>>   cron:
>>>>>>     name: full_vacuum
>>>>>>     weekday: 0
>>>>>>     hour: 10
>>>>>>     minute: 0
>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>> # re-index all databases once a week
>>>>>> - name: add postgresql cron reindex
>>>>>>   cron:
>>>>>>     name: reindex
>>>>>>     weekday: 0
>>>>>>     hour: 12
>>>>>>     minute: 0
>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>
>>>>>>
>>>>>> This is how I run 2.10.
>>>>>> Been running fine for some weeks without user intervention.
>>>>>> @Karl: Any comments please?
>>>>>> Steph
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>
