manifoldcf-user mailing list archives

From Steph van Schalkwyk <st...@remcam.net>
Subject Re: PostgreSQL version to support MCF v2.10
Date Tue, 04 Sep 2018 18:46:10 GMT
Unless I have a massive misunderstanding somewhere...




*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    steph@remcam.net   http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net>
wrote:

> Hi Karl
> I'm addressing it in the ES Output Connector.
> Not touching the framework :)
> S
>
> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Let's make sure we're talking about the same thing.
>>
>> Here is the output connector method that receives the ID (as the
>> documentURI parameter):
>>
>>   public int addOrReplaceDocumentWithException(String documentURI,
>>       VersionContext pipelineDescription, RepositoryDocument document,
>>       String authorityNameString, IOutputAddActivity activities)
>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>
>> ManifoldCF doesn't say anywhere that this ID is case insensitive.  If you
>> make it case insensitive in an output connector, this will potentially
>> break a lot of things, for example incremental indexing (which organizes
>> the last indexed version by document ID).
>>
>> I therefore highly recommend that any "sloppiness" in this parameter be
>> addressed in the Repository Connector that constructs it.  If the connector
>> is crawling a repository that treats URLs as case insensitive, then
>> it should map these IDs to lower case.  If not, then it shouldn't.
>>
>> Karl
>>
>>
>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
>> wrote:
>>
>>> Hi Karl.
>>> The issue is that the ES Output Connector uses the URI to create the
>>> _id. When used with IIS, which allows case variation in the URI, it
>>> creates multiple documents. Clients on Windows IIS are rarely cognizant
>>> of that issue, as IIS is so lax in policing that out of the box.
>>> Currently, every case variation in a URI results in a new doc in the
>>> index. This is only in the ES output connector.
>>> I can add an optional checkbox to control that particular behaviour,
>>> if that would help?
>>> Regards,
>>> Steph
>>>
>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Thanks for the update.
>>>> Lower-casing the ID would be fine except that some connectors care
>>>> about case.  The web connector is one such, because it's up to the web
>>>> service to decide whether case matters, so the web connector does not
>>>> treat URLs that differ only in case as the same.  Other connectors will
>>>> likely care as well. So I don't think lower-casing the document ID is a
>>>> smart thing to do.
>>>>
>>>> You could add this bit of configuration to the web connector, if that's
>>>> what you are using, or to whatever other connector constructs the ID.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net>
>>>> wrote:
>>>>
>>>>> Thanks Karl.
>>>>>
>>>>> I'll look into that.
>>>>>
>>>>> Another note:
>>>>> Regarding the ES connector - I have made three additions to it and
>>>>> should probably diff them for inclusion after approval:
>>>>> 1. Lowercased the _id (the doc URI).
>>>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy sources,
>>>>> particularly IIS...)
>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does
>>>>> not allow access to _id in the schema anymore, so no copy_field etc.
>>>>> from _id). Hence "url".
>>>>>
>>>>> Regards,
>>>>> Steph
>>>>>
>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we may
>>>>>> need to upgrade it.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <steph@remcam.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Olivier
>>>>>>> By all means.
>>>>>>> The only issue I have seen (totally unrelated) is with Jetty, which
>>>>>>> has to be restarted about once a week. Still trying to find the
>>>>>>> issue.
>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres 10
>>>>>>> may be a bit slower. I have no empirical evidence at the moment, as
>>>>>>> I'm still delivering the project to UAT. Will keep you posted.
>>>>>>> Regards,
>>>>>>> Steph
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>>> olivier.tavard@francelabs.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry for
>>>>>>>> the late answer). I will test it soon.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>>
>>>>>>>> Olivier TAVARD
>>>>>>>>
>>>>>>>>
>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> These are the rpm installs:
>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>
>>>>>>>> postgresql_version: 10
>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>> postgresql_packages:
>>>>>>>>   - postgresql10-libs
>>>>>>>>   - postgresql10
>>>>>>>>   - postgresql10-server
>>>>>>>>   - postgresql10-contrib
>>>>>>>>   # - postgresql10-devel
>>>>>>>>
>>>>>>>> postgresql_hba_entries:
>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>
>>>>>>>> postgresql_global_config_options:
>>>>>>>>   - option: unix_socket_directories
>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>
>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>     value: 'on'
>>>>>>>>
>>>>>>>>   - option: shared_buffers
>>>>>>>>     value: '1024MB'
>>>>>>>>
>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>   # checkpoint_segments=300
>>>>>>>>   - option: max_wal_size
>>>>>>>>     value: '14400MB'
>>>>>>>>
>>>>>>>>   - option: min_wal_size
>>>>>>>>     value: '80MB'
>>>>>>>>
>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>     value: '2MB'
>>>>>>>>
>>>>>>>>   - option: listen_addresses
>>>>>>>>     value: '*'
>>>>>>>>
>>>>>>>>   - option: max_connections
>>>>>>>>     value: '400'
>>>>>>>>
>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>     value: '900'
>>>>>>>>
>>>>>>>>   - option: datestyle
>>>>>>>>     value: "iso, mdy"
>>>>>>>>
>>>>>>>>   - option: autovacuum
>>>>>>>>     value: 'off'
>>>>>>>>
>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night,
>>>>>>>> # lazy vacuum every night)
>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>   cron:
>>>>>>>>     name: lazy_vacuum
>>>>>>>>     hour: 8
>>>>>>>>     minute: 0
>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>   cron:
>>>>>>>>     name: full_vacuum
>>>>>>>>     weekday: 0
>>>>>>>>     hour: 10
>>>>>>>>     minute: 0
>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>> # re-index all databases once a week
>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>   cron:
>>>>>>>>     name: reindex
>>>>>>>>     weekday: 0
>>>>>>>>     hour: 12
>>>>>>>>     minute: 0
>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>
>>>>>>>>
>>>>>>>> This is how I run 2.10.
>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>> @Karl: Any comments please?
>>>>>>>> Steph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>
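For reference, the two ID clean-ups discussed in the thread (lower-casing the document URI and collapsing doubled "/") could be sketched roughly as below. This is only an illustration, not the actual ES connector change: the class name DocumentIdNormalizer, the method normalize, and the lowerCase flag (standing in for the optional checkbox mentioned above) are all hypothetical, and nothing here is part of the ManifoldCF API. Per Karl's advice, a helper like this would most naturally be called where the document ID is constructed.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of ManifoldCF): normalizes a document URI
 * before it is used as an ID, for sources such as IIS that treat URLs as
 * case insensitive and tolerate doubled slashes.
 */
final class DocumentIdNormalizer {
  private DocumentIdNormalizer() {}

  /**
   * @param uri       the raw document URI, may be null
   * @param lowerCase assumed opt-in flag; when false, case is preserved
   * @return the normalized URI, or null if the input was null
   */
  static String normalize(String uri, boolean lowerCase) {
    if (uri == null) {
      return null;
    }
    // Keep the "scheme://" prefix intact so its double slash survives.
    int schemeEnd = uri.indexOf("://");
    String prefix = schemeEnd >= 0 ? uri.substring(0, schemeEnd + 3) : "";
    String rest = schemeEnd >= 0 ? uri.substring(schemeEnd + 3) : uri;
    // Collapse every run of two or more '/' into one. Note that this
    // simple version also touches slashes inside a query string.
    String result = prefix + rest.replaceAll("/{2,}", "/");
    // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
    return lowerCase ? result.toLowerCase(Locale.ROOT) : result;
  }
}
```

With the flag off, only the slash clean-up is applied, so connectors whose repositories care about case keep distinct IDs.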
