manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: PostgreSQL version to support MCF v2.10
Date Wed, 05 Sep 2018 11:33:39 GMT
The patch I uploaded doesn't work because the entire tab is broken; looks
like the UI refactoring broke it and it was never reported.  Fixing now.
Karl


On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <daddywri@gmail.com> wrote:

> I coded up the web connector feature I think we need.  See
> CONNECTORS-1528; I've attached a patch.  Please apply and test it out to
> see if it solves the case problem for your IIS site.
>
> For the "//" issue, can you be more specific about the mapping you need to
> do?
>
> Karl
>
>
> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Steph,
>>
>> Right, you wouldn't want to touch the framework.
>>
>> The effect of lower-casing the documentURI parameter in the
>> addOrReplaceDocumentWithException method in an output connector would be to
>> map multiple, independently-fetched, documents that differ only by the case
>> of the URL together into one document in the index.  The ManifoldCF
>> assumption is that a document with a certain URI can be tracked in the
>> index using exactly that URI.  Mapping the URI to lower case would break
>> that assumption so the framework would make the wrong decision in many
>> cases.
>>
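The mismatch described above can be shown with a toy model. This is only a sketch under stated assumptions: the class `CaseFoldingMismatch` and its two maps are hypothetical stand-ins, and ManifoldCF's real bookkeeping is considerably more involved. It assumes the framework tracks documents by exact URI while an output connector lower-cases the ID before writing to the index.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model only (hypothetical names); not ManifoldCF code. */
final class CaseFoldingMismatch {
  // What the framework believes is indexed: exact URI -> last indexed version.
  final Map<String, String> tracked = new HashMap<>();
  // What the index actually holds when the output connector lower-cases the ID.
  final Map<String, String> index = new HashMap<>();

  void addOrReplace(String documentURI, String version) {
    tracked.put(documentURI, version);
    index.put(documentURI.toLowerCase(Locale.ROOT), version);
  }

  void remove(String documentURI) {
    tracked.remove(documentURI);
    index.remove(documentURI.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    CaseFoldingMismatch m = new CaseFoldingMismatch();
    // Two independently fetched documents that differ only by case.
    m.addOrReplace("http://host/Page.aspx", "v1");
    m.addOrReplace("http://host/page.aspx", "v1");
    // Removing one of them also removes the shared index entry.
    m.remove("http://host/Page.aspx");
    System.out.println("tracked=" + m.tracked.size() + " indexed=" + m.index.size());
  }
}
```

In this model the framework still believes one document is indexed while the index holds none, which is the kind of wrong decision the paragraph above warns about.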
>> If you are picking up documents using the web connector, therefore, and
>> you are getting duplicate documents because the document URLs are sloppy,
>> it is essential that INSTEAD of mapping the document URI to lower
>> case in the output connector, you map to lower case in the repository
>> connector.  Otherwise the framework will not work right.
>>
>> There is a tab in the web connector that allows you to configure URL
>> normalization, called "Canonicalization".  This would be a very appropriate
>> place to add URL mapping to lower case.  It should be as simple as adding
>> one more checkbox column in the table, and modifying the method that does
>> the URL processing to include lower-casing.
>>
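As a rough illustration of the lower-casing step suggested above, here is a minimal sketch. The helper `UrlCaseFolding.canonicalize` and its `lowerCase` flag are hypothetical and are not the web connector's actual method or configuration; the flag merely stands in for the proposed checkbox.

```java
import java.util.Locale;

/** Hypothetical helper; not part of the ManifoldCF web connector. */
final class UrlCaseFolding {
  private UrlCaseFolding() {}

  /**
   * Returns the URL unchanged unless lower-casing is enabled, in which case
   * the whole string is folded to lower case with a locale-independent rule.
   */
  static String canonicalize(String url, boolean lowerCase) {
    return lowerCase ? url.toLowerCase(Locale.ROOT) : url;
  }

  public static void main(String[] args) {
    String url = "http://IIS.example.com/Docs/Page.ASPX";
    System.out.println(canonicalize(url, true));
    System.out.println(canonicalize(url, false));
  }
}
```

Folding the whole string also lower-cases any query parameters, which may or may not be wanted for a given site; `Locale.ROOT` keeps the result independent of the JVM's default locale.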
>> Karl
>>
>>
>>
>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <steph@remcam.net>
>> wrote:
>>
>>> Unless I have a massive misunderstanding somewhere...
>>>
>>>
>>>
>>>
>>> *Steph van Schalkwyk*
>>> Principal, Remcam Search Engines
>>> +1.314.452.2896    steph@remcam.net
>>> http://remcam.net <http://www.remcam.net/> Skype: svanschalkwyk
>>> <http://linkedin.com/in/vanschalkwyk>
>>>
>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net>
>>> wrote:
>>>
>>>> Hi Karl
>>>> I'm addressing it in the ES Output Connector.
>>>> Not touching the framework :)
>>>> S
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> Let's make sure we're talking about the same thing.
>>>>>
>>>>> Here is the output connector method that receives the ID (as the
>>>>> documentURI parameter):
>>>>>
>>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>> VersionContext pipelineDescription, RepositoryDocument document, String
>>>>> authorityNameString, IOutputAddActivity activities)
>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>
>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.  If
>>>>> you make it case insensitive in an output connector, this will potentially
>>>>> break a lot of things, for example incremental indexing (which organizes
>>>>> the last indexed version by document ID).
>>>>>
>>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>>> be addressed in the Repository Connector that constructs it.  If the
>>>>> connector is crawling a repository that believes that URLs are case
>>>>> insensitive then it should map these IDs to lower case.  If not, then
>>>>> it shouldn't.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl.
>>>>>> The issue is that the ES Output Connector uses the URI to create the
>>>>>> _id. When used with IIS, which allows case variation in the URI, it
>>>>>> creates multiple documents. Clients on Windows IIS are rarely cognizant
>>>>>> of that issue, as IIS is so lax in policing that out of the box.
>>>>>> Currently, every case variation in the URI results in a new doc in the
>>>>>> index. This is only in the ES output connector.
>>>>>> I can add an optional checkbox to determine that particular action
>>>>>> if that would help?
>>>>>> Regards,
>>>>>> Steph
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the update.
>>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>>> that care about case.  The web connector is one such because it's up
>>>>>>> to the web service to decide if case matters, so the web connector does
>>>>>>> not view URLs with case differences as being the same.  Other connectors
>>>>>>> will likely care as well. So I don't think lower-casing the document
>>>>>>> ID is a smart thing to do.
>>>>>>>
>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>> the ID.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <
>>>>>>> steph@remcam.net> wrote:
>>>>>>>
>>>>>>>> Thanks Karl.
>>>>>>>>
>>>>>>>> I'll look into that.
>>>>>>>>
>>>>>>>> Another note:
>>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>> 1. Lowercased _id (the doc URI).
>>>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources,
>>>>>>>> particularly IIS...)
>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x does
>>>>>>>> not allow access to _id in the schema anymore, so no copy_field etc.
>>>>>>>> from _id). Hence "url".
>>>>>>>>
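The exact "//" mapping is not spelled out in the thread, so the following is only one plausible reading: collapse runs of slashes in the host-and-path part of the URL while leaving the "://" after the scheme and anything after "?" or "#" untouched. The class `SlashCollapser` is a hypothetical name used for illustration.

```java
/** Hypothetical helper illustrating one possible "//" mapping. */
final class SlashCollapser {
  private SlashCollapser() {}

  static String collapseSlashes(String url) {
    // The path ends at the first '?' or '#'; anything after it is left as-is.
    int end = url.length();
    for (int i = 0; i < url.length(); i++) {
      char c = url.charAt(i);
      if (c == '?' || c == '#') {
        end = i;
        break;
      }
    }
    // Skip the "://" that follows the scheme so it is not collapsed.
    int scheme = url.indexOf("://");
    int start = (scheme >= 0 && scheme < end) ? scheme + 3 : 0;
    return url.substring(0, start)
        + url.substring(start, end).replaceAll("/{2,}", "/")
        + url.substring(end);
  }

  public static void main(String[] args) {
    System.out.println(collapseSlashes("http://host//a///b.html?x=//y"));
  }
}
```

Whether such a mapping is safe depends on the server treating "/a//b" and "/a/b" as the same resource, which is an assumption about the crawled site rather than something URLs guarantee.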
>>>>>>>> Regards,
>>>>>>>> Steph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>> may need to upgrade it.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <
>>>>>>>>> steph@remcam.net> wrote:
>>>>>>>>>
>>>>>>>>>> Olivier
>>>>>>>>>> By all means.
>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>> which has to be restarted about once a week. Still trying to find the
>>>>>>>>>> issue.
>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres 10
>>>>>>>>>> may be a bit slower. I have no empirical evidence at the moment as I'm
>>>>>>>>>> still delivering the project to UAT. Will keep you posted.
>>>>>>>>>> Regards,
>>>>>>>>>> Steph
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <
>>>>>>>>>> olivier.tavard@francelabs.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 23 Aug 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>
>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>> - postgresql10-libs
>>>>>>>>>>> - postgresql10
>>>>>>>>>>> - postgresql10-server
>>>>>>>>>>> - postgresql10-contrib
>>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>>
>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>
>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>> - option: unix_socket_directories
>>>>>>>>>>>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>
>>>>>>>>>>> - option: standard_conforming_strings
>>>>>>>>>>>   value: 'on'
>>>>>>>>>>>
>>>>>>>>>>> - option: shared_buffers
>>>>>>>>>>>   value: '1024MB'
>>>>>>>>>>>
>>>>>>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>> # checkpoint_segments=300
>>>>>>>>>>> - option: max_wal_size
>>>>>>>>>>>   value: '14400MB'
>>>>>>>>>>>
>>>>>>>>>>> - option: min_wal_size
>>>>>>>>>>>   value: '80MB'
>>>>>>>>>>>
>>>>>>>>>>> - option: maintenance_work_mem
>>>>>>>>>>>   value: '2MB'
>>>>>>>>>>>
>>>>>>>>>>> - option: listen_addresses
>>>>>>>>>>>   value: '*'
>>>>>>>>>>>
>>>>>>>>>>> - option: max_connections
>>>>>>>>>>>   value: '400'
>>>>>>>>>>>
>>>>>>>>>>> - option: checkpoint_timeout
>>>>>>>>>>>   value: '900'
>>>>>>>>>>>
>>>>>>>>>>> - option: datestyle
>>>>>>>>>>>   value: "iso, mdy"
>>>>>>>>>>>
>>>>>>>>>>> - option: autovacuum
>>>>>>>>>>>   value: 'off'
>>>>>>>>>>>
>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night,
>>>>>>>>>>> # lazy vacuum every night)
>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>   cron:
>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>     hour: 8
>>>>>>>>>>>     minute: 0
>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>   cron:
>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>     hour: 10
>>>>>>>>>>>     minute: 0
>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>   cron:
>>>>>>>>>>>     name: reindex
>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>     hour: 12
>>>>>>>>>>>     minute: 0
>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>
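The max_wal_size value in the configuration above follows from the formula in its own comment. A small sketch of that arithmetic, using a hypothetical class name `WalSizing` and the 16MB segment size stated in the comment:

```java
/** Worked arithmetic for the max_wal_size comment; hypothetical helper. */
final class WalSizing {
  // Segment size in MB, as given in the configuration comment.
  static final int SEGMENT_MB = 16;

  private WalSizing() {}

  /** max_wal_size = (3 * checkpoint_segments) * 16MB */
  static int maxWalSizeMb(int checkpointSegments) {
    return 3 * checkpointSegments * SEGMENT_MB;
  }

  public static void main(String[] args) {
    // checkpoint_segments=300, as in the comment
    System.out.println(maxWalSizeMb(300) + "MB"); // prints "14400MB"
  }
}
```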
>>>>>>>>>>>
>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>> Steph
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
