manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: PostgreSQL version to support MCF v2.10
Date Wed, 05 Sep 2018 16:05:53 GMT
I'm already working on the Web Connector.  The UI has problems that predate
this change and I've alerted Kishore about them -- he'll look into them
later today.

Karl


On Wed, Sep 5, 2018 at 11:55 AM Steph van Schalkwyk <steph@remcam.net>
wrote:

> Thank you Karl.
> You are of course correct in that the incremental crawl is now broken in
> that it does a full crawl every time.
> I'll jump on the Web Connector and add that functionality.
> Thanks for this excellent application and all the help over the years.
> Steph
>
>
>
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896    steph@remcam.net   http://remcam.net
> Skype: svanschalkwyk
> <http://linkedin.com/in/vanschalkwyk>
>
> On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> The patch I uploaded doesn't work because the entire tab is broken; looks
>> like the UI refactoring broke it and it was never reported.  Fixing now.
>> Karl
>>
>>
>> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> I coded up the web connector feature I think we need.  See
>>> CONNECTORS-1528; I've attached a patch.  Please apply and test it out to
>>> see if it solves the case problem for your IIS site.
>>>
>>> For the "//" issue, can you be more specific about the mapping you need
>>> to do?
>>>
>>> Karl
>>>
>>>
>>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Steph,
>>>>
>>>> Right, you wouldn't want to touch the framework.
>>>>
>>>> The effect of lower-casing the documentURI parameter in the
>>>> addOrReplaceDocumentWithException method in an output connector would be
>>>> to map multiple, independently fetched documents that differ only by the
>>>> case of the URL together into one document in the index.  The ManifoldCF
>>>> assumption is that a document with a certain URI can be tracked in the
>>>> index using exactly that URI.  Mapping the URI to lower case would break
>>>> that assumption so the framework would make the wrong decision in many
>>>> cases.
>>>>
>>>> If you are picking up documents using the web connector and you are
>>>> getting duplicate documents because the document URLs are sloppy, it is
>>>> therefore essential that INSTEAD of mapping the document URI to lower
>>>> case in the output connector, you map to lower case in the repository
>>>> connector.  Otherwise the framework will not work right.
>>>>
>>>> There is a tab in the web connector that allows you to configure URL
>>>> normalization, called "Canonicalization".  This would be a very appropriate
>>>> place to add URL mapping to lower case.  It should be as simple as adding
>>>> one more checkbox column in the table, and modifying the method that does
>>>> the URL processing to include lower-casing.
>>>>
>>>> Karl
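
A minimal sketch of the kind of URL processing described above, done on the repository-connector side before the URL becomes a document identifier. The class `UrlCanonicalizerSketch`, its method, and both flags are hypothetical names for illustration; this is not the web connector's actual code. It assumes lower-casing applies to the scheme, host, and path only, leaves the query string untouched, and assumes the URL carries no user-info part.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of the ManifoldCF API.
final class UrlCanonicalizerSketch {

    // Returns the URL with optional lower-casing and optional collapsing of
    // repeated "/" in the path. Throws IllegalArgumentException on a bad URL.
    static String canonicalize(String url, boolean lowerCase, boolean collapseSlashes) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();

        if (path != null && collapseSlashes) {
            // "//Docs//Page" -> "/Docs/Page"
            path = path.replaceAll("/{2,}", "/");
        }
        if (lowerCase) {
            // Assumption: no user-info in the authority, so lower-casing it is safe.
            if (scheme != null) scheme = scheme.toLowerCase(Locale.ROOT);
            if (authority != null) authority = authority.toLowerCase(Locale.ROOT);
            if (path != null) path = path.toLowerCase(Locale.ROOT);
        }

        StringBuilder sb = new StringBuilder();
        if (scheme != null) sb.append(scheme).append(':');
        if (authority != null) sb.append("//").append(authority);
        if (path != null) sb.append(path);
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
        return sb.toString();
    }
}
```

Because the mapping happens before the identifier is handed to the framework, every case variant of a URL becomes the same document ID everywhere, which is the property the output-connector approach cannot provide.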
>>>>
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <steph@remcam.net>
>>>> wrote:
>>>>
>>>>> Unless I have a massive misunderstanding somewhere...
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl
>>>>>> I'm addressing it in the ES Output Connector.
>>>>>> Not touching the framework :)
>>>>>> S
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Let's make sure we're talking about the same thing.
>>>>>>>
>>>>>>> Here is the output connector method that receives the ID (as the
>>>>>>> documentURI parameter):
>>>>>>>
>>>>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>>       VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>>       String authorityNameString, IOutputAddActivity activities)
>>>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>>
>>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.
>>>>>>> If you make it case insensitive in an output connector, this will
>>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>>> (which organizes the last indexed version by document ID).
>>>>>>>
>>>>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>>>>> be addressed in the Repository Connector that constructs it.  If the
>>>>>>> connector is crawling a repository that believes that URLs are case
>>>>>>> insensitive then it should map these IDs to lower case.  If not, then
>>>>>>> it shouldn't.
>>>>>>>
>>>>>>> Karl
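
To illustrate the concern above, here is a toy model (not ManifoldCF code) of what can go wrong when only the output connector lower-cases the ID. The class `IdCaseMismatchSketch` and its two maps are hypothetical: one map stands in for the framework's record of what it indexed, keyed by the exact URI, and the other stands in for the search index, keyed by the `_id` the output connector chose.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model only; the real framework bookkeeping is more involved.
final class IdCaseMismatchSketch {
    // What the framework believes is indexed: exact URI -> version string.
    final Map<String, String> tracked = new HashMap<>();
    // What the index actually holds: _id -> version string.
    final Map<String, String> index = new HashMap<>();
    private final boolean lowerCaseInOutput;

    IdCaseMismatchSketch(boolean lowerCaseInOutput) {
        this.lowerCaseInOutput = lowerCaseInOutput;
    }

    // The framework records the exact URI; the output side may rewrite it.
    void add(String uri, String version) {
        tracked.put(uri, version);
        index.put(outputId(uri), version);
    }

    // Removing one tracked URI removes whatever index entry its ID maps to.
    void remove(String uri) {
        tracked.remove(uri);
        index.remove(outputId(uri));
    }

    private String outputId(String uri) {
        return lowerCaseInOutput ? uri.toLowerCase(Locale.ROOT) : uri;
    }
}
```

In this model, adding `http://host/Page.aspx` and `http://host/page.aspx` with lower-casing enabled leaves two tracked entries but one index entry, and removing either tracked URI deletes the index entry that the other still relies on.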
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Karl.
>>>>>>>> The issue is that the ES Output Connector uses the URI to create
>>>>>>>> the _id. When used with IIS, which allows case variation in the URI,
>>>>>>>> it creates multiple documents. Clients on Windows IIS are rarely
>>>>>>>> cognizant of that issue, as IIS is so lax in policing that out of the
>>>>>>>> box.
>>>>>>>> Currently, every case variation in the URI results in a new doc in the
>>>>>>>> index. This is only in the ES output connector.
>>>>>>>> I can add an optional checkbox to control that particular
>>>>>>>> action if that would help?
>>>>>>>> Regards,
>>>>>>>> Steph
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the update.
>>>>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>>>>> that care about case.  The web connector is one such because it's up
>>>>>>>>> to the web service to decide if case matters, so the web connector
>>>>>>>>> does not view URLs with case differences as being the same.  Other
>>>>>>>>> connectors will likely care as well. So I don't think lower-casing
>>>>>>>>> the document ID is a smart thing to do.
>>>>>>>>>
>>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>>>> the ID.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <
>>>>>>>>> steph@remcam.net> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Karl.
>>>>>>>>>>
>>>>>>>>>> I'll look into that.
>>>>>>>>>>
>>>>>>>>>> Another note:
>>>>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>>>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy
>>>>>>>>>> sources, particularly IIS...)
>>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>>> does not allow access to _id in the schema anymore, so no
>>>>>>>>>> copy_field etc. from _id). Hence "url".
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Steph
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>>>> may need to upgrade it.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk
>>>>>>>>>>> <steph@remcam.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Olivier
>>>>>>>>>>>> By all means.
>>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>>>>> the issue.
>>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>>> Postgres10 may be a bit slower. I have no empirical evidence at
>>>>>>>>>>>> the moment as I'm still delivering the project to UAT. Will keep
>>>>>>>>>>>> you posted.
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Steph
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard
>>>>>>>>>>>> <olivier.tavard@francelabs.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 23 Aug 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>
>>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>> - postgresql10-libs
>>>>>>>>>>>>> - postgresql10
>>>>>>>>>>>>> - postgresql10-server
>>>>>>>>>>>>> - postgresql10-contrib
>>>>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>>>>
>>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>>
>>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>>>>>     value: 'on'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: shared_buffers
>>>>>>>>>>>>>     value: '1024MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>>>>>   - option: max_wal_size
>>>>>>>>>>>>>     value: '14400MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: min_wal_size
>>>>>>>>>>>>>     value: '80MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>>>>>     value: '2MB'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: listen_addresses
>>>>>>>>>>>>>     value: '*'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: max_connections
>>>>>>>>>>>>>     value: '400'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>>>>>     value: '900'
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: datestyle
>>>>>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>>>>>
>>>>>>>>>>>>>   - option: autovacuum
>>>>>>>>>>>>>     value: 'off'
>>>>>>>>>>>>>
>>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
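
The `max_wal_size` value in the configuration above follows from the conversion rule quoted in its comment, `max_wal_size = (3 * checkpoint_segments) * 16MB`. A small sketch of that arithmetic, with the class and method names being hypothetical:

```java
// Hypothetical helper that applies the conversion rule from the config comment.
final class WalSizeSketch {
    // Size of one WAL segment in MB, as used in the rule quoted above.
    static final int SEGMENT_MB = 16;

    // max_wal_size (in MB) for a given legacy checkpoint_segments setting.
    static int maxWalSizeMb(int checkpointSegments) {
        return 3 * checkpointSegments * SEGMENT_MB;
    }

    public static void main(String[] args) {
        // checkpoint_segments=300 gives 3 * 300 * 16 = 14400, the value in the config.
        System.out.println(maxWalSizeMb(300) + "MB");
    }
}
```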
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>>> Been running fine for some weeks without
user intervention.
>>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>>> Steph
