SSO

After my last post I thought I ought to write something about Single Sign On (SSO) – this post will cover a bit more than just SSO.

I’ve done a lot of work with SSO but it’s one of those things you only visit periodically so it’s easy to forget things – also it’s harder than it feels like it should be!

This post is intended to be fairly generic, although I will use some examples from applications that I’ve worked on.

For anyone especially interested in Alfresco please note that I haven’t looked at any of the version 6/Identity Service or ADF stuff as yet.

My first point is that if you’re working on a new SSO project then the first thing you need to do is to work out how you are going to merge your user data from the different apps – this is generally the hardest part and I’ve seen quite a number of projects give up because of this. It’s a good argument for getting your SSO sorted out as soon as possible.

Single Sign On is where you log in once and are authenticated to all applications (also potentially logged out).

Shared Sign On is where you use a shared user dictionary e.g. LDAP but each application is responsible for its own authentication.

Authentication is who you are; authorization is what you can do.

There are many, many guides to this on the internet and if you’re really, really lucky you might find one you can understand.

Single Sign On

The main point is that if you do this right then the protocol/technology doesn’t matter all that much – this area is a lot more mature than it was even a couple of years ago.

What you are trying to achieve is to protect a list of endpoints (URLs) – probably not all of them, e.g. CSS – and to communicate a user id through to your application.

Try and keep this as (logically) separate as possible – just identify the user. It can be tempting to link this up to authorization, but try not to – it just causes trouble.

The aim here is to intercept the incoming request and process it before it gets to your application. There are different ways of doing this e.g. in Apache, in Tomcat, in filters.

For Java, what you are aiming for is that a call to request.getRemoteUser() will return the user id.
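
As a rough sketch of what that might look like – assuming the SSO layer in front (Apache, a Tomcat valve, etc.) has already validated the user and passed the result along, here as a hypothetical X-Remote-User header – a filter can wrap the request so that getRemoteUser() returns the id:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;

public class RemoteUserFilter implements Filter {

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;
        // The header name is an assumption - use whatever your SSO layer actually sets
        final String user = httpReq.getHeader("X-Remote-User");
        if (user != null) {
            // Wrap the request so the application sees the authenticated user id
            httpReq = new HttpServletRequestWrapper(httpReq) {
                public String getRemoteUser() {
                    return user;
                }
            };
        }
        chain.doFilter(httpReq, res);
    }

    public void init(FilterConfig config) {
    }

    public void destroy() {
    }
}

The filter then needs to be mapped to the endpoints you want to protect, which is where the web-fragment.xml approach below comes in.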

For Java applications my preference is to use a web-fragment.xml to define the filters and endpoints. The problem with this is that any filters defined in the web-fragment are applied after the filters in web.xml, so depending on how the application is structured this may not be possible (sometimes you can get away with writing another application-specific filter, but that’s not ideal – I managed to do this for 5.1 Share and repo but not 5.2 Share).

See https://issues.alfresco.com/jira/browse/ALF-21848 for a suggestion for restructuring the Share web.xml to make this easier.

Another common approach is to just edit the web.xml, sometimes using a maven profile, but that’s not ideal.

A quick aside here – be careful about your username/id. For example, we log in using an email address but use a different attribute as the user id; single sign on systems will support this but you may need to take care with your configuration.

Username/password log in

Sometimes you will want to log in using a username and password – this might be for a non-web client, e.g. for Alfresco that could be CMIS, the mobile apps or IMAP.

Probably the cleanest choice here is to make the client do the work by obtaining a token from the SSO server and using that in conjunction with your SSO mechanism of choice; however that’s not always practical.

Another alternative is to identify the request – one way is to look for the Authorization header – and proceed from there, either by using SSO proxy authentication (see below) or by carrying on to your normal username/password auth method (note Alfresco doesn’t support using a different username attribute out of the box).
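
As a minimal sketch of the detection step – assuming HTTP Basic credentials, which is only one of the possibilities – you can look at the Authorization header and pull out the username before deciding which route to take:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthSniffer {

    // Returns the username if the request carries Basic credentials,
    // or null if it should be handed over to the normal SSO flow instead.
    public static String extractUsername(String authorizationHeader) {
        if (authorizationHeader == null || !authorizationHeader.startsWith("Basic ")) {
            return null;
        }
        String decoded = new String(
                Base64.getDecoder().decode(authorizationHeader.substring("Basic ".length())),
                StandardCharsets.UTF_8);
        int colon = decoded.indexOf(':');
        return colon > 0 ? decoded.substring(0, colon) : null;
    }
}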

Proxy Authentication

This is where you have logged in to one application and want to pass that authentication information through to another.

The idea here is that after the client application (C) has authenticated, and thus allowed the server side application (A) to authenticate itself using the SSO mechanism, application A can obtain a token from the SSO server and pass that token through to the other server side application (B) it wants to talk to. Application B then validates the token against the SSO server and as a result obtains the identity information.

What you are doing is intercepting and wrapping the outgoing request from A to B and including the SSO token (typically as part of a header, but this can be handled by a library).

An Alfresco-specific aside: at least up to version 5.2, Share SSO communicates with the repo/platform/ACS (whatever you want to call it…) by setting a custom HTTP header containing the username (you have to be careful about the security of your configuration!)

So in summary this splits into the following parts:

  • Application A obtains an access token from the SSO server (standard part of libraries)
  • Application A injects the access token into requests to application B (depends on how requests are made)
  • Application B intercepts the request and uses the token to obtain the username (this will be the same configuration as is used for the normal SSO authentication, and subject to the same security constraints)
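
A minimal sketch of the second step, using the java.net.http client from Java 11 – the token itself comes from whatever SSO library application A is using, and the header name/format (a Bearer token here) and the URL are assumptions that depend on your protocol:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProxyAuthCall {

    // Inject the token obtained from the SSO server into the outgoing
    // request from application A to application B.
    public static String callApplicationB(String accessToken) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://application-b.example.org/api/resource"))
                .header("Authorization", "Bearer " + accessToken)
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}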

Authorization

As I said earlier try and keep authentication and authorization separate.

It probably helps to understand this if you consider the evolution of (a lot of) applications.

Start off with a custom authentication and authorization layer; then you realize that you need to use shared authentication (probably keeping the original method) so you add in synchronization (with LDAP and others); then you realize that you want SSO so you add that in later.

There are a few approaches you can take here:

(Consider the performance implications and how lookups will be cached – you’ll be doing this a lot, and the caching mechanisms, timeouts etc. are not as well thought through as they are for authentication.)

Native SSO

Most SSO systems support some form of attribute release – you can use this information to set the rights within your application e.g. OAuth scope or parse a list of LDAP group memberships.

This means that the application must be used in an SSO context.

This can be, and probably would be, done with the same validation request as is used for authentication, but do try to keep it logically separate.
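
As an illustration only – the attribute name, delimiter and mapping are all assumptions that depend on your identity provider configuration – turning a released group-list attribute into application rights might look like this:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class AttributeRoles {

    // Split a released attribute (assumed here to be a comma-separated list of
    // group names) into application rights.
    public static List<String> toRights(String releasedGroups) {
        if (releasedGroups == null || releasedGroups.isEmpty()) {
            return List.of();
        }
        return Arrays.stream(releasedGroups.split(","))
                .map(String::trim)
                .map(group -> "ROLE_" + group.toUpperCase()) // hypothetical mapping to an application right
                .collect(Collectors.toList());
    }
}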

Mixed SSO/Shared Auth backend

Use the SSO authentication and then make an additional query to the shared backend to determine rights, e.g. run a query against LDAP to determine group membership.
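
A sketch of that additional lookup using plain JNDI – the URL, base DN, filter and attribute names are assumptions and will depend on how your directory models groups (posixGroup/memberUid here; groupOfNames/member is also common):

import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class LdapGroupLookup {

    public static List<String> groupsFor(String userId) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.org:389");
        // Add Context.SECURITY_* properties here if your directory requires a bind
        DirContext ctx = new InitialDirContext(env);

        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setReturningAttributes(new String[] { "cn" });

        List<String> groups = new ArrayList<>();
        NamingEnumeration<SearchResult> results = ctx.search(
                "ou=groups,dc=example,dc=org",
                "(&(objectClass=posixGroup)(memberUid={0}))",
                new Object[] { userId }, controls);
        while (results.hasMore()) {
            groups.add((String) results.next().getAttributes().get("cn").get());
        }
        ctx.close();
        return groups;
    }
}

Remember the performance point above – you won’t want to do this on every request without some caching in front of it.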

Mixed SSO/Custom/Shared Auth

This is probably the most common model that I’ve seen even though it’s relatively painful to configure.

The custom authorization model is set to sync with the shared auth backend.

Use the SSO authentication and then make an additional query to the custom model to determine rights.

The custom model can contain rights (groups/group membership) not held in the shared backend.

There can be timing problems waiting for the custom model to be updated from the shared backend.

(Technically can be done without the shared auth but that’s really not a good idea so I’m not including it)

Proxy authorization service

Make a proxy-authenticated request to a separate service which can manage the lookup(s).

This is a more flexible version of native SSO but comes with potential performance issues.

User Information

I’m throwing this in as an extra because it’s pretty similar to authorization in concept.

You might want to have information about the user, for example, their name to use in your application.

This information can be retrieved using the same methods as for the authorization information.

I’ve put this separately mainly because if you want to use an avatar or picture for the user then you potentially have a much larger piece of data to consider (and you might want to convert the image into a more suitable format)

(Alfresco note – the use of an image isn’t supported out of the box but it’s something that can be done with customizations)

Authorization Management

Chances are that if you are using SSO then there will be some external process for managing authorization, e.g. LDAP groups.

If you want to manage authorization from within your application then ideally you need to authenticate against the existing management system before making any changes – this is one of the few occasions where you may want to be able to retrieve the user’s password from the SSO system, e.g. to authenticate against LDAP. The alternative is to use some form of super user from within your application, but that isn’t ideal as it can give elevated rights to your user by mistake.
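
As a sketch of the first option – binding to LDAP as the actual user before making a change, rather than using an application super user – where the DN pattern is an assumption about your directory layout:

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class LdapUserBind {

    // Authenticate against the directory as the real user before performing
    // any management operation on their behalf.
    public static DirContext bindAsUser(String userId, String password) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.org:389");
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        // Hypothetical DN pattern - map the user id to a DN however your directory expects
        env.put(Context.SECURITY_PRINCIPAL, "uid=" + userId + ",ou=people,dc=example,dc=org");
        env.put(Context.SECURITY_CREDENTIALS, password);
        return new InitialDirContext(env); // throws AuthenticationException if the bind fails
    }
}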

Single Sign Out

You may not care about this e.g. by default Share in SSO configuration removes the log out option from the menu and there’s no way to log out. (It’s not too hard to put back in however – see previous posts and the alfresco-cas project on github)

Most applications have their own way of determining whether you are logged in as well as whatever is used by SSO. This is to support non SSO log ins.

The important thing to remember here is that you are logged in to both the SSO system and the application and when you log out of one, you need to log out of both (otherwise you’ll just be logged straight back in again)

Normally the idea would be to log out of the application first and then forward to the SSO logout page (probably with a redirection parameter to send you somewhere afterwards).
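
A minimal sketch of that simple case – the logout URL and the name of the redirection parameter (service here, CAS-style; OIDC and SAML use different parameters) depend on your SSO server:

import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class LogoutServlet extends HttpServlet {

    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        HttpSession session = req.getSession(false);
        if (session != null) {
            session.invalidate(); // local application logout first
        }
        // Then hand over to the SSO server, asking it to send the browser somewhere sensible afterwards
        String afterLogout = URLEncoder.encode("https://app.example.org/", StandardCharsets.UTF_8);
        resp.sendRedirect("https://sso.example.org/cas/logout?service=" + afterLogout);
    }
}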

This doesn’t cover the more complex case where the user logs out of a different application. In this case the SSO server will send a logout request to your application (as part of the SSO logout process) – this logout request can then be handled to ensure that the local application logout also happens, i.e. logging out from application A also logs you out of applications B, C, D… as well as SSO.

It’s common for the first case to be handled but more unusual to handle the second case.

Summary

SSO is in a much better place than it was a few years ago but there’s still no one right way to do it and that’s likely to remain the case. (I’ve spent a lot of time helping people with CAS SSO for Alfresco)

SSO brings big benefits, both to the users and by providing a single place to manage authentication and, potentially, authorization.

Be careful! SSO provides a single point of failure for the organization and while this can be mitigated by suitable configuration you still need to be careful, especially during upgrades.

Don’t forget to keep on top of the upgrades!

Keep it as simple as possible as any customization makes upgrading more difficult (you are keeping on top of the upgrades aren’t you?)

Try to keep to only using one SSO system otherwise it’s not really SSO, as well as being more difficult to maintain – you’ll probably end up with some sort of shared LDAP based backend e.g. OpenLDAP or ActiveDirectory.

Make sure that you keep authentication and authorization (logically) separate.

If you write your application in a suitably flexible way then it should be possible to easily support any current or future protocols. To achieve this there are three main parts to consider:

  • Make sure the client application can handle the authentication protocol – don’t forget authentication failures (should be fairly easy with client libraries, response interceptors etc)
  • Abstract the mechanism for protecting incoming requests – how to specify the endpoints to protect, and how to pass the validated information through to your application (e.g. web-fragment.xml and getRemoteUser for Java)
  • Provide an abstraction layer for proxy requests i.e. make sure it’s easy to modify any requests between different application components.


Alfresco DevCon 2019

Alfresco DevCon – Some thoughts and impressions

These are very much impressions and if you want more details then I encourage you to go and look at the presentations – there’s a lot of good stuff there!

I haven’t really been keeping up with what’s going on in the Alfresco ecosystem for a couple of years so I thought that DevCon would be a good opportunity to catch up.

It’s always nice to get an idea of what’s going on and meet up with friends old and new.

One of the things I like about DevCon is that it gives me a chance to get a, fairly conservative, view on what’s going on in the technology space as well as the more specific Alfresco related topics.

There were about 300 attendees and four parallel streams of talks – I like the lightning talk format with auto advancing slides and five minutes per presenter – it’s a different skill but allows you to hear from more people.

John Newton’s key note is always a good introduction and provides context to the rest of the conference.
The overall feeling of this was surprisingly similar to Zaragoza two years ago – I think this is good as it shows more consistency of vision than has sometimes been the case in the past.
Joining up process, process driven, and content feels like a good thing to be doing.
Once again AI was prominent. To be honest, despite working at the Big Data Institute, I don’t find AI (or perhaps Machine Learning (ML) would be a better term) very interesting in the Alfresco context. As the Hackathons this year, and even two years ago, have shown, it is fairly easy to link this up to Alfresco, so the interesting part is what you do once you’ve linked it up – and that’s going to be very specific to your own needs.
The other hype technology mentioned was blockchain – this seems fairly sensible in the context of governance services.

Moving on to the roadmap and architecture.
It felt like Alfresco were embarking on a fairly major change after 5.2 and I certainly thought that the right thing to do would be to wait a couple of versions before attempting to upgrade, and I don’t think I’m alone in that!
My impression is that it’s still not there yet although progress has been made and it seems to be heading in a generally good direction – more on this below.

The blog posted by Jeff Potts @jeffpotts01 about whether Alfresco should provide more than Lego bricks caused a few conversations in the Q&As – I suggest that the CEO and product managers have a conversation here as I got conflicting messages from the two sessions. I’m firmly in the camp that there needs to be a reasonably functional application if only to show what can be done by putting the Lego together.

One of the technology themes here was that containerisation is coming of age. This is the same direction as we have been going internally so it’s good to have some confirmation of this.
We’ve got experience of Kubernetes so we’re not afraid of this – it’s perhaps overkill for a lot of installations but does have the benefit of being relatively cloud platform agnostic (interestingly Azure was seen as the second most popular platform to AWS, whereas our observation is that the Google platform is the best option).
Another observation is that as we’re being pushed more towards cloud architectures it would be nice if the official S3 object storage connector was released to the community (there is already a community version), and perhaps other object storage connectors?

Another potentially huge change/improvement is the introduction of an event bus into a right-sized (as Brian said, not micro) services architecture – this is one of the presentations I need to catch up on at home. I was advocating ESB based service architectures ten years ago so it’s nice to see this sort of thing happening. Jeff Potts did a nice presentation of how he had effectively done his own implementation of this.

(As an aside, if you’re spec’ing out new hardware get 32GB if you can.)
The ADF related applications feel a bit like where Share was in about 2010 – I like this but it’s not yet a compelling reason to upgrade an existing installation.
Where I’d like to be is to be able to run Share alongside the new stuff with the objective of phasing out Share eventually – I don’t feel we’re there yet.
(Good to see the changes for ADF 3, but I’m puzzled as to how the old stuff got there when using OpenAPI, as the new stuff seems to match what would come out of the code generators…)

One of my objectives was to get a feel for how hard it would be to upgrade, and what the benefits would be, and as Angel’s (@AngelBorroy) and David’s (@davidcognite) sessions were the most popular, lots of other people had the same idea (thanks for the shout out Angel). I’m also going to mention Bindu’s (@binduwavell) question in the CEO Q&A here.

This is tricky, and judging from this, and some informal conversations, there’s still a real lack of help and support from both Alfresco and partners in this area.
One of the strengths of Alfresco is the ability to provide your own customisations but it does, potentially, make it difficult to upgrade.
I think the new architecture is a step in the right direction here as it’s going to make it easier to introduce some loosely coupled customisations – there will still be a place for the old way however.

My biggest problem with upgrades is SSO (it always breaks!), so I was very interested to see the introduction of the Alfresco Identity Service – it’s great to see this area getting some love.

I really want this to work but I’m pretty disappointed with what was presented.

Keycloak is a solid choice for SSO but I *really* don’t want to introduce it running in parallel with my existing SSO infrastructure – by all means have it there as an option for people who don’t have existing infrastructure (or for test/dev environments) but please try and do a better job of integrating with existing, standards based, installations – this is quite a mature area now.

No integration with Share is a pretty big absence (and I’m assuming the same goes for mobile) – I suspect that the changes to ACS for this mean that the existing SSO won’t work any more (I’ve seen it broken but I’m not sure why yet).

In principle I agree with the aim of moving the authentication chain out of the back end, but there may be some side effects. For example, one conversation I had was around the frustration of not being able to use the jpegPhoto from LDAP as the avatar in Alfresco – this is fairly easy to provide as a customization to the LDAP sync (I’ve done it and can share the code) but doesn’t fit so well if you’re getting the identity information from an identity service.

All in all an enjoyable conference with lots learned.

P.S. One option for the future would be to reduce the carbon footprint of the conference by holding it in Oxford – it’s possible to use one of the colleges when the students aren’t there e.g. end of June to October.

Edinburgh was nice but I think a few people found it a wee bit chilly.

Simple Kibana monitoring (for Alfresco)

This post is inspired by https://github.com/miguel-rodriguez/alfresco-monitoring and there’s a lot of useful info in there.

The aim of this post is to allow you to quickly get up and running and monitoring your logs.

I’m using puppet to install, even though we don’t have a puppet master, as there are modules provided by Elastic that make it easy to install and configure the infrastructure.

If you’re not sure how to use puppet look at my post Puppet – the very basics

Files to go with this post are available on github

I’m running on Ubuntu 16.04 and at the end have

  • elasticsearch 5.2.1
  • logstash 1.5.2
  • kibana 5.2.1

The kibana instance will be running on port 5601

Elastic Search

Elastic Search puppet module
Logstash puppet module

puppet module install elastic-elasticsearch --version 5.0.0
puppet module install elastic-logstash --version 5.0.4

The manifest file

class { 'elasticsearch':
  java_install => true,
  manage_repo  => true,
  repo_version => '5.x',
}

elasticsearch::instance { 'es-01': }
elasticsearch::plugin { 'x-pack': instances => 'es-01' }

include logstash

# You must provide a valid pipeline configuration for the service to start.
logstash::configfile { 'my_ls_config':
  content => template('wrighting-logstash/logstash_conf.erb'),
}

logstash::plugin { 'logstash-input-beats': }
logstash::plugin { 'logstash-filter-grok': }
logstash::plugin { 'logstash-filter-mutate': }

Configuration – server


puppet apply --verbose --detailed-exitcodes /etc/puppetlabs/code/environments/production/manifests/elk.pp

/etc/puppetlabs/code/modules/wrighting-logstash/templates/logstash_conf.erb

Configuration – client

This is a fairly big change over the alfresco-monitoring configuration as it uses beats to publish the information to the logstash instance running on the server.

For simplicity I’m not using redis.

Links for more information or just use the code below
https://www.elastic.co/guide/en/beats/libbeat/5.2/getting-started.html
https://www.elastic.co/guide/en/beats/libbeat/5.2/setup-repositories.html


wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
apt-get update
apt-get install filebeat metricbeat

Partly for convenience I chose to install both beats on the ELK server and connect them directly to elasticsearch (the default) before installing elsewhere. This has the advantage of automatically loading the elasticsearch template file.

You should normally disable the elasticsearch output and enable the logstash output if you are sending tomcat log files.

Filebeat


curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-5.2.1-amd64.deb
sudo dpkg -i filebeat-5.2.1-amd64.deb

https://www.elastic.co/guide/en/beats/filebeat/5.2/config-filebeat-logstash.html
Note that this configuration relies on the following change being made to the Tomcat access log configuration in server.xml:

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
prefix="access-" suffix=".log"
pattern='%a %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" %D "%I"'
resolveHosts="false"/>

Edit /etc/filebeat/filebeat.yml

filebeat.prospectors:
# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.
- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/tomcat7/access-*.log
  tags: [ "TomcatAccessLog" ]

- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/tomcat7/alfresco.log
  tags: [ "alfrescoLog" ]

- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/tomcat7/share.log
  tags: [ "shareLog" ]

output.logstash:
  hosts: ["127.0.0.1:5044"]

Don’t forget to start the service!

If you are using the filebeat apache2 module then check your error.log as you may need to configure the access for the apache2 status module

Metric Beat

Check the documentation but it’s probably OK to mostly leave the defaults

Port forwarding

If you need to set up port forwarding the following will do it.
Edit .ssh/config

Host my.filebeats.client
  RemoteForward my.filebeats.client:5044 localhost:5044

Then start the tunnel:
ssh -N my.filebeats.client &
Note that you will need to restart the tunnel if you change/restart logstash.

Logstash config

Look at the logstash_conf.erb file.

Changes from alfresco config

  • You will need to change [type] to [tags]
  • multi-line is part of the input, not the filters – note this could be done in the filebeat config
  • jmx filters removed as I’m using community edition
  • system filters removed as I’m using the metricbeat supplied configuration

Exploring ElasticSearch

https://www.elastic.co/guide/en/elasticsearch/reference/1.4/_introducing_the_query_language.html
View the indexes
curl 'localhost:9200/_cat/indices?v'
Look at some content – defaults to 10 results
curl 'localhost:9200/filebeat-2017.02.23/_search?q=*&pretty'
Look at some content with a query
curl -XPOST 'localhost:9200/filebeat-2017.02.23/_search?pretty' -d@query.json
query.json

{
"query": { "match": { "tags": "TomcatAccessLog"} },
"size": 10,
"sort": { "@timestamp": { "order": "desc"}}
}

Kibana

Set up

puppet module install cristifalcas-kibana --version 5.0.1
This gives you a kibana install running on http://localhost:5601

Note that this is Kibana version 5

If you are having trouble with fields not showing up try – Management -> Index Patterns -> refresh and/or reload the templates

curl -XPUT 'http://localhost:9200/_template/metricbeat' -d@/etc/metricbeat/metricbeat.template.json 
curl -XPUT 'http://localhost:9200/_template/filebeat' -d@/etc/filebeat/filebeat.template.json 

A good place to start is to load the Default beats dashboards

/usr/share/metricbeat/scripts/import_dashboards
/usr/share/filebeat/scripts/import_dashboards

There are no filebeat dashboards for v5.2.1 – there are some for later versions but these are not backwards compatible

My impression is that this is an area that will improve with new releases in the near future (updates to the various beats)
To install from git instead:

git clone https://github.com/elastic/beats.git
cd beats
git checkout tags/v5.2.1
/usr/share/filebeat/scripts/import_dashboards -dir beats/filebeat/module/system/_meta/kibana/
/usr/share/filebeat/scripts/import_dashboards -dir beats/filebeat/module/apache2/_meta/kibana/

Dashboards

Can be imported from the github repository referenced at the top of the article

Changes from alfresco-monitoring:

  • No system indicators – relying on the default beats dashboards
  • All tomcat servers are covered by the dashboard – this allows you to filter by node name in the dashboard (and no need to edit the definition files)
  • No jmx

X-Pack

X-Pack is also useful because it allows you to set up alerts

The puppet file shown will install X-Pack in elasticsearch

To install in kibana
(I have not managed to get this working, possibly due to not configuring authentication, and it breaks kibana)
sudo -u kibana /usr/share/kibana/bin/kibana-plugin install x-pack

Not done

This guide doesn’t show how to configure any form of authentication.

Adding JMX

It should be reasonably straight-forward to add JMX indicators but I’ve not yet done so.

Puppet – the very basics

Installing puppet


apt-get -y install ntp
dpkg -i puppetlabs-release-pc1-xenial.deb
gpg --keyserver pgp.mit.edu --recv-key 7F438280EF8D349F
apt-get update

Puppet agent

apt-get install puppet-agent
Edit /etc/puppetlabs/puppet/puppet.conf to add the [master] section

puppet agent -t -d

Note this does apply the changes!

Then you can start the puppet service

export PATH=$PATH:/opt/puppetlabs/bin

Puppet server

apt-get install puppetserver
You will probably also want to install puppetdb and puppetdb-termini (apt-get install puppetdb puppetdb-termini) and start the puppetdb service after configuring the PostgreSQL database.

To look at clients:
puppet cert list --all

To validate a new client run puppet cert --sign client

Using puppet locally

You don’t need to use a puppet server

Installing modules

The module page will tell you how to do this e.g.
puppet module install module_name --version version

However it’s probably a better idea to use librarian-puppet and define the modules in the Puppetfile

apt-get install ruby
gem install librarian-puppet
cd /etc/puppetlabs/code/environments/production
librarian-puppet init

Once you have edited your Puppetfile

librarian-puppet install

Running a manifest

Manual run

puppet apply --verbose --detailed-exitcodes /etc/puppetlabs/code/environments/production/manifests/manifest.pp

Managed run

The control of what is installed is via the file /etc/puppetlabs/code/environments/production/manifests/site.pp
Best practice indicates that you use small modules to define your installation.
Example site.pp

node default {
   include serverdefault
}
node mynode {
}

You can then check it with puppet agent --noop
Note that --test actually applies the changes.

Running over ssh

Not recommended!
For your ssh user create .ssh/config which contains the following:

Host *
	RemoteForward %h:8140 localhost:8140

You can then set up a tunnel via ssh -N client & (assuming that you can ssh to the client as normal!)

On the client you then need to define the puppet server as localhost in /etc/hosts, then run puppet agent --test --server puppetserver as usual.
After that you can run the agent as normal – don’t forget to start the service.

Config

With the current version you should use an environment specific hiera.yaml e.g. /etc/puppetlabs/code/environments/production/hiera.yaml

The recommended approach is to use roles and profiles to define how each node should be configured (out of scope of this post) https://docs.puppet.com/pe/2016.5/r_n_p_full_example.html

Encrypted

See https://github.com/voxpupuli/hiera-eyaml but check your puppet version

gem install hiera-eyaml
puppetserver gem install hiera-eyaml

See https://puppet.com/blog/encrypt-your-data-using-hiera-eyaml

Test using something like puppet apply -e '$var = lookup(config::testprop) notify {$var: }' where config::testprop is defined in your secure.eyaml file

Host specific config

The hieradata for this node is defined in the hierarchy as:
/etc/puppetlabs/code/environments/production/hieradata/nodes/nodename.yaml

Groups of nodes

You can use multiple node names or a regex in your site.pp (remember only one node definition will be matched)

Another alternative is to use facts, either existing or custom, to define locations in your hiera hierarchy

If this is too crude then you can use an ENC

A very skeleton python enc program is given below:

#!/usr/bin/env python

import sys
from yaml import dump

# The node name is passed as the first argument by the puppet master
node_name = sys.argv[1]

node = {
    'parameters': {
        "config::myparam": 'myvalue'
    }
}

dump(node, sys.stdout,
     default_flow_style=False,
     explicit_start=True,
     indent=10)

Puppet setup

Add the following section to /etc/puppetlabs/puppet/puppet.conf

[main]
server = puppetmaster
certname = nodename.mydomain
environment = production
runinterval = 1h

Modules

Detailed module writing is out of scope of this post but a quick start is as follows:

puppet module generate wrighting-serverdefault

Then edit manifests/init.pp

Upgrade time?

Inspired by conversations I had at the Alfresco BeeCon I’ve decided to put down some of my thoughts and experiences about going through the upgrade cycle.

It can be a significant amount of work to do an upgrade, even if you have little or no customization, as you need to check that none of the functionality you rely on has changed or broken so it’s not something to be undertaken lightly.

In my experience there are several main factors in helping decide whether it’s time to upgrade:

  • Time since last upgrade
  • Security patches
  • Bug fixes you’ve been waiting for
  • Exciting new features

Using the example of Alfresco Community Edition I find that a good time to start thinking about this is when a new EA version has been released. This means that the previous release is about as stable as it’s going to get and new features are starting to be included. I know many people are a lot more conservative than this so you’ll have to think about what works with your organization.

Time since last release

This is often the deciding factor as you don’t want to get too far behind in the release cycle, otherwise upgrading can become a nightmare. In a previous job I observed an upgrade project that took over a year to complete despite having considerable resources thrown at it and not having any significant new features added – mostly because of the large gap in versions (although there were some poor customization decisions)

Security patches

It’s always important to review security patches and apply them when appropriate but this is generally much easier to do if you’re on a recent version so this is an argument for keeping reasonably up to date.

Bug fixes

Sometimes a bug fix will make it into the core product and you can remove it from your customizations (a good thing), sometimes it’s almost like a new feature but sometimes it will expose a new bug. Generally a positive thing to have.

Exciting new features

Shiny new toys! It’s always tempting to get hold of interesting new features but, unless there’s a really good reason that you want it, it’s usually best to wait for it to stabilize before moving to production but this can be a reason for a more aggressive release cycle.

My process

This is a little more Alfresco specific but the general points apply.

OK so there’s a nice new, significant, version out – for the sake of argument let’s say 5.2 – and I’m on version 5.0 in production, so what do I do?

Wait for the SDK to catch up – this is a bit frustrating as sometimes I only have quite a short window to work on Alfresco and if the SDK isn’t working then I’ll have to go and do something else.

I feel that the release should be being built with the SDK but it does tend to lag significantly behind. At the time of writing Alfresco_Community_Edition_201605_GA isn’t supported at all and Alfresco_Community_Edition_201604_GA needs some patching of the SDK while Alfresco_Community_Edition_201606_EA is out. (The SDK 2.2.0 is listed on the release notes page for all of these even though it doesn’t work…)

It’s also a little unclear what works with what – for example can I run Share 5.1.g (from 201606_EA) with Repo 5.1.g (from 201604_GA)? (which I might be able to make work with the SDK, and I know there are bug fixes I want in Share 5.1.g…) Or stick with the Repo 5.1.g/Share 5.1.f combo found in the 201605 GA? (which I can’t build yet)

I should have an existing branch (see below) that is close to working on an earlier EA (or GA) version so in theory I can just update the version number(s) in the pom.xml and rebuild and test. In practice it’s more complicated than that as it’s necessary to go through each customization and check the implications against the code changes in the product (again see below). Sometimes this is easier than others, for example, 5.0.c to 5.0.d seemed like a big change for a minor version increment.

Why create a branch against an EA?

As I mentioned above I’ll try and create a branch against the new EA. Why do this when there’s no chance that I’ll deploy it?

There are a several reasons that I like to do this.

I don’t work with Alfresco all the time so while my thoughts are in that space it’s convenient, and not much slower (see below), to check the customizations against two versions rather than one.

It’s a good time to find and submit bugs – if you find them in the first EA then you’ve got a chance that they’ll be fixed before the GA.

Doing the work against the EA, hopefully, means that when the next GA comes along it won’t be too hard to get ready for a production release.

You get a test instance where you can try out the exciting new features and see if they are good/useful as they sound.

How to check customizations?

This can be a rather time consuming process, and, as it’s not something you do very often, easy to get wrong.

There are a number of things you might need to check (and I’m sure that there are others)

  • Bean definitions
  • Java changes
  • web.xml

While I’m sure everybody has a good set of tests to check their modifications, it’s unlikely that these will be sufficient.

Bean definitions

You might have made changes, for example, to restrict permissions on site creation, and the default values have changed – in this case extra values were added between 4.2 and 5.0, and 5.0 and 5.1

Java changes

Sometimes you might need to override, or extend, existing classes so you need to see if the original class has changed and if you need to take account of these changes

web.xml

CAS configuration is an example of why you might have changed your web.xml and need to update it.

Upgrade Assistant

I’ve started a project https://github.com/wrighting/upgrade-assist to try and help with the more mechanical aspects of checking customizations. I’ve found it helpful and I hope other people will as well – see github for further details.


Python, MPI and Sun Grid Engine

Really you need to go here
Do not apt-get install openmpi-bin without reading this first

To see whether your Open MPI installation has been configured to use Sun Grid Engine:

ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

If the gridengine component isn’t listed then rebuild Open MPI from source with SGE support:

./configure --with-sge
make
make install

Do the straightforward virtualenv setup

sudo apt-get install python-virtualenv
virtualenv example
cd example
source bin/activate
pip install numpy
pip install cython

Installing hdf5 with mpi

Install hdf5 from source to ~/install if necessary – the package should be OK

wget http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.13.tar.gz
tar zxvf hdf5-1.8.13.tar.gz
cd hdf5-1.8.13
export CC=/usr/local/bin/mpicc
mkdir ~/install
./configure --prefix=/home/${USER}/install --enable-parallel --enable-shared
make
#make test
make install
#If you want to...
export PATH=/home/${USER}/install/bin:${PATH}
export LD_LIBRARY_PATH=/home/${USER}/install/lib:${LD_LIBRARY_PATH}
export CC=/usr/local/bin/mpicc
pip install mpi4py
pip install h5py --install-option="--mpi"
#If hdf5 is installed in your home directory add --hdf5=/home/${USER}/install to the --install-option

SGE configuration

http://docs.oracle.com/cd/E19923-01/820-6793-10/ExecutingBatchPrograms.html is quite useful but a number of the commands are wrong…

Before you can run parallel jobs, make sure that you have defined the parallel environment and the queue.
To see queues

qconf -spl

To define a new parallel environment

qconf -ap mpi_pe

To look at the config of a parallel environment

qconf -sp mpi_pe

The value of control_slaves must be TRUE; otherwise, qrsh exits with an error message.

The value of job_is_first_task must be FALSE or the job launcher consumes a slot. In other words, mpirun itself will count as one of the slots and the job will fail, because only n-1 processes will start.
The allocation_rule must be either $fill_up or $round_robin or only one host will be used.

You can look at the remote execution parameters using:

qconf -sconf

Then add a queue and attach the parallel environment to it:

qconf -aq mpi.q
qconf -mattr queue pe_list "mpi_pe" mpi.q

Checking and running jobs

The program demo.py – note order of imports. This tests the use of h5py in an MPI environment so may be more complex than you need.

from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank  # The process ID (integer 0-3 for 4-process run)

f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
#f.atomic = True

dset = f.create_dataset('test', (MPI.COMM_WORLD.Get_size(),), dtype='i')
dset[rank] = rank

grp = f.create_group("subgroup")
dset2 = grp.create_dataset('host',(MPI.COMM_WORLD.Get_size(),), dtype='S10')
dset2[rank] = MPI.Get_processor_name()

f.close()

The command file

source mpi/bin/activate
mpiexec --prefix /usr/local python demo.py

Submitting the job

qsub -cwd -S /bin/bash -pe mpi_pe 2 runq.sh 


Checking mpiexec

mpiexec --prefix /usr/local -n 4 -host oscar,november ~/temp/mpi4py-1.3.1/run.sh

where run.sh contains:

#!/bin/bash
(cd
source mpi/bin/activate
cd ~/temp/mpi4py-1.3.1/
python demo/helloworld.py
)

To avoid extracting mpi4py/demo/helloworld.py

#!/usr/bin/env python
"""
Parallel Hello World
"""

from mpi4py import MPI
import sys

size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

sys.stdout.write(
    "Hello, World! I am process %d of %d on %s.n"
    % (rank, size, name))

Troubleshooting

If you get Host key verification failed. make sure that you can ssh to all the nodes configured for the queue (server1 is not the same as server1.example.org)

Use NFSv4 – if you use v3 then you will get the following message:

File locking failed in ADIOI_Set_lock(fd 13,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 5.
- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
ADIOI_Set_lock:: Input/output error
ADIOI_Set_lock:offset 2164, length 4
File locking failed in ADIOI_Set_lock(fd 12,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 5.
- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
ADIOI_Set_lock:: Input/output error
ADIOI_Set_lock:offset 2160, length 4
[hostname][[54842,1],3][btl_tcp_endpoint.c:459:mca_btl_tcp_endpoint_recv_blocking] recv(17) failed: Connection reset by peer (104)
[hostname][[54842,1],2][btl_tcp_endpoint.c:459:mca_btl_tcp_endpoint_recv_blocking] recv(15) failed: Connection reset by peer (104)

Using

The basic idea is to split the work into chunks and then combine the results. You can see from the demo.py above that if you are using h5py then writing your results out is handled transparently, which is nice.

Scatter(v)/Gather(v)

https://www.cac.cornell.edu/ranger/MPIcc/scatterv.aspx
https://github.com/jbornschein/mpi4py-examples/blob/master/03-scatter-gather
http://stackoverflow.com/questions/12812422/how-can-i-send-part-of-an-array-with-scatter/12815841#12815841

The v variant is used if you cannot break the data into equally sized blocks.

Barrier blocks until all processes have called it.

Getting results from all workers – this will return an array [ worker_data from rank 0, worker_data from rank 1, … ]

worker_data = ...   # whatever this rank has computed
comm.Barrier()

all_data = comm.gather(worker_data, root=0)
if rank == 0:
    # all_data contains the results, one entry per rank
    pass

BCast/Reduce

Bcast sends data from one process to all the others.
Reduce combines data from all processes.

Send/Recv

This is probably easier to understand than scatter/gather but you are doing extra work.

There are two obvious strategies available.

Create a results variable of the right dimensions and fill it in as each worker completes:

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

#Very crude e.g. if total_size is not a multiple of size
total_size = 20
chunk_size = total_size // (size - 1)   # integer division so it can be used as a slice index

if rank == 0:
    all_data = np.zeros((total_size, 4), dtype='i4')
    num_workers = size - 1
    closed_workers = 0
    while closed_workers < num_workers:
        data = np.zeros((chunk_size, 4), dtype='i4')
        x = MPI.Status()
        comm.Recv(data, source=MPI.ANY_SOURCE,tag = MPI.ANY_TAG, status = x)
        source = x.Get_source()
        tag = x.Get_tag()
        insert_point = ((tag - 1) * chunk_size)
        all_data[insert_point:insert_point+chunk_size] = data
        closed_workers += 1

Wait for each worker to complete in turn and append to the results

AC_TAG = 99 
if rank == 0:
   for i in range(size-1):
       data = np.zeros((chunk_size, 4), dtype='i4')
       comm.Recv(data, source=i+1,tag = AC_TAG)
       if i == 0:
         all_data = data
       else:
         all_data = np.concatenate((all_data,data))

Just as an example here we are expecting the data to be a numpy 2d array but it could be anything and could just be created once with np.empty as the contents will be overwritten.

The key difference to notice is the value of the source and tag parameters to comm.Recv – these need to be matched by the corresponding parameters to comm.Send, i.e. tag = rank for the first example and tag = AC_TAG for the second,
e.g. comm.Send(ac, dest=0, tag = rank)
Your use of tag and source may vary…

Input data

There are again different ways to do this – either have rank 0 do all the reading and use Send/Recv to send the data to be processed, or let each worker get its own data.

Evaluation

MPI.Wtime() can be used to get the elapsed time between two points in a program

Aikau and CMIS

This is still a work in progress but now has a released version and, with a small amount of testing, seems to work – do please feel free to try it out and feed back either via the blog or as an issue on github.

The post was originally published in order to help with jira 21647.

Following on from my previous post CMIS Dojo store I thought I’d provide an example of working with Aikau and the store from github https://github.com/wrighting/dojoCMIS

Note that this is not intended to be a detailed tutorial on working with Aikau, or CMIS, but should be enough to get you going.

As a caveat, there are some fairly significant bugs that cause problems with this – see the JIRA issue referenced above and the notes on configuration below.

Installation

The code is available as a jar for use with Share but, of course, there’s nothing to stop you using the javascript on its own as part of an Aikau (or dojo) based application.

Just drop the jar into the share/WEB-INF/lib folder or, if you are using maven, you can install using jitpack.io with the following dependency.

        <dependency>
            <groupId>com.github.wrighting</groupId>
            <artifactId>dojoCMIS</artifactId>
            <version>v0.0.1</version>
        </dependency>

Background

A good example for Aikau is Share Page Creation Code

My scenario is as follows:

We have a custom content type used to describe folders containing some work. These folders can be created anywhere however it’s useful to have a page that lists all the custom properties on all the folders of this type. As an added bonus we’ll make these editable as well.

The first thing I’m going to do is write my CMIS query and make sure it returns what I want.
It will end up something like this:
SELECT * FROM wrighting:workFolder join cm:titled as t on cmis:objectId = t.cmis:objectId

It is better to enumerate the fields rather than using * but I’m using * to be concise here.

Simple Configuration

As part of the dojoCMIS jar there’s a handy widget called CmisGridWidget that inspects the data model to fill in the detailed configuration of the column definitions.

You do need to define which columns you want to appear in the data but that is fairly straightforward.

So in Data Dictionary/Share Resources/Pages create a file of type surf:amdpage, with content type application/json. See the file in aikau-example/

You can then access the page at /share/page/hrp/p/name

{
  "widgets": [{
    "name": "wrighting/cmis/CmisGridWidget",
    "timeout" : 10000,
    "config": {
      "query": {
        "path": "/Sites/test-site/documentLibrary/Test"
      },
      "columns" : [ {
            "parentType": "cmis:item",
            "field" : "cmis:name"
          }, {
            "parentType": "cm:titled",
            "field" : "cm:title"
          }, {
            "parentType": "cm:titled",
            "field" : "cm:description"
          }, {
            "parentType": "cmis:item",
            "field" : "cmis:creationDate"
          }
          ]
    }
  }]
}

You’ll notice that this is slightly different from the example/index.html in that it uses CmisGridWidget instead of CmisGrid. (I think it’s easier to debug using example/index.html).
The Widget is the same apart from inheriting from alfresco/core/Core, which is necessary to make it work in Alfresco, and using CmisStoreAlf instead of CmisStore to make the connection.

There are a couple of properties that can be used to improve performance.

If you set configured: true then CmisGrid won’t inspect the data model but will use the columns as configured. If you want to see what a full configuration looks like then set loggingEnabled: true and the full config will be logged, and can be copied into your CmisGrid definition. Note that if you do this then changes to the data model, e.g. new LIST constraint values, won’t be dynamically updated.

What it does in the jar

Share needs to know about my extensions. I’ve also decided that I’m going to import dgrid because I want to use a table to show my information but the beauty of this approach is that you can use any dojo widget that understands a store so there are a lot to choose from. (I don’t need to tell it about the wrighting or dgrid packages because that’s already in the dojoCMIS jar)

So in the jar these are defined in ./src/main/resources/extension-module/dojoCMIS-extension.xml, if you were doing something similar in your amp you’d add the file src/main/amp/config/alfresco/web-extension/site-data/extensions/example-amd-extension.xml

 <extension>
   <modules>
      <module>
         <id>example Package</id>
         <version>1.0</version>
         <auto-deploy>true</auto-deploy>
         <configurations>
           <config evaluator="string-compare" condition="WebFramework" replace="false">
             <web-framework>
                <dojo-pages>
                  <packages>
                    <package name="example" location="js/example"/>
                  </packages>
               </dojo-pages>
             </web-framework>
           </config>
         </configurations>
      </module>
   </modules>
</extension>

For convenience there’s also a share-config-custom.xml which allows you to specialize the type to surf:amdpage

CmisGrid will introspect the data model, using CMIS, to decide what to do with each field listed in the columns definition.

Configuration

The targetRoot is slightly tricky due to authentication issues.

Prior to 5.0.d you cannot authenticate against the CMIS API without messing about with tickets otherwise you’ll be presented with a popup to type in your username and password (basic auth). (The ticket API seems to have returned in repo 5.2 so it should be possible to use that again – but untested)

(For an example using authentication tickets see this example)

In 5.0.d and later it will work by using the share proxy by default, however updates (POST) are broken – see JIRA referenced above.

You can use all this outside Share (see example/index.html in the git repo) but you’ll need to get all your security configuration sorted out properly.

Which Store should I use?

There are two stores available – wrighting/cmis/store/CmisStore and wrighting/cmis/store/CmisStoreAlf.

Unsurprisingly CmisStoreAlf is designed to be used within Alfresco as it uses CoreXhr.serviceXhr to make calls.

CmisStore uses jsonp callbacks so is suitable for use outside Alfresco. CmisStore will also work inside Alfresco under certain circumstances e.g. if CSRF protection isn’t relevant.

Detailed Configuration

If you want more control over your configuration then you can create your own widget as shown below.
The jsonModel for the Aikau page (eg in src/main/amp/config/alfresco/site-webscripts/org/example/alfresco/components/page/get-list.get.js) should contain a widget definition along with the usual get-list.desc.xml and get-list.get.html.ftl (<@processJsonModel group="share"/>)

model.jsonModel = {
 widgets : [
  {
    name : "example/work/List",
    config : {}
  }
 ]
}

Now we add the necessary js files in src/main/amp/web/js according to the locations specified in the configuration above.

So I’m going to create a file src/main/amp/web/js/example/work/List.js

Some things to point out:

This is quite a simple example showing only a few columns but it’s fairly easy to extend.

Making the field editable is a matter of defining the cell as:
editor(config, Widget) but look at the dgrid docs for more details.

I like to have autoSave enabled so that changes are saved straight away.

To stop the post getting too cluttered I’m not showing the List.css or List.properties files.

There is another handy widget called cggh/cmis/ModelMultiSelect that will act the same as Select but provide MultiSelect capabilities.

The List.html will contain

 <div data-dojo-attach-point="wrighting_work_table"></div>


define(
        [
                "dojo/_base/array", // array.forEach
                "dojo/_base/declare", "dojo/_base/lang", "dijit/_WidgetBase", "dijit/_TemplatedMixin", "dojo/dom", "dojo/dom-construct",
                "wrighting/cmis/store/CmisStore", "dgrid/OnDemandGrid", "dgrid/Editor", "dijit/form/MultiSelect", "dijit/form/Select",
                "dijit/form/DateTextBox", "dojo/text!./List.html", "alfresco/core/Core"
        ],
        function(array, declare, lang, _Widget, _Templated, dom, domConstruct, CmisStore, dgrid, Editor, MultiSelect, Select, DateTextBox, template, Core) {
            return declare(
                    [
                            _Widget, _Templated, Core
                    ],
                    {
                        cssRequirements : [
                                {
                                    cssFile : "./List.css",
                                    mediaType : "screen"
                                }, {
                                    cssFile : "js/lib/dojo-1.10.4/dojox/grid/resources/claroGrid.css",
                                    mediaType : "screen"
                                },
                                {
                                  cssFile: 'resources/webjars/dgrid/1.1.0/css/dgrid.css'
                                }
                        ],
                        i18nScope : "WorkList",
                        i18nRequirements : [
                            {
                                i18nFile : "./List.properties"
                            }
                        ],
                        templateString : template,
                        buildRendering : function wrighting_work_List__buildRendering() {
                            this.inherited(arguments);
                        },
                        postCreate : function wrighting_work_List_postCreate() {
                            try {

                                var targetRoot;
                                targetRoot = "/alfresco/api/-default-/public/cmis/versions/1.1/browser";

                                this.cmisStore = new CmisStore({
                                    base : targetRoot,
                                    succinct : true
                                });

                                //t.cm:title is the value used
                                this.cmisStore.excludeProperties.push('cm:title');
                                this.cmisStore.excludeProperties.push('cmis:description');
                                this.cmisStore.excludeProperties.push('cm:description');


                                var formatFunction = function(data) {

                                    if (data != null) {
                                        if (typeof data === "undefined" || typeof data.value === "undefined") {
                                            return data;
                                        } else {
                                            return data.value;
                                        }
                                    } else {
                                        return "";
                                    }
                                };

                                var formatLinkFunction = function(text, data) {

                                    if (text != null) {
                                        if (typeof text === "undefined" || typeof text.value === "undefined") {
                                            if (data['alfcmis:nodeRef']) {
                                                return '' + text + '';
                                            } else {
                                                return text;
                                            }

                                        } else {
                                            return text.value;
                                        }
                                    } else {
                                        return "";
                                    }
                                };
                                this.grid = new (declare([dgrid,Editor]))(
                                        {
                                            store : this.cmisStore,
                                            query : {
                                                'statement' : 'SELECT * FROM wrighting:workFolder ' +
                                                    'join cm:titled as t on cmis:objectId = t.cmis:objectId',
                                            },
                                            columns : [
                                                       {
                                                           label : this.message("work.id"),
                                                           field : "cmis:name",
                                                           formatter : formatLinkFunction
                                                       }, 
                                                       {
                                                           label : this.message("work.schedule"),
                                                           field : "p.work:onSchedule",
                                                           autoSave : true,
                                                           editor : "checkbox",
                                                           get : function(rowData) {
                                                               var d1 = rowData["p.work:onSchedule"];
                                                               if (d1 == null) {
                                                                   return false;
                                                               }
                                                               var date1 = d1[0];
                                                               return (date1);
                                                           }
                                                       }, {
                                                           label : this.message("work.title"),
                                                           field : "t.cm:title",
                                                           autoSave : true,
                                                           formatter : formatFunction,
                                                           editor : "text"
                                                        }, {
                                                            label : this.message('work.submitted.date'),
                                                            field : "p.work:submittedDate",
                                                           editor: DateTextBox,
                                                           autoSave : true,
                                                           get : function(rowData) {
                                                               var d1 = rowData["p.work:submittedDate"];
                                                               if (d1 == null) {
                                                                   return null;
                                                               }
                                                               var date1 = new Date(d1[0]);
                                                               return (date1);
                                                           },
                                                           set : function(rowData) {
                                                               var d1 = rowData["p.work:submittedDate"];
                                                               if (d1) {
                                                                   return d1.getTime();
                                                               } else {
                                                                   return null;
                                                               }
                                                           }
                                                       }
                                                       ]
                                        }, this.wrighting_work_table);
                                this.grid.startup();
                            
                            } catch (err) {
                                //console.log(err);
                            }

                        }
                    });
        });

 

OpenLDAP – some installation tips

These are some tips for installing OpenLDAP – you can get away without these but it’s useful stuff to know. This relates to Ubuntu 14.04.

Database configuration

It’s a good idea to configure your database, otherwise it (especially the log files) can grow significantly over time if you’re running a lot of operations.


dn: olcDatabase={1}hdb,cn=config
changetype: modify
add: olcDbConfig
olcDbConfig: set_cachesize 0 2097152 0
olcDbConfig: set_lk_max_objects 1500
olcDbConfig: set_lk_max_locks 1500
olcDbConfig: set_lk_max_lockers 1500
olcDbConfig: set_lg_bsize 2097512
olcDbConfig: set_flags DB_LOG_AUTOREMOVE
-
add: olcDbCheckpoint
olcDbCheckpoint: 1024 10

In particular note how the checkpoint is set – without it the logs won’t be removed. There are quite a few references on the internet to setting it as part of the olcDbConfig but that doesn’t work.


ldapmodify -Y EXTERNAL -H ldapi:/// -f dbconfig.ldif

These values will be stored in /var/lib/ldap/DB_CONFIG, and also updated if changed. This should avoid the need to use any of the Berkeley DB utilities.

It’s also possible to change the location of the database and log files, but don’t forget that you’ll need to update the apparmor configuration as well.
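For example, a rough sketch assuming the database has been moved to a hypothetical /srv/ldap directory; on Ubuntu you would add something like the following to the slapd AppArmor profile (usually via /etc/apparmor.d/local/usr.sbin.slapd) and then reload it:

# allow slapd to use the relocated database directory (hypothetical path)
/srv/ldap/ r,
/srv/ldap/** rwk,

sudo apparmor_parser -r /etc/apparmor.d/usr.sbin.slapd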

Java connection problems

If you are having problems connecting over ldaps using java (it’s always worth checking with ldapsearch on the command line first) then it might be that the default JRE only ships with limited-strength crypto policy files – see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html

You need to copy local_policy.jar and US_export_policy.jar from the download into jre/lib/security e.g.

cp *.jar /usr/lib/jvm/java-8-oracle/jre/lib/security/

You’ll need to do this again after an update to the jre.

Passwords

If you are doing a lot of command line ldap operations it can be helpful to use the -y option with a stored password file.
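For example, a minimal sketch; the admin DN and filter are placeholders in the style of the command reference at the end of this post:

# store the password with no trailing newline and restrict the permissions
echo -n 'secret' > /root/.ldappw
chmod 600 /root/.ldappw
# then pass the file with -y instead of typing the password with -W
ldapsearch -H ldapi:/// -x -y /root/.ldappw -D 'cn=admin,dc=mydomain,dc=com' -b 'dc=mydomain,dc=com' '(uid=someuser)'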

Defaults

Don’t forget to edit the value of SLAPD_SERVICES in /etc/default/slapd to contain the full hostname if you are connecting from elsewhere. Using the IP address is recommended if you want to avoid problems with domain name lookups.
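For example, in /etc/default/slapd (the IP address here is just a placeholder):

SLAPD_SERVICES="ldap://127.0.0.1:389/ ldaps://10.0.0.5:636/ ldapi:///"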

memberOf

The memberOf overlay doesn’t seem that reliable in a clustered configuration, so it may be necessary to remove members from a group and re-add them to get it to populate.
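For example, a sketch with hypothetical group and member DNs; removing and then re-adding the member should make the overlay recalculate memberOf (apply with ldapmodify as in the command reference below):

dn: cn=mygroup,ou=groups,dc=mydomain,dc=com
changetype: modify
delete: member
member: uid=someuser,ou=users,dc=mydomain,dc=com

dn: cn=mygroup,ou=groups,dc=mydomain,dc=com
changetype: modify
add: member
member: uid=someuser,ou=users,dc=mydomain,dc=com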

Mapping groupOfNames to posixGroup

See this serverfault article, which uses the rfc2307bis schema.
You need to replace the nis schema, so first of all find out the dn of the existing nis schema

slapcat -n 0 | grep 'nis,cn=schema,cn=config'

This will give you something like dn: cn={2}nis,cn=schema,cn=config
Now you need to modify the rfc2307bis.ldif so that you can use ldapmodify. This is a multi-stage process.
First change the schema

dn: cn={2}nis,cn=schema,cn=config
changetype: modify
replace: olcAttributeTypes
....
-
replace: olcObjectClasses
....

It’s still got the original name at this point so let’s change that as well

dn: cn={2}nis,cn=schema,cn=config
changetype: modrdn
newrdn: cn={2}rfc2307bis
deleteoldrdn: 1

A quick check using slapcat gives an error!

/etc/ldap/slapd.d: line 1: substr index of attribute "memberUid" disallowed
573d83c3 config error processing olcDatabase={1}hdb,cn=config: substr index of attribute "memberUid" disallowed
slapcat: bad configuration file!

so another ldapmodify to fix this – I’ll just remove it for now but it would be better to index member instead.

dn: olcDatabase={1}hdb,cn=config
changetype: modify
delete: olcDbIndex
olcDbIndex: memberUid eq,pres,sub

groupOfNames and posixGroup objectClasses can now co-exist.

On a client machine you will need to add the following to /etc/ldap.conf

nss_schema rfc2307bis
nss_map_attribute uniqueMember member

This isn’t entirely intuitive! You might expect nss_map_attribute memberUid member; while that sort of works, it doesn’t resolve the dn to the uid of the user and is therefore effectively useless.

Dynamic groups

Make sure you check the N.B.!
I tried this for mapping groupOfNames to posixGroup but it doesn’t work for that use case, however it’s potentially useful so I’m still documenting it.
You need to load the dynlist overlay (with ldapadd)

dn: cn=module{0},cn=config
changetype: modify
add: olcModuleLoad
olcModuleLoad: dynlist

then configure the attribute set so that the uid maps to memberUid

dn: olcOverlay=dynlist,olcDatabase={1}hdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcDynamicList
olcOverlay: dynlist
olcDlAttrSet: posixGroup labeledURI memberUid:uid

You then need to add the objectClass labeledURIObject to your posixGroup entry and define the labeledURI e.g.

ldap:///ou=users,dc=yourdomain,dc=com?uid?sub?(objectClass=posixAccount)
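For example, a sketch assuming a hypothetical group cn=mygroup (apply with ldapmodify):

dn: cn=mygroup,ou=groups,dc=yourdomain,dc=com
changetype: modify
add: objectClass
objectClass: labeledURIObject
-
add: labeledURI
labeledURI: ldap:///ou=users,dc=yourdomain,dc=com?uid?sub?(objectClass=posixAccount)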

Now if you search in ldap for your group it will list the memberUid that you expect.
You can run getent group mygroup and it will report the members of that group correctly.
N.B. For practical purposes this doesn’t actually work; see this answer on StackOverflow
This post describing using the rfc2307bis schema for posix groups looks interesting as well.

Running in debug

e.g.

/usr/sbin/slapd -h ldapi:/// -d 16383 -u openldap -g openldap

Client set up

Make sure the box can access the LDAP servers

Add the server to inbound security group rules e.g. 636 <ipaddress>/32
apt-get install ldap-utils

Optionally test with

ldapsearch -H ldaps://sso1.mydomain.com:636/ -D "cn=system,ou=users,ou=system,dc=mydomain,dc=com" -W -b dc=mydomain,dc=com '(objectClass=*)'

Set up a person in LDAP by adding objectClasses posixAccount and ldapPublicKey

apt-get install ldap-auth-client

See /etc/default/slapd on the ldap server

ldaps://sso1.mydomain.com/ ldaps://sso2.mydomain.com/

Make local root Database admin – No
LDAP database require login – Yes
cn=system,ou=users,ou=system,dc=mydomain,dc=com
use password

Settings are in /etc/ldap.conf
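If you need to change any of these answers later it is usually easier to re-run the questions than to edit the file by hand; the debconf questions belong to the ldap-auth-config package, which ldap-auth-client pulls in:

sudo dpkg-reconfigure ldap-auth-config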

If you want home directories to be created then add the following to /etc/pam.d/common-session

session required pam_mkhomedir.so

You can check out autofs-ldap or pam_mount if you’d prefer to mount the directory (this might require rfc2307bis).

Now run the following commands

auth-client-config -t nss -p lac_ldap
pam-auth-update

Now test
# su - myldapaccount

Check /var/log/auth.log if you have problems.

If you want to use LDAP groupOfNames as posixGroups see above.

For ssh keys in LDAP, add the sshPublicKey attribute to the ldap record (multiple keys can be stored), e.g. using the openssh-lpk_openldap schema.
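For example, a sketch for a hypothetical user entry; the ldapPublicKey objectClass and sshPublicKey attribute come from the openssh-lpk schema (apply with ldapmodify):

dn: uid=someuser,ou=users,dc=mydomain,dc=com
changetype: modify
add: objectClass
objectClass: ldapPublicKey
-
add: sshPublicKey
sshPublicKey: ssh-rsa AAAAB3NzaC1yc2E... someuser@laptop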

Make sure ssh server is correctly configured

dpkg-reconfigure openssh-server

Add the following to /etc/ssh/sshd_config – both are needed, then create the file using the contents below

Restart the ssh service after doing both steps and check that it has restarted (pid given in the start message)


AuthorizedKeysCommand /etc/ssh/ldap-keys.sh
AuthorizedKeysCommandUser nobody

The contents of /etc/ssh/ldap-keys.sh are shown below.
You can restrict access by modifying the ldapsearch command.
Access can also be restricted by using the host field in the ldap user record, but that’s more complicated.

The script must only be writeable by root.
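One way to set that up, making the script executable but writeable only by root (path as configured above):

chown root:root /etc/ssh/ldap-keys.sh
chmod 0755 /etc/ssh/ldap-keys.sh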


#!/bin/bash

uri=`grep uri /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`
binddn=`grep binddn /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`
bindpw=`grep bindpw /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`
base=`grep base /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`

TMPFILE=/tmp/$$

# try each ldap server listed on the uri line in turn
for u in `grep uri /etc/ldap.conf | egrep -v ^# | awk '{for (i=2; i<=NF; i++) print $i}'`
do
    ldapsearch -x -H ${u} \
        -w "${bindpw}" -D "${binddn}" \
        -b "${base}" \
        '(&(objectClass=posixAccount)(uid='"$1"'))' \
        'sshPublicKey' > $TMPFILE
    RESULT=$?
    # base64 encoded keys appear as sshPublicKey:: in the LDIF output
    grep sshPublicKey:: $TMPFILE > /dev/null
    if [ $? -eq 0 ]
    then
        sed -n '/^ /{H;d};/sshPublicKey::/x;$g;s/\n *//g;s/sshPublicKey:: //gp' $TMPFILE | base64 -d
    else
        sed -n '/^ /{H;d};/sshPublicKey:/x;$g;s/\n *//g;s/sshPublicKey: //gp' $TMPFILE
    fi
    if [ $RESULT -eq 0 ]
    then
        rm -f $TMPFILE
        exit
    fi
done
rm -f $TMPFILE

Command reference

Search
ldapsearch -H ldapi:/// -x -y /root/.ldappw -D 'cn=admin,dc=mydomain,dc=com' -b 'dc=mydomain,dc=com' "(cn=usersAdmin)"

Note that the syntax of the LDIF files for the next two commands is somewhat different

Adding entries
ldapadd -H ldapi:/// -x -y ~/.ldappw -D 'cn=admin,dc=mydomain,dc=com' -f myfile.ldif

Making changes
ldapmodify -Y EXTERNAL -H ldapi:/// -f myfile.ldif

Recursively removing a sub-tree
ldapdelete -H ldapi:/// -x -y ~/.ldappw -D "cn=admin,dc=mydomain,dc=com" -r "ou=tobedeleted,dc=mydomain,dc=com"

Checking databases used
ldapsearch -H ldapi:// -Y EXTERNAL -b "cn=config" -LLL -Q "olcDatabase=*" dn

A dojo store for the cmis browser binding

First of all why am I doing this?

Dojo is a popular javascript library which is used extensively and, of particular interest, is coming to more prominence within Alfresco.

dojo.store is based on the HTML5/W3C IndexedDB object store API and is useful because stores can provide the data access methods for a wide range of dojo/dijit widgets, making it easy to visualize data in any number of ways.

CMIS is a standard used to access content stored in a repository, such as Alfresco, and, particularly with the advent of the browser binding in CMIS 1.1, it makes it possible to manage information within that repository using a series of HTTP requests.

While the CMIS API is relatively straightforward there are some wrinkles, particularly with respect to cross-origin requests, so it seems to make sense, allied to the advantages of having the API available as a dojo store, to provide a wrapper for the simple actions at least.

So now that I’ve explained my motivation, on with a brief description and some basic examples. (These are available in example.html in the git repository.)

The first thing to do is to create a store:

var targetRoot = 
       "https://localhost/alfresco/api/-default-/public/cmis/versions/1.1/browser";

this.cmisStore = new CmisStore({
                                 base: targetRoot,
                                 succinct: true
                                });

The first thing to notice is the value of base: there is no /root at the end, as this will be automatically appended by the store (use the root option if you need to change this).

Next we need to attach it to something; I’m going to use dijit.Tree here.

We’ll need to provide a mapping to the ObjectStoreModel – I’ll also wrap it in Observable so we can see any changes. (This doesn’t work brilliantly for dijit.Tree as it puts in a duplicate node at the root level as well as in the right place – I haven’t worked out why yet – probably something to do with not knowing the parent.)

You’ll see the query parameter which is used to determine what to fetch – this could be a path to a folder

We also have to provide a function to determine whether it’s a leaf node – contrary to the documentation this function isn’t called if it’s defined in the CmisStore (and neither is getLabel)

    this.treeStore = new Observable(this.cmisStore);
    // Create the model
    var model = new ObjectStoreModel({
        store: this.treeStore,
        query: { path: this.path, cmisselector: 'object'},
        labelAttr: "cmis:name",
        mayHaveChildren : function(data) {
                    if (data['cmis:baseTypeId'] == 'cmis:folder') {
                        return true;
                    } else {
                        return false;
                    }
        }
    });

Now that we’ve done that we can create our Tree.

   // Create the Tree.
    this.tree = new Tree({
        model: model
    });
    this.tree.placeAt("foldertree");
    this.tree.startup();

That’s it – you’ll have a nice tree representation of your CMIS repository – it’s as easy to use other widgets like one of the data grids – plenty of scope to get creative! (e.g. https://github.com/speich/remoteFileExplorer)

Making changes

Here you can see some code to add a folder.
First you fetch the parent folder; this can be done either by path or by objectId. If the parameter contains a / or { usePath: true } is set as the second (options) parameter then it’s treated as a path, otherwise as an objectId.

This object is then set as parent in the options parameter of the store.add call as shown in the example.

Finally once the folder has been added the grid can be refreshed to show the new folder.

You’ll see that there’s a formatter function to take account of whether the succinct version of the CMIS response is being used; note this only handles simple values.

lang.hitch is used so that the function called inside the “then” has access to the same context.

addFolder: function addFolder() {
    //Get the test folder to be the parent of the new folder
    this.cmisStore.get(this.testFolder).then(lang.hitch(this, function(result) {

        this.cmisStore.add({
            'cmis:objectTypeId' : 'cmis:folder',
            'cmis:name' : this.tempFolder
        }, {
            parent : result
        }).then(lang.hitch(this, function(result) {
            //Do something
        }));

    }), function(error) {
        console.log(error);
    }, function(progress) {
        console.log(progress);
    });
},

Updating

Making changes to your content is very easy – the store handles it – so all you’ve got to do is make sure your widget handles the data correctly.

One way you might like to edit your data is via dgrid – this makes it extremely straightforward to add an editor to a column e.g.

  editor({
       label : ("pub.title"),
       field : "t.cm:title",
       formatter : formatFunction,
       editor : "text",
       autoSave : true
        })

One thing you will notice is that my field is called t.cm:title; this is because the query on my OnDemandGrid is defined like this:

  query : {
    'statement' : 'SELECT * FROM cmis:folder ' +
                  'join cm:titled as t on cmis:objectId = t.cmis:objectId'
  }

The code inside the store put method will strip off the leading alias, i.e. everything up to and including the ‘.’.

You need to be aware that not all properties can be updated via CMIS. The store has a couple of ways of handling this: it works from either a list of excluded properties or a list of allowed properties, which is determined by the putExclude property (set to true or false).

If you are working with custom properties then you may need to modify the list – this can be done by modifying the excludeProperties or allowedProperties members of the store e.g.

  cmisStore.excludeProperties.push('cm:title');

Note this works on the absolute property name, not the namespace trimmed value.

The store will post back the entire object, not just the changed properties, so you either need to make sure that the value is valid or exclude the property.

Error handling isn’t covered here and will depend on which widget you’re using.

Dates

For handling dates you need to convert between the CMIS date (milliseconds since the epoch) and a javascript Date object as part of the editor definition.
Use dijit/form/DateTextBox as your editor widget.

 editor({
     field : 'p.myns:myDate',
     autoSave : true,
     get: function(rowData) {
         var d1 = rowData["p.myns:myDate"];
         if (d1 == null) {
             return null;
         }
         var date1 = new Date(d1[0]);
         return(date1);
     },
     set: function (rowData) {
         var d1 = rowData["p.myns:myDate"];
         if (d1) {
             return d1.getTime();
         } else {
             return null;
         }
     } 
 }, DateTextBox),

The CMIS server will error if it is sent an empty string as a datetime value so in order to avoid this the CmisStore will not attempt to send null values.

Select

For a simple select just use a dijit/form/Select widget as the basis for your editor and set the options using the editorArgs e.g.

editor({
 label : ("pub.type"),
 field : "p.cgghPub:type",
 editorArgs : {
    options : [
                 { label : "one", value : "1"}
              ]
              }
}, Select)

MultiSelect

MultiSelect doesn’t have the luxury of using options in the constructor – the easiest way I found is to create your own widget and use that e.g.

declare("CategoryMultiSelect", MultiSelect, {
                                    size : 3,
                                    postCreate : function() {
                                        domConstruct.create('option', {
                                            innerHTML : 'cat1',
                                            value : 'cat1'
                                        }, this.domNode);
                                        domConstruct.create('option', {
                                            innerHTML : 'cat2',
                                            value : 'cat2'
                                        }, this.domNode);
                                        domConstruct.create('option', {
                                            innerHTML : 'cat3',
                                            value : 'cat3'
                                        }, this.domNode);
                                        this.inherited(arguments);
                                    }
                                });

Other information:

The most useful settings for query are either a string, representing a path or object id, or an object containing either/both of the members path and statement where statement is a CMIS query e.g. SELECT * FROM cmis:document.

The store uses deferred functions to manipulate the query response so that either the succinctProperties or the properties object for each item is returned; if you’re not using succinct (the default) then make sure you get the value for your property.

The response information is retrieved by making a second call to get the transaction information.

Add actually makes three calls to the server (add, retrieve the transaction, and then fetch the new item). Although it’s not documented, it seems that Tree at least expects the response from add to be the created item.

The put method only allows you to update a limited number of properties (note cmis:description is not the same as cm:description) and returns the CMIS response rather than the modified object

Remove makes the second call despite the fact that it’s not in a transaction – this allows response handling to happen.

For documentation there is the Dojo standard – I did also consider using JSDoc but decided to stick with Dojo format

There are some tests written with Intern however they are fairly limited – not least because there’s a very simple pseudo CMIS server used.

Getting started with Hadoop 2.3.0

Googling will get you instructions for the old version so here are some notes for 2.3.0

Note that there appears to be quite a difference with version 2 although it is supposed to be mostly compatible

You should read the whole post before charging off and trying any of this stuff as you might not want to start at the beginning!

References
http://codesfusion.blogspot.co.uk/2013/10/setup-hadoop-2x-220-on-ubuntu.html
which has a script at: https://github.com/ericduq/hadoop-scripts – this is good but needs changes around the downloading of the hadoop file – be careful if you run it more than once

Changes from the blog (not necessary if using the script)

in ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

sudo ssh hduser@localhost -i /home/hduser/.ssh/id_rsa

If start-dfs.sh gives errors
try

hdfs getconf -namenodes

If you see the following:

OpenJDK 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/03/13 15:27:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Try the following in /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

(Although this should work in ~/.bashrc it appears not to)

Using files in HDFS

#Create a directory and copy a file to and fro
hadoop fs -mkdir -p /user/hduser
hadoop fs -copyFromLocal someFile.txt someFile.txt

hadoop fs -copyToLocal /user/hduser/someFile.txt someFile2.txt

#Get a directory listing of the user's home directory in HDFS

hadoop fs -ls

#Display the contents of the HDFS file /user/hduser/someFile.txt
hadoop fs -cat /user/hduser/someFile.txt

#Delete the file
hadoop fs -rm someFile.txt

Doing something

Context is everything so what am I trying to do?

I am working with VCF (Variant Call Format) files which are used to hold genetic information – I won’t go into details as it’s not very relevant here.

VCF is a text file format. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.

Hadoop itself is written in Java so the natural choice for interacting with it is to use a Java client and while there is a VCF reader in GATK (see http://plindenbaum.blogspot.fr/2012/11/readingwriting-vcf-file-with-gatk-api.html) it is more common to use python.

Tutorials in Data-Intensive Computing gives some great, if incomplete at this time, advice on using Hadoop Streaming together with pyvcf (there’s some nice stuff on using Hadoop on a more traditional cluster as well which is an alternative to the methods described above)

Pydoop provides an alternative to Streaming via hadoop pipes but seems not to have quite caught up with the current state of play.

Another possibility is to use Jython to translate the python into java see here

One nice thing about using Streaming is that it’s fairly easy to do a comparison between a Hadoop implementation and a traditional implementation.

So here are some numbers (using the parsevcf.py from the Data-Intensive Computing tutorial)

Create the header file

parsevcf.py -b data.vcf > header.txt

Pipes

date;$(which python) $PWD/parsevcf.py -m $PWD/header.txt,0.30 < data.vcf |
$(which python) $PWD/parsevcf.py -r > out;date

Hadoop (Single node on the same computer)

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -mapper "$(which python) $PWD/parsevcf.py -m $PWD/header.txt,0.30" -reducer "$(which python) $PWD/parsevcf.py -r" -input $PWD/vcfparse/data.vcf -output $PWD/vcfparse/output

The output files contain the same data however the rows are held in a different order.

When running on a cluster we’ll need to use the -combiner option and -file to ship the scripts to the cluster.
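As a rough, untested sketch of what that might look like: -file ships the script and header file to each node (so the paths in the mapper/reducer commands become relative), and it assumes the reduce step in parsevcf.py can also be used as a combiner, which needs checking:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar \
  -file $PWD/parsevcf.py -file $PWD/header.txt \
  -mapper "python parsevcf.py -m header.txt,0.30" \
  -combiner "python parsevcf.py -r" \
  -reducer "python parsevcf.py -r" \
  -input $PWD/vcfparse/data.vcf -output $PWD/vcfparse/output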

The MapReduce framework orders the keys in the output; if you’re doing this in Java you will get an iterator for each key but, obviously, not when you’re streaming.

Running locally with a 1020M test input file seems to indicate a good speed up (~2 mins vs ~6 mins). Now that I’ve tried it with a relatively small file it’s time to scale up a bit to a 12G file, moving to an 8 processor VM (slower disk). That’s not an ideal test machine, but it’s what I’ve got easily to hand and is better than using my desktop where there are other things going on.

Results

You can look at some basic statistics via http://localhost:8088/

Note that it does take a while to copy the file to/from the Hadoop file system which is not included here

Number of splits: 93

Method      | Map Jobs | Reduce Jobs | Time
pipes       | N/A      | N/A         | 2 hours 46 mins 17 secs
Single Node | Default  | Default     | 49 mins 29 secs
Single Node | 4        | 4           | 1 hr 14 mins 51 secs
Single Node | 6        | 2           | 1 hr 6 secs
Single Node | 2        | 6           | 1 hr 13 mins 25 secs

An example using streaming, Map/Reduce with a tab based input file

Assuming you’ve got everything set up

Start your engines

If necessary

start-dfs.sh
start-yarn.sh

dfs is the file system

yarn is the job scheduler

Copy your input file to the dfs


hadoop fs -mkdir -p /user/hduser
hadoop fs -copyFromLocal someFile.txt data
hadoop fs -ls -h

The task

The aim is to calculate the variant density using a particular window on the genome.

This is a slightly more complex version of the classic “hello world” of hadoop – the word count.

Input data

The input file is a tab delimited file containing one line for each variant: 28G, over 95,000,000 lines.

We are interested in the chromosome, position and whether the PASS filter has been applied.

The program

First we need to work out which column contains the PASS filter – awk is quite helpful to check this

head -1 data | awk -F'\t' '{print $18}'

(Remember awk counts from 1 not 0)

The mapper

For the mapper we will build a key/value pair for each line – the key is a combination of the chromosome and bucket (1kb window) and the value a count and whether it passes/fails (we don’t really need the count…)

#!/usr/bin/env python

import sys

for line in sys.stdin:
    cells = line.split('\t')
    chrom = cells[0]
    pos = cells[1]
    pass_filter = None
    if (cells[17] == "True"):
      pass_filter = True
    if (cells[17] == "False"):
      pass_filter = False
    if (pass_filter is not None):
      # 1kb window: bucket the position and derive a representative point for the key
      bucket = int(int(pos)/1000)
      point = (bucket + 1) * (1000 / 2)
      # key is "<chrom>-<point>", value is "<count>-<pass/fail>", separated by a tab
      print ("%s-%d\t1-%s" % (chrom, point, str(pass_filter)))

 

You can easily test this on the command line using pipes e.g.

head -5 data | python mapper.py

The reducer

The reducer takes the output from the mapper and merges it according to the key

Test again using pipes

 

#!/usr/bin/env python

import sys

last_key = None
running_total = 0
passes = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    variant, pass_filter = value.split('-')
    if last_key == this_key:
        running_total += int(variant)
        if (pass_filter == "True"):
          passes = passes + 1
    else:
        if last_key:
            chrom, pos = last_key.split('-')
            print( "%s\t%s\t%d\t%d" % (chrom, pos, running_total, passes) )
        running_total = int(variant)
        if (pass_filter == "True"):
          passes = 1
        else:
          passes = 0
        last_key = this_key

# don't forget to output the final key
if last_key == this_key:
    chrom, pos = last_key.split('-')
    print( "%s\t%s\t%d\t%d" % (chrom, pos, running_total, passes) )

 

 

head -5 data | python mapper.py | python reducer.py

 

Running

Note the mapper and reducer scripts are on the local file system and the input and output files are in HDFS

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -mapper "$(which python) $PWD/mapper.py" -reducer "$(which python) $PWD/reducer.py" -input data -output output

Copy the output back to the local file system

hadoop fs -copyToLocal output

If you want to sort the output then the following command does a nice job

sort -V output/part-00000

Don’t forget to clean up after yourself


hadoop fs -rm -r output
hadoop fs -rm data

Now we’ve got the job running we can start to look at making it go faster. The first thing to try is to increase the number of tasks. I’m using an 8 processor VM so I’ll try 4 of each to start with (the property names for doing this have changed).

hadoop fs -rm -r output
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -D mapreduce.job.reduces=4 -D mapreduce.job.maps=4 -mapper "$(which python) $PWD/mapper.py" -reducer "$(which python) $PWD/reducer.py" -input data -output output

Looking at the output I can see

INFO mapreduce.JobSubmitter: number of splits:224

This seems to indicate that I could usefully go up to 224(*2) jobs if I had enough cores free

Results

Method      | Map Jobs | Reduce Jobs | Time
pipes       | N/A      | N/A         | 31 mins 31 secs
Single Node | Default  | Default     | 39 mins 6 secs
Single Node | 4        | 4           | 28 mins 18 secs
Single Node | 5        | 3           | 33 mins 1 sec
Single Node | 3        | 5           | 31 mins 8 secs
Single Node | 5        | 5           | 28 mins 40 secs
Single Node | 6        | 2           | 49 mins 55 secs

Conclusion

From these brief experiments it looks like there is no point using the Map/Reduce framework for trivial tasks even on large files.

A more positive result is that it looks like there may well be some advantage for more complex tasks and this merits some further investigation as I’ve only scratched the surface here.

Some things to look at are:

The output won’t be in the same order as the input so if this is important Hadoop streaming has Comparator and Partitioner to help sort results from the map to the reduce
You can decide to split the map outputs based on certain key fields, not the whole keys; see the Hadoop Partitioner class
See docs for 1.2.1 here

How do I generate output files with gzip format?

Instead of plain text files, you can generate gzip files as your output. Pass '-D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as options to your streaming job.
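For example, adding those options to the streaming job used earlier (the -D options must come before the streaming options):

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -mapper "$(which python) $PWD/mapper.py" -reducer "$(which python) $PWD/reducer.py" \
  -input data -output output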

How to use a compressed input file