Simple Kibana monitoring (for Alfresco)

This post is inspired by the alfresco-monitoring project, and there’s a lot of useful info in there.

The aim of this post is to allow you to quickly get up and running and monitoring your logs.

I’m using puppet to install, even though we don’t have a puppet master, as there are modules provided by Elastic that make it easy to install and configure the infrastructure.

If you’re not sure how to use puppet, look at my post Puppet – the very basics

Files to go with this post are available on github

I’m running on Ubuntu 16.04 and at the end have

  • elasticsearch 5.2.1
  • logstash 1.5.2
  • kibana 5.2.1

The kibana instance will be running on port 5601

Elastic Search

Elastic Search puppet module
Logstash puppet module

puppet module install elastic-elasticsearch --version 5.0.0
puppet module install elastic-logstash --version 5.0.4

The manifest file

class { 'elasticsearch':
  java_install => true,
  manage_repo  => true,
  repo_version => '5.x',
}

elasticsearch::instance { 'es-01': }
elasticsearch::plugin { 'x-pack': instances => 'es-01' }

include logstash

# You must provide a valid pipeline configuration for the service to start.
logstash::configfile { 'my_ls_config':
  content => template('wrighting-logstash/logstash_conf.erb'),
}

logstash::plugin { 'logstash-input-beats': }
logstash::plugin { 'logstash-filter-grok': }
logstash::plugin { 'logstash-filter-mutate': }

Configuration – server

puppet apply --verbose --detailed-exitcodes /etc/puppetlabs/code/environments/production/manifests/elk.pp


Configuration – client

This is a fairly big change over the alfresco-monitoring configuration as it uses beats to publish the information to the logstash instance running on the server.

For simplicity I’m not using redis.

The official Elastic documentation has more information, or just use the code below.

wget -qO - | sudo apt-key add -
echo "deb stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
apt-get update
apt-get install filebeat metricbeat

Partly for convenience, I chose to install both beats on the ELK server and connect them directly to elasticsearch (the default) before installing elsewhere. This has the advantage of automatically loading the elasticsearch template file.

You should normally disable the elasticsearch output and enable the logstash output if you are sending tomcat log files.


curl -L -O
sudo dpkg -i filebeat-5.2.1-amd64.deb
Note that this configuration relies on the change made to the tomcat access log configuration in server.xml:

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
       prefix="access-" suffix=".log"
       pattern='%a %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" %D "%I"' />

Edit /etc/filebeat/filebeat.yml

filebeat.prospectors:
# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.
- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/tomcat7/access-*.log
  tags: [ "TomcatAccessLog" ]

- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/tomcat7/alfresco.log
  tags: [ "alfrescoLog" ]

- input_type: log
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/tomcat7/share.log
  tags: [ "shareLog" ]

output.logstash:
  hosts: [""]

Don’t forget to start the service!

If you are using the filebeat apache2 module then check your error.log, as you may need to configure access for the apache2 status module.

Metric Beat

Check the documentation, but it’s probably OK to mostly leave the defaults.

Port forwarding

If you need to set up port forwarding the following will do it.
Edit .ssh/config

Host my.filebeats.client
RemoteForward my.filebeats.client:5044 localhost:5044

ssh -N my.filebeats.client &
Note that you will need to restart the tunnel if you change or restart logstash.

Logstash config

Look at the logstash_conf.erb file.

Changes from alfresco config

  • You will need to change [type] to [tags]
  • multi-line is part of the input, not the filters – note this could be done in the filebeat config
  • jmx filters removed as I’m using community edition
  • system filters removed as I’m using the metricbeat supplied configuration

Exploring ElasticSearch
View the indexes
curl 'localhost:9200/_cat/indices?v'
Look at some content – defaults to 10 results
curl 'localhost:9200/filebeat-2017.02.23/_search?q=*&pretty'
Look at some content with a query
curl -XPOST 'localhost:9200/filebeat-2017.02.23/_search?pretty' -d@query.json

"query": { "match": { "tags": "TomcatAccessLog"} },
"size": 10,
"sort": { "@timestamp": { "order": "desc"}}


Set up

puppet module install cristifalcas-kibana --version 5.0.1
This gives you a kibana install running on http://localhost:5601

Note that this is Kibana version 5

If you are having trouble with fields not showing up, try Management -> Index Patterns -> refresh, and/or reload the templates:

curl -XPUT 'http://localhost:9200/_template/metricbeat' -d@/etc/metricbeat/metricbeat.template.json 
curl -XPUT 'http://localhost:9200/_template/filebeat' -d@/etc/filebeat/filebeat.template.json 

A good place to start is to load the Default beats dashboards


There are no filebeat dashboards for v5.2.1 – there are some for later versions but these are not backwards compatible

My impression is that this is an area that will improve with new releases in the near future (updates to the various beats)
To install from git instead:

git clone
cd beats
git checkout tags/v5.2.1
/usr/share/filebeat/scripts/import_dashboards -dir beats/filebeat/module/system/_meta/kibana/
/usr/share/filebeat/scripts/import_dashboards -dir beats/filebeat/module/apache2/_meta/kibana/


The dashboards can be imported from the github repository referenced at the top of the article.

Changes from alfresco-monitoring:

  • No system indicators – relying on the default beats dashboards
  • All tomcat servers are covered by the dashboard – this allows you to filter by node name in the dashboard (and no need to edit the definition files)
  • No jmx


X-Pack is also useful because it allows you to set up alerts

The puppet file shown will install X-Pack in elasticsearch

To install in kibana
(I have not managed to get this working, possibly due to not configuring authentication, and it breaks kibana)
sudo -u kibana /usr/share/kibana/bin/kibana-plugin install x-pack

Not done

This guide doesn’t show how to configure any form of authentication.

Adding JMX

It should be reasonably straightforward to add JMX indicators but I’ve not yet done so.

Puppet – the very basics

Installing puppet

apt-get -y install ntp
dpkg -i puppetlabs-release-pc1-xenial.deb
gpg --keyserver --recv-key 7F438280EF8D349F
apt-get update

Puppet agent

apt-get install puppet-agent
Edit /etc/puppetlabs/puppet/puppet.conf to add the [master] section

puppet agent -t -d

Note this does apply the changes!

Then you can start the puppet service

export PATH=$PATH:/opt/puppetlabs/bin

Puppet server

apt-get install puppetserver
You will probably also want to run apt-get install puppetdb puppetdb-termini and start the puppetdb service after configuring the PostgreSQL database.

To look at clients:
puppet cert list --all

To validate a new client run puppet cert --sign client

Using puppet locally

You don’t need to use a puppet server

Installing modules

The module page will tell you how to do this e.g.
puppet module install module_name --version version

However it’s probably a better idea to use librarian-puppet and define the modules in the Puppetfile

apt-get install ruby
gem install librarian-puppet
cd /etc/puppetlabs/code/environments/production
librarian-puppet init

Once you have edited your Puppetfile

librarian-puppet install
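For the ELK setup described above, the Puppetfile might look something like this sketch (module versions are taken from the install commands earlier; whether you also pin kibana here depends on your setup):

```ruby
forge "https://forge.puppet.com"

mod 'elastic-elasticsearch', '5.0.0'
mod 'elastic-logstash', '5.0.4'
mod 'cristifalcas-kibana', '5.0.1'
```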

Running a manifest

Manual run

puppet apply --verbose --detailed-exitcodes /etc/puppetlabs/code/environments/production/manifests/manifest.pp

Managed run

The control of what is installed is via the file /etc/puppetlabs/code/environments/production/manifests/site.pp
Best practice indicates that you use small modules to define your installation.
Example site.pp

node default {
   include serverdefault
}

node mynode {
}

You can then check it with puppet agent --test --noop
Note that --test on its own actually applies the changes.

Running over ssh

Not recommended!
For your ssh user create .ssh/config which contains the following:

Host *
	RemoteForward %h:8140 localhost:8140

you can then set up a tunnel via ssh -N client & (assuming that you can ssh to the client as normal!)

On the client you then need to define the puppet server as localhost in /etc/hosts; then run puppet agent --test --server puppetserver as usual.
Then you can run the agent as usual – don’t forget to start the service.


With the current version you should use an environment specific hiera.yaml e.g. /etc/puppetlabs/code/environments/production/hiera.yaml

The recommended approach is to use roles and profiles to define how each node should be configured (out of scope of this post)


See the hiera-eyaml documentation, but check your puppet version.

gem install hiera-eyaml
puppetserver gem install hiera-eyaml


Test using something like puppet apply -e '$var = lookup(config::testprop) notify {$var: }' where config::testprop is defined in your secure.eyaml file

Host specific config

The hieradata for this node is defined in the hierarchy as:

Groups of nodes

You can use multiple node names or a regex in your site.pp (remember only one node definition will be matched)

Another alternative is to use facts, either existing or custom, to define locations in your hiera hierarchy

If this is too crude then you can use an ENC

A very skeleton python enc program is given below:

#!/usr/bin/env python

import sys
from yaml import dump

# The node name is passed as the first argument by the puppet master
n = sys.argv[1]

node = {
    'parameters': {
        "config::myparam": 'myvalue'
    }
}

dump(node, sys.stdout, indent=10)

Puppet setup

Add the following section to /etc/puppetlabs/puppet/puppet.conf

[agent]
server = puppetmaster
certname = nodename.mydomain
environment = production
runinterval = 1h


Detailed module writing is out of scope of this post but a quick start is as follows:

puppet module generate wrighting-serverdefault

Then edit manifests/init.pp

Upgrade time?

Inspired by conversations I had at the Alfresco BeeCon I’ve decided to put down some of my thoughts and experiences about going through the upgrade cycle.

It can be a significant amount of work to do an upgrade, even if you have little or no customization, as you need to check that none of the functionality you rely on has changed or broken so it’s not something to be undertaken lightly.

In my experience there are several main factors in helping decide whether it’s time to upgrade:

  • Time since last upgrade
  • Security patches
  • Bug fixes you’ve been waiting for
  • Exciting new features

Using the example of Alfresco Community Edition I find that a good time to start thinking about this is when a new EA version has been released. This means that the previous release is about as stable as it’s going to get and new features are starting to be included. I know many people are a lot more conservative than this so you’ll have to think about what works with your organization.

Time since last release

This is often the deciding factor as you don’t want to get too far behind in the release cycle, otherwise upgrading can become a nightmare. In a previous job I observed an upgrade project that took over a year to complete despite having considerable resources thrown at it and not having any significant new features added – mostly because of the large gap in versions (although there were some poor customization decisions)

Security patches

It’s always important to review security patches and apply them when appropriate but this is generally much easier to do if you’re on a recent version so this is an argument for keeping reasonably up to date.

Bug fixes

Sometimes a bug fix will make it into the core product and you can remove it from your customizations (a good thing), sometimes it’s almost like a new feature but sometimes it will expose a new bug. Generally a positive thing to have.

Exciting new features

Shiny new toys! It’s always tempting to get hold of interesting new features but, unless there’s a really good reason that you want it, it’s usually best to wait for it to stabilize before moving to production but this can be a reason for a more aggressive release cycle.

My process

This is a little more Alfresco specific but the general points apply.

OK so there’s a nice new, significant, version out – for the sake of argument let’s say 5.2 – and I’m on version 5.0 in production, so what do I do?

Wait for the SDK to catch up – this is a bit frustrating as sometimes I only have quite a short window to work on Alfresco and if the SDK isn’t working then I’ll have to go and do something else.

I feel that the release should be being built with the SDK but it does tend to lag significantly behind. At the time of writing Alfresco_Community_Edition_201605_GA isn’t supported at all and Alfresco_Community_Edition_201604_GA needs some patching of the SDK while Alfresco_Community_Edition_201606_EA is out. (The SDK 2.2.0 is listed on the release notes page for all of these even though it doesn’t work…)

It’s also a little unclear about what works with what – for example can I run Share 5.1.g (from 201606_EA) with Repo 5.1.g (from 201604_GA)? (which I might be able to make work with the SDK, and I know there are bug fixes I want in Share 5.1.g…) or stick with the Repo 5.1.g/Share 5.1.f combo found in the 201605 GA? (which I can’t build yet)

I should have an existing branch (see below) that is close to working on an earlier EA (or GA) version so in theory I can just update the version number(s) in the pom.xml and rebuild and test. In practice it’s more complicated than that as it’s necessary to go through each customization and check the implications against the code changes in the product (again see below). Sometimes this is easier than others, for example, 5.0.c to 5.0.d seemed like a big change for a minor version increment.

Why create a branch against an EA?

As I mentioned above I’ll try and create a branch against the new EA. Why do this when there’s no chance that I’ll deploy it?

There are a several reasons that I like to do this.

I don’t work with Alfresco all the time so while my thoughts are in that space it’s convenient, and not much slower (see below), to check the customizations against two versions rather than one.

It’s a good time to find and submit bugs – if you find them in the first EA then you’ve got a chance that they’ll be fixed before the GA.

Doing the work against the EA, hopefully, means that when the next GA comes along it won’t be too hard to get ready for a production release.

You get a test instance where you can try out the exciting new features and see if they are good/useful as they sound.

How to check customizations?

This can be a rather time consuming process and, as it’s not something you do very often, it is easy to get wrong.

There are a number of things you might need to check (and I’m sure that there are others)

  • Bean definitions
  • Java changes
  • web.xml

While I’m sure everybody has a good set of tests to check their modifications, it’s unlikely that these will be sufficient.

Bean definitions

You might have made changes, for example, to restrict permissions on site creation, and the default values have changed – in this case extra values were added between 4.2 and 5.0, and 5.0 and 5.1

Java changes

Sometimes you might need to override, or extend, existing classes so you need to see if the original class has changed and if you need to take account of these changes


web.xml

CAS configuration is an example of why you might have changed your web.xml and need to update it.

Upgrade Assistant

I’ve started a project to try and help with the more mechanical aspects of checking customizations. I’ve found it helpful and I hope other people will as well – see github for further details.


Python, MPI and Sun Grid Engine

Really you need to go to the Open MPI documentation first.
Do not apt-get install openmpi-bin without reading it.

To see whether your Open MPI installation has been configured to use Sun Grid Engine:

ompi_info | grep gridengine
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)

If it is not, build Open MPI from source with SGE support:

./configure --with-sge
make install

Do the straightforward virtualenv setup

sudo apt-get install python-virtualenv
virtualenv example
cd example
source bin/activate
pip install numpy
pip install cython

Installing hdf5 with mpi

Install hdf5 from source to ~/install if necessary – the package should be OK

tar zxvf hdf5-1.8.13.tar.gz
cd hdf5-1.8.13
export CC=/usr/local/bin/mpicc
mkdir ~/install
./configure --prefix=/home/${USER}/install --enable-parallel --enable-shared
#make test
make install
#If you want to...
export PATH=/home/${USER}/install/bin:${PATH}
export LD_LIBRARY_PATH=/home/${USER}/install/lib:${LD_LIBRARY_PATH}
export CC=/usr/local/bin/mpicc
pip install mpi4py
pip install h5py --install-option="--mpi"
#If hdf5 is installed in your home directory add --hdf5=/home/${USER}/install to the --install-option

The SGE configuration guide is quite useful but a number of the commands are wrong…

Before you can run parallel jobs, make sure that you have defined the parallel environment and queue.
To list the existing parallel environments

qconf -spl

To define a new parallel environment

qconf -ap mpi_pe

To look at the config of a parallel environment

qconf -sp mpi_pe

The value of control_slaves must be TRUE; otherwise, qrsh exits with an error message.

The value of job_is_first_task must be FALSE or the job launcher consumes a slot. In other words, mpirun itself will count as one of the slots and the job will fail, because only n-1 processes will start.
The allocation_rule must be either $fill_up or $round_robin or only one host will be used.
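Putting those rules together, qconf -sp mpi_pe should show something like the following (a sketch – the slots count and the *_proc_args will vary with your site):

```
pe_name            mpi_pe
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```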

You can look at the remote execution parameters using

qconf -sconf

To add the queue and attach the parallel environment to it:

qconf -aq mpi.q
qconf -mattr queue pe_list "mpi_pe" mpi.q

Checking and running jobs

The program – note order of imports. This tests the use of h5py in an MPI environment so may be more complex than you need.

from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank  # The process ID (integer 0-3 for 4-process run)

f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
#f.atomic = True

dset = f.create_dataset('test', (MPI.COMM_WORLD.Get_size(),), dtype='i')
dset[rank] = rank

grp = f.create_group("subgroup")
dset2 = grp.create_dataset('host',(MPI.COMM_WORLD.Get_size(),), dtype='S10')
dset2[rank] = MPI.Get_processor_name()

f.close()


The command file

source mpi/bin/activate
mpiexec --prefix /usr/local python

Submitting the job

qsub -cwd -S /bin/bash -pe mpi_pe 2 

Checking mpiexec

mpiexec --prefix /usr/local -n 4 -host oscar,november ~/temp/mpi4py-1.3.1/
source mpi/bin/activate
cd ~/temp/mpi4py-1.3.1/
python demo/

To avoid extracting mpi4py/demo/

#!/usr/bin/env python
"""Parallel Hello World"""

from mpi4py import MPI
import sys

size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

sys.stdout.write(
    "Hello, World! I am process %d of %d on %s.\n"
    % (rank, size, name))


If you get Host key verification failed, make sure that you can ssh to all the nodes configured for the queue (server1 is not the same as its fully qualified name).

Use NFSv4 – if you use v3 then you will get the following message:

File locking failed in ADIOI_Set_lock(fd 13,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 5.
- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
ADIOI_Set_lock:: Input/output error
ADIOI_Set_lock:offset 2164, length 4
File locking failed in ADIOI_Set_lock(fd 12,cmd F_SETLKW/7,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 5.
- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
ADIOI_Set_lock:: Input/output error
ADIOI_Set_lock:offset 2160, length 4
[hostname][[54842,1],3][btl_tcp_endpoint.c:459:mca_btl_tcp_endpoint_recv_blocking] recv(17) failed: Connection reset by peer (104)
[hostname][[54842,1],2][btl_tcp_endpoint.c:459:mca_btl_tcp_endpoint_recv_blocking] recv(15) failed: Connection reset by peer (104)


The basic idea is to split the work into chunks and then combine the results. You can see from the above that if you are using h5py then writing your results out is handled transparently, which is nice.


The v variants (e.g. Scatterv and Gatherv) are used if you cannot break the data into equally sized blocks.
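The bookkeeping behind the v variants is just per-rank counts and displacements; a small pure-Python helper (hypothetical, not part of mpi4py) shows the arithmetic you would feed to Scatterv/Gatherv:

```python
def counts_and_displs(total, nprocs):
    """Split `total` items across `nprocs` ranks as evenly as possible.

    Returns (counts, displs) in the form Scatterv/Gatherv expect:
    the first total % nprocs ranks each get one extra item.
    """
    base, extra = divmod(total, nprocs)
    counts = [base + 1 if rank < extra else base for rank in range(nprocs)]
    displs = [sum(counts[:rank]) for rank in range(nprocs)]
    return counts, displs


# 10 items over 4 ranks: counts [3, 3, 2, 2], displs [0, 3, 6, 8]
print(counts_and_displs(10, 4))
```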

Barrier blocks until all processes have called it.

Getting results from all workers – this will return an array [ worker_data from rank 0, worker_data from rank 1, … ]

worker_data = ....

all_data = comm.gather(worker_data, root = 0)
if rank == 0:
    # all_data contains the results
    pass


Bcast sends data from one process to all others.
Reduce combines data from all processes.


This is probably easier to understand than scatter/gather but you are doing extra work.

There are two obvious strategies available.

Create a results variable of the right dimensions and fill it in as each worker completes:

rank = comm.rank
size = comm.size

#Very crude e.g. if total_size not a multiple of size
total_size = 20
chunk_size = total_size // (size - 1)

if rank == 0:
    all_data = np.zeros((total_size, 4), dtype='i4')
    num_workers = size - 1
    closed_workers = 0
    while closed_workers < num_workers:
        data = np.zeros((chunk_size, 4), dtype='i4')
        x = MPI.Status()
        comm.Recv(data, source=MPI.ANY_SOURCE,tag = MPI.ANY_TAG, status = x)
        source = x.Get_source()
        tag = x.Get_tag()
        insert_point = ((tag - 1) * chunk_size)
        all_data[insert_point:insert_point+chunk_size] = data
        closed_workers += 1

Wait for each worker to complete in turn and append to the results

AC_TAG = 99
if rank == 0:
   for i in range(size-1):
       data = np.zeros((chunk_size, 4), dtype='i4')
       comm.Recv(data, source=i+1, tag=AC_TAG)
       if i == 0:
           all_data = data
       else:
           all_data = np.concatenate((all_data, data))

Just as an example here we are expecting the data to be a numpy 2d array but it could be anything and could just be created once with np.empty as the contents will be overwritten.

The key difference to notice is the value of the source and tag parameters to comm.Recv; these need to be matched by the corresponding parameters to comm.Send, i.e. tag=rank for the first example and tag=AC_TAG for the second,
e.g. comm.Send(ac, dest=0, tag=rank)
Your use of tag and source may vary…
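To see why the tag-based matching in the first strategy tolerates out-of-order arrival, here is a plain-Python simulation of the master's receive loop (no MPI or numpy; the names are made up for illustration):

```python
def assemble_by_tag(arrivals, chunk_size, total_size):
    """Place each worker's chunk by its tag, whatever order it arrives in.

    `arrivals` is a list of (tag, chunk) pairs in the order the master
    happens to receive them; tag is the 1-based worker rank.
    """
    all_data = [None] * total_size
    for tag, chunk in arrivals:
        insert_point = (tag - 1) * chunk_size
        all_data[insert_point:insert_point + chunk_size] = chunk
    return all_data


# Workers 1..3 each send one chunk; they arrive out of order.
arrivals = [(2, [20, 21]), (1, [10, 11]), (3, [30, 31])]
print(assemble_by_tag(arrivals, chunk_size=2, total_size=6))
# [10, 11, 20, 21, 30, 31]
```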

Input data

There are again different ways to do this – either have rank 0 do all the reading and use Send/Recv to send the data to be processed, or let each worker get its own data.


MPI.Wtime() can be used to get the elapsed time between two points in a program

Aikau and CMIS

This is still a work in progress but now has a released version and, with a small amount of testing, seems to work. Do please feel free to try it out and feed back either via the blog or as an issue on github.

The post was originally published in order to help with jira 21647.

Following on from my previous post CMIS Dojo store I thought I’d provide an example working with Aikau and the store from github

Note that this is not intended to be a detailed tutorial on working with Aikau, or CMIS, but should be enough to get you going.

As a caveat there are some fairly significant bugs that cause problems with this:


The code is available as a jar for use with Share but, of course, there’s nothing to stop you using the javascript on its own as part of an Aikau (or dojo) based application.

Just drop the jar into the share/WEB-INF/lib folder or, if you are using maven, you can install it with the following dependency.



A good example for Aikau is Share Page Creation Code

My scenario is as follows:

We have a custom content type used to describe folders containing some work. These folders can be created anywhere however it’s useful to have a page that lists all the custom properties on all the folders of this type. As an added bonus we’ll make these editable as well.

The first thing I’m going to do is write my CMIS query and make sure it returns what I want.
It will end up something like this:
SELECT * FROM wrighting:workFolder join cm:titled as t on cmis:objectId = t.cmis:objectId

It is better to enumerate the fields rather than using * but I’m using * to be concise here.

Simple Configuration

As part of the dojoCMIS jar there’s a handy widget called CmisGridWidget that inspects the data model to fill in the detailed configuration of the column definitions.

You do need to define which columns you want to appear in the data but that is fairly straightforward.

So in Data Dictionary/Share Resources/Pages create a file of type surf:amdpage, with content type application/json. See the file in aikau-example/

You can then access the page at /share/page/hrp/p/name

  "widgets": [{
    "name": "wrighting/cmis/CmisGridWidget",
    "timeout" : 10000,
    "config": {
      "query": {
        "path": "/Sites/test-site/documentLibrary/Test"
      "columns" : [ {
            "parentType": "cmis:item",
            "field" : "cmis:name"
          }, {
            "parentType": "cm:titled",
            "field" : "cm:title"
          }, {
            "parentType": "cm:titled",
            "field" : "cm:description"
          }, {
            "parentType": "cmis:item",
            "field" : "cmis:creationDate"

You’ll notice that this is slightly different from the example/index.html in that it uses CmisGridWidget instead of CmisGrid. (I think it’s easier to debug using example/index.html).
The Widget is the same apart from inheriting from alfresco/core/Core, which is necessary to make it work in Alfresco, and using CmisStoreAlf instead of CmisStore to make the connection.

There are a couple of properties that can be used to improve performance.

If you set configured: true then CmisGrid won’t inspect the data model but will use the columns as configured. If you want to see what a full configuration looks like then set loggingEnabled: true and the full config will be logged, and can be copied into your CmisGrid definition. Note that if you do this then changes to the data model, e.g. new LIST constraint values, won’t be dynamically updated.

What it does in the jar

Share needs to know about my extensions. I’ve also decided that I’m going to import dgrid because I want to use a table to show my information but the beauty of this approach is that you can use any dojo widget that understands a store so there are a lot to choose from. (I don’t need to tell it about the wrighting or dgrid packages because that’s already in the dojoCMIS jar)

So in the jar these are defined in ./src/main/resources/extension-module/dojoCMIS-extension.xml, if you were doing something similar in your amp you’d add the file src/main/amp/config/alfresco/web-extension/site-data/extensions/example-amd-extension.xml

<extension>
  <modules>
    <module>
      <id>example Package</id>
      <configurations>
        <config evaluator="string-compare" condition="WebFramework" replace="false">
          <web-framework>
            <dojo-pages>
              <packages>
                <package name="example" location="js/example"/>
              </packages>
            </dojo-pages>
          </web-framework>
        </config>
      </configurations>
    </module>
  </modules>
</extension>

For convenience there’s also a share-config-custom.xml which allows you to specialize the type to surf:amdpage

CmisGrid will introspect the data model, using CMIS, to decide what to do with each field listed in the columns definition.


The targetRoot is slightly tricky due to authentication issues.

Prior to 5.0.d you cannot authenticate against the CMIS API without messing about with tickets; otherwise you’ll be presented with a popup to type in your username and password (basic auth). (The ticket API seems to have returned in repo 5.2 so it should be possible to use that again – but untested.)

(For an example using authentication tickets see this example)

In 5.0.d and later it will work by using the share proxy by default, however updates (POST) are broken – see JIRA referenced above.

You can use all this outside Share (see example/index.html in the git repo) but you’ll need to get all your security configuration sorted out properly.

Which Store should I use?

There are two stores available – wrighting/cmis/store/CmisStore and wrighting/cmis/store/CmisStoreAlf.

Unsurprisingly CmisStoreAlf is designed to be used within Alfresco as it uses CoreXhr.serviceXhr to make calls.

CmisStore uses jsonp callbacks so is suitable for use outside Alfresco. CmisStore will also work inside Alfresco under certain circumstances e.g. if CSRF protection isn’t relevant.

Detailed Configuration

If you want more control over your configuration then you can create your own widget as shown below.
The jsonModel for the Aikau page (e.g. in src/main/amp/config/alfresco/site-webscripts/org/example/alfresco/components/page/get-list.get.js) should contain a widget definition, along with the usual get-list.desc.xml and get-list.get.html.ftl (<@processJsonModel group="share"/>)

model.jsonModel = {
    widgets : [{
        name : "example/work/List",
        config : {}
    }]
};

Now we add the necessary js files in src/main/amp/web/js according to the locations specified in the configuration above.

So I’m going to create a file src/main/amp/web/js/example/work/List.js

Some things to point out:

This is quite a simple example showing only a few columns but it’s fairly easy to extend.

Making the field editable is a matter of defining the cell as:
editor(config, Widget) but look at the dgrid docs for more details.

I like to have autoSave enabled so that changes are saved straight away.

To stop the post getting too cluttered I’m not showing the List.css or files.

There is another handy widget called cggh/cmis/ModelMultiSelect that will act the same as Select but provide MultiSelect capabilities.

The List.html will contain

 <div data-dojo-attach-point="wrighting_work_table"></div>


                "dojo/_base/array", // array.forEach
                "dojo/_base/declare", "dojo/_base/lang", "dijit/_WidgetBase", "dijit/_TemplatedMixin", "dojo/dom", "dojo/dom-construct",
                "wrighting/cmis/store/CmisStore", "dgrid/OnDemandGrid", "dgrid/Editor", "dijit/form/MultiSelect", "dijit/form/Select",
                "dijit/form/DateTextBox", "dojo/text!./List.html", "alfresco/core/Core"
        function(array, declare, lang, _Widget, _Templated, dom, domConstruct, CmisStore, dgrid, Editor, MultiSelect, Select, DateTextBox, template, Core) {
            return declare(
                            _Widget, _Templated, Core
                        cssRequirements : [{
                                    cssFile : "./List.css",
                                    mediaType : "screen"
                                }, {
                                    cssFile : "js/lib/dojo-1.10.4/dojox/grid/resources/claroGrid.css",
                                    mediaType : "screen"
                                }, {
                                    cssFile : 'resources/webjars/dgrid/1.1.0/css/dgrid.css'
                                }],
                        i18nScope : "WorkList",
                        i18nRequirements : [{
                                i18nFile : "./"
                        }],
                        templateString : template,
                        buildRendering : function wrighting_work_List__buildRendering() {
                        postCreate : function wrighting_work_List_postCreate() {
                            try {

                                var targetRoot;
                                targetRoot = "/alfresco/api/-default-/public/cmis/versions/1.1/browser";

                                this.cmisStore = new CmisStore({
                                    base : targetRoot,
                                    succinct : true
                                });

                                // is the value used

                                var formatFunction = function(data) {

                                    if (data != null) {
                                        if (typeof data === "undefined" || typeof data.value === "undefined") {
                                            return data;
                                        } else {
                                            return data.value;
                                    } else {
                                        return "";

                                var formatLinkFunction = function(text, data) {

                                    if (text != null) {
                                        if (typeof text === "undefined" || typeof text.value === "undefined") {
                                            if (data['alfcmis:nodeRef']) {
                                                return '' + text + '';
                                            } else {
                                                return text;

                                        } else {
                                            return text.value;
                                    } else {
                                        return "";
                                this.grid = new (declare([dgrid,Editor]))(
                                            store : this.cmisStore,
                                            query : {
                                                'statement' : 'SELECT * FROM wrighting:workFolder ' +
                                                    'join cm:titled as t on cmis:objectId = t.cmis:objectId',
                                            columns : [
                                                           label : this.message(""),
                                                           field : "cmis:name",
                                                           formatter : formatLinkFunction
                                                           label : this.message("work.schedule"),
                                                           field : "",
                                                           autoSave : true,
                                                           editor : "checkbox",
                                                           get : function(rowData) {
                                                               var d1 = rowData[""];
                                                               if (d1 == null) {
                                                                   return false;
                                                               var date1 = d1[0];
                                                               return (date1);
                                                       }, {
                                                           label : this.message("work.title"),
                                                           field : "",
                                                           autoSave : true,
                                                           formatter : formatFunction,
                                                           editor : "text"
                                                       }, {
                                                           field : this.message(''),
                                                           editor: DateTextBox,
                                                           autoSave : true,
                                                           get : function(rowData) {
                                                               var d1 = rowData[""];
                                                               if (d1 == null) {
                                                                   return null;
                                                               var date1 = new Date(d1[0]);
                                                               return (date1);
                                                           set : function(rowData) {
                                                               var d1 = rowData[""];
                                                               if (d1) {
                                                                   return d1.getTime();
                                                               } else {
                                                                   return null;
                                        }, this.wrighting_work_table);
                            } catch (err) {



OpenLDAP – some installation tips

These are some tips for installing OpenLDAP – you can get away without these but it’s useful stuff to know. This relates to Ubuntu 14.04.

Database configuration

It’s a good idea to configure your database, otherwise it – especially the log files – can grow significantly over time if you’re running a lot of operations.

dn: olcDatabase={1}hdb,cn=config
changetype: modify
add: olcDbConfig
olcDbConfig: set_cachesize 0 2097152 0
olcDbConfig: set_lk_max_objects 1500
olcDbConfig: set_lk_max_locks 1500
olcDbConfig: set_lk_max_lockers 1500
olcDbConfig: set_lg_bsize 2097512
olcDbConfig: set_flags DB_LOG_AUTOREMOVE
add: olcDbCheckpoint
olcDbCheckpoint: 1024 10

In particular note how the checkpoint is set – without it the logs won’t be removed. There are quite a few references on the internet to setting it as part of the olcDbConfig but that doesn’t work.

ldapmodify -Y EXTERNAL -H ldapi:/// -f dbconfig.ldif

These values will be stored in /var/lib/ldap/DB_CONFIG, and also updated if changed. This should avoid the need to use any of the Berkeley DB utilities.

It’s also possible to change the location of the database and log files but don’t forget that you’ll need to update the apparmor configuration as well.

Java connection problems

If you are having problems connecting over ldaps using java (it’s always worth checking with ldapsearch on the command line) then it might be a certificates problem – see

You need to copy local_policy.jar and US_export_policy.jar from the download into jre/lib/security e.g.

cp *.jar /usr/lib/jvm/java-8-oracle/jre/lib/security/

You’ll need to do this again after an update to the jre.


If you are doing a lot of command line ldap operations it can be helpful to use the -y option with a stored password file


Don’t forget to edit the value of SLAPD_SERVICES in /etc/default/slapd to contain the full hostname if you are connecting from elsewhere. IP address is recommended if you want to avoid problems with domain name lookups.


The memberOf overlay doesn’t seem that reliable in a clustered configuration, so it may be necessary to remove and re-add members from groups in order to get it working.

Mapping groupOfNames to posixGroup

See this serverfault article using this schema
You need to replace the nis schema, so first of all find out the dn of the existing nis schema

slapcat -n 0 | grep 'nis,cn=schema,cn=config'

This will give you something like dn: cn={2}nis,cn=schema,cn=config
Now you need to modify the rfc2307bis.ldif so that you can use ldapmodify. This is a multi-stage process.
First change the schema

dn: cn={2}nis,cn=schema,cn=config
changetype: modify
replace: olcAttributeTypes
replace: olcObjectClasses

It’s still got the original name at this point so let’s change that as well

dn: cn={2}nis,cn=schema,cn=config
changetype: modrdn
newrdn: cn={2}rfc2307bis
deleteoldrdn: 1

A quick check using slapcat – but I get an error!

/etc/ldap/slapd.d: line 1: substr index of attribute "memberUid" disallowed
573d83c3 config error processing olcDatabase={1}hdb,cn=config: substr index of attribute "memberUid" disallowed
slapcat: bad configuration file!

so another ldapmodify to fix this – I’ll just remove it for now but it would be better to index member instead.

dn: olcDatabase={1}hdb,cn=config
changetype: modify
delete: olcDbIndex
olcDbIndex: memberUid eq,pres,sub

groupOfNames and posixGroup objectClasses can now co-exist.

On a client machine you will need to add the following to /etc/ldap.conf

nss_schema rfc2307bis
nss_map_attribute uniqueMember member

This isn’t entirely intuitive! You might expect nss_map_attribute memberUid member and, whereas that sort of works, it doesn’t resolve the dn to the uid of the user and is therefore effectively useless.

Dynamic groups

Make sure you check the N.B.!
I tried this for mapping groupOfNames to posixGroup but it doesn’t work for that use case; however, it’s potentially useful so I’m still documenting it.
You need to load the dynlist overlay (with ldapadd)

dn: cn=module{0},cn=config
changetype: modify
add: olcModuleLoad
olcModuleLoad: dynlist

then configure the attr set so that uid maps to memberUid

dn: olcOverlay=dynlist,olcDatabase={1}hdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcDynamicList
olcOverlay: dynlist
olcDlAttrSet: posixGroup labeledURI memberUid:uid

You then need to add the objectClass labeledURIObject to your posixGroup entry and define the labeledURI e.g.


Now if you search in ldap for your group it will list the memberUid that you expect.
You can run getent group mygroup and it will report the members of that group correctly.
N.B. For practical purposes this doesn’t actually work – see this answer on StackOverflow
This post describing using the rfc2307bis schema for posix groups looks interesting as well.

Running in debug


/usr/sbin/slapd -h ldapi:/// -d 16383 -u openldap -g openldap

Client set up

Make sure the box can access the LDAP servers

Add the server to inbound security group rules e.g. 636 <ipaddress>/32
apt-get install ldap-utils

Optionally test with

ldapsearch -H ldaps:// -D "cn=system,ou=users,ou=system,dc=mydomain,dc=com" -W '(objectClass=*)' -b dc=mydomain,dc=com

Set up a person in LDAP by adding objectClasses posixAccount and ldapPublicKey

apt-get install ldap-auth-client

See /etc/default/slapd on the ldap server

ldaps:// ldaps://

Make local root Database admin – No
LDAP database require login – Yes
use password

Settings are in /etc/ldap.conf

If you want home directories to be created then add the following to /etc/pam.d/common-session

session required

You can check out autofs-ldap or pam_mount if you’d prefer to mount the directory (might require rfc2307bis).

Now run the following commands

auth-client-config -t nss -p lac_ldap

Now test
su - myldapaccount

Check /var/log/auth.log if problems

If you want to use LDAP groupOfNames as posixGroups see above.

For ssh keys in LDAP – add the sshPublicKey to the ldap record. Multiple keys can be stored. e.g. using openssh-lpk_openldap

Make sure ssh server is correctly configured

dpkg-reconfigure openssh-server

Add the following to /etc/ssh/sshd_config – both are needed, then create the file using the contents below

Restart the ssh service after doing both steps and check that it has restarted (pid given in the start message)

AuthorizedKeysCommand /etc/ssh/
AuthorizedKeysCommandUser nobody

Contents of /etc/ssh/
You can restrict access by modifying the ldapsearch command
Access can also be restricted by using the host field in the ldap user record but that’s more complicated

The script must only be writeable by root.


TMPFILE=$(mktemp)   # temp file for the search output (was assumed by the original script)

uri=`grep uri /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`
binddn=`grep binddn /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`
bindpw=`grep bindpw /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`
base=`grep base /etc/ldap.conf | egrep -v ^# | awk '{print $2}'`

for u in `grep uri /etc/ldap.conf | egrep -v ^# | awk '{for (i=2; i<=NF; i++) print $i}'`
do
    ldapsearch -H ${u} \
        -w "${bindpw}" -D "${binddn}" \
        -b "${base}" \
        '(&(objectClass=posixAccount)(uid='"$1"'))' \
        'sshPublicKey' > $TMPFILE
    RESULT=$?
    grep sshPublicKey:: $TMPFILE > /dev/null
    if [ $? -eq 0 ]
    then
        # base64 encoded key
        sed -n '/^ /{H;d};/sshPublicKey::/x;$g;s/\n *//g;s/sshPublicKey:: //gp' $TMPFILE | base64 -d
    else
        sed -n '/^ /{H;d};/sshPublicKey:/x;$g;s/\n *//g;s/sshPublicKey: //gp' $TMPFILE
    fi
    if [ $RESULT -eq 0 ]
    then
        break
    fi
done

Command reference

ldapsearch -H ldapi:/// -x -y /root/.ldappw -D 'cn=admin,dc=mydomain,dc=com' -b 'dc=mydomain,dc=com' "(cn=usersAdmin)"

Note that the syntax of the LDIF files for the next two commands is somewhat different

Adding entries
ldapadd -H ldapi:/// -x -y ~/.ldappw -D 'cn=admin,dc=mydomain,dc=com' -f myfile.ldif

Making changes
ldapmodify -Y EXTERNAL -H ldapi:/// -f myfile.ldif

Recursively removing a sub-tree
ldapdelete -H ldapi:/// -x -y ~/.ldappw -D "cn=admin,dc=mydomain,dc=com" -r "ou=tobedeleted,dc=mydomain,dc=com"

Checking databases used
ldapsearch -H ldapi:// -Y EXTERNAL -b "cn=config" -LLL -Q "olcDatabase=*" dn

A dojo store for the cmis browser binding

First of all why am I doing this?

Dojo is a popular javascript library which is used extensively and, of particular interest, is coming to more prominence within Alfresco. The dojo/store API is based on HTML5/W3C’s IndexedDB object store API
and is useful because stores can be used to provide the data access methods for a wide range of dojo/dijit widgets, making it easy to visualize data in any number of ways.

CMIS is a standard used to access content stored in a repository, such as Alfresco, and, particularly with the advent of the browser binding in CMIS 1.1, makes it possible to manage information within that repository using a series of HTTP requests.

While the CMIS API is relatively straightforward there are some wrinkles, particularly with respect to cross-origin requests, so it seems to make sense, allied to the advantages of having the API available as a dojo store, to provide a wrapper for the simple actions at least.

So, now I’ve explained my motivation, on with a brief description and some basic examples. (This is available in example.html in the git repository)

The first thing to do is to create a store:

var targetRoot = 

this.cmisStore = new CmisStore({
                                 base: targetRoot,
                                 succinct: true
                               });

The first thing to notice is the value of base – note that there is no /root at the end – this will be automatically appended by the store (use the root option if you need to change this)

Next we need to attach it to something – I’m going to use dijit.Tree here.

We’ll need to provide a mapping to the ObjectStoreModel – I’ll also wrap it in Observable so we can see any changes. (This doesn’t work brilliantly for dijit.Tree as it puts in a duplicate node at the root level as well as in the right place – I haven’t worked out why yet – probably something to do with not knowing the parent)

You’ll see the query parameter which is used to determine what to fetch – this could be a path to a folder

We also have to provide a function to determine whether it’s a leaf node – contrary to the documentation this function isn’t called if it’s in the CmisStore (neither is getLabel)

    this.treeStore = new Observable(this.cmisStore);
    // Create the model
    var model = new ObjectStoreModel({
        store: this.treeStore,
        query: { path: this.path, cmisselector: 'object'},
        labelAttr: "cmis:name",
        mayHaveChildren : function(data) {
                    if (data['cmis:baseTypeId'] == 'cmis:folder') {
                        return true;
                    } else {
                        return false;
                    }
                }
    });

Now that we’ve done that we can create our Tree.

   // Create the Tree.
    this.tree = new Tree({
        model: model
    });

That’s it – you’ll have a nice tree representation of your CMIS repository – and it’s just as easy to use other widgets like one of the data grids – plenty of scope to get creative! (e.g.

Making changes

Here you can see some code to add a folder.
First you fetch the parent folder – this can be done either by path or objectId. If the parameter contains a / or { usePath: true } is set as the second options parameter then it’s treated as a path, otherwise as an objectId.
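As an aside, that path-vs-objectId rule is simple enough to sketch; here it is in Python (purely illustrative – the store itself is JavaScript, and the function name is mine):

```python
def is_path(param, options=None):
    """Illustrative only: decide whether a get() parameter is a path or an objectId."""
    options = options or {}
    # a '/' in the parameter, or an explicit usePath option, means a path
    return ('/' in param) or bool(options.get('usePath'))

print(is_path('/Sites/test/documentLibrary'))   # True - a path
print(is_path('a1b2c3d4'))                      # False - an objectId
print(is_path('a1b2c3d4', {'usePath': True}))   # True
```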

This object is then set as parent in the options parameter of the store.add call as shown in the example.

Finally once the folder has been added the grid can be refreshed to show the new folder.

You’ll see that there’s a formatter function to take account of whether the succinct version of the CMIS response is being used – note this only handles simple values.

lang.hitch is used so that the function called inside the “then” has access to the same context.

addFolder: function addFolder() {
      //Get the test folder to be the parent of the new folder
      this.cmisStore.get(this.testFolder).then(lang.hitch(this, function(result) {
          this.cmisStore.add({
                               'cmis:objectTypeId': 'cmis:folder',
                               'cmis:name': this.tempFolder
                             }, {
                               parent: result
                             }).then(
                               lang.hitch(this, function(result) {
                                   //Do something
                               }),
                               function (error) {
                               },
                               function (progress) {
                               });
      }));
}

Making changes to your content is very easy – the store handles it – so all you’ve got to do is make sure your widget handles the data correctly.

One way you might like to edit your data is via dgrid – this makes it extremely straightforward to add an editor to a column e.g.

       label : ("pub.title"),
       field : "",
       formatter : formatFunction,
       editor : "text",
       autoSave : true

One thing you will notice is what my field is called – this is because the query on my OnDemandGrid is defined like this:

  query : {
    'statement' : 'SELECT * FROM cmis:folder ' +
                  'join cm:titled as t on cmis:objectId = t.cmis:objectId'
  },

The code inside the store put method will strip off the leading alias, i.e. up to and including the ‘.’.
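The stripping is trivial but worth being precise about; in Python terms (illustration only – the store does this in JavaScript):

```python
def strip_alias(field_name):
    # Remove a leading query alias: everything up to and including the first '.'
    # CMIS property names use ':' rather than '.', so this is safe
    head, sep, tail = field_name.partition('.')
    return tail if sep else field_name

print(strip_alias('t.cm:title'))   # cm:title
print(strip_alias('cmis:name'))    # cmis:name (no alias, unchanged)
```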

You need to be aware that not all properties can be updated via CMIS – the store has a couple of ways of handling this, working from either a list of excluded properties or a list of allowed properties. Which is used is determined by the putExclude property, which is set to true or false.

If you are working with custom properties then you may need to modify the list – this can be done by modifying the excludeProperties or allowedProperties members of the store e.g.


Note this works on the absolute property name, not the namespace trimmed value.

The store will post back the entire object, not just the changed properties so you either need to make sure that the value is valid or exclude the property.

Error handling isn’t covered here and will depend on which widget you’re using.


For handling dates you need to convert between the CMIS date (milliseconds since the epoch) and a javascript Date object as part of the editor definition.
Use dijit/form/DateTextBox as your editor widget.

 editor({
     field : 'p.myns:myDate',
     autoSave : true,
     get: function(rowData) {
         var d1 = rowData["p.myns:myDate"];
         if (d1 == null) {
             return null;
         }
         var date1 = new Date(d1[0]);
         return date1;
     },
     set: function (rowData) {
         var d1 = rowData["p.myns:myDate"];
         if (d1) {
             return d1.getTime();
         } else {
             return null;
         }
     }
 }, DateTextBox),

The CMIS server will error if it is sent an empty string as a datetime value so in order to avoid this the CmisStore will not attempt to send null values.
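For reference, the arithmetic the get/set pair performs – converting between an epoch value in milliseconds (what getTime() produces) and a date object – looks like this in Python (illustration only; the real code is JavaScript):

```python
from datetime import datetime, timezone

def from_cmis(ms):
    # CMIS datetime properties arrive as milliseconds since the epoch
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def to_cmis(dt):
    # ...and are sent back the same way
    return int(dt.timestamp() * 1000)

d = from_cmis(1394720869000)
print(d.year)       # 2014
print(to_cmis(d))   # 1394720869000 - the round trip is lossless
```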


For a simple select just use a dijit/form/Select widget as the basis for your editor and set the options using the editorArgs e.g.

 label : ("pub.type"),
 field : "p.cgghPub:type",
 editorArgs : {
    options : [
                 { label : "one", value : "1"}
}, Select)


MultiSelect doesn’t have the luxury of using options in the constructor – the easiest way I found is to create your own widget and use that e.g.

declare("CategoryMultiSelect", MultiSelect, {
                                    size : 3,
                                    postCreate : function() {
                                        domConstruct.create('option', {
                                            innerHTML : 'cat1',
                                            value : 'cat1'
                                        }, this.domNode);
                                        domConstruct.create('option', {
                                            innerHTML : 'cat2',
                                            value : 'cat2'
                                        }, this.domNode);
                                        domConstruct.create('option', {
                                            innerHTML : 'cat3',
                                            value : 'cat3'
                                        }, this.domNode);

Other information:

The most useful settings for query are either a string, representing a path or object id, or an object containing either/both of the members path and statement where statement is a CMIS query e.g. SELECT * FROM cmis:document.

The store uses deferred functions to manipulate the query response so that either the succinctProperties or the properties object for each item is returned – if you’re not using succinct (the default) then make sure you get the value for your property

The response information is retrieved by making a second call to get the transaction information.

Add actually makes three calls to the server – add, retrieve the transaction and then fetch the new item – although it’s not documented, it seems that Tree at least expects the response from add to be the created item.

The put method only allows you to update a limited number of properties (note cmis:description is not the same as cm:description) and returns the CMIS response rather than the modified object

Remove makes the second call despite the fact that it’s not in a transaction – this allows response handling to happen.

For documentation there is the Dojo standard – I did also consider using JSDoc but decided to stick with the Dojo format

There are some tests written with Intern however they are fairly limited – not least because there’s a very simple pseudo CMIS server used.

Getting started with Hadoop 2.3.0

Googling will get you instructions for the old version so here are some notes for 2.3.0

Note that there appears to be quite a difference with version 2 although it is supposed to be mostly compatible

You should read the whole post before charging off and trying any of this stuff as you might not want to start at the beginning!

which has a script at: – this is good but needs changes around the downloading of the hadoop file – be careful if you run it more than once

Changes from the blog (not necessary if using the script)

in ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

sudo ssh hduser@localhost -i /home/hduser/.ssh/id_rsa

If gives errors

hdfs getconf -namenodes

If you see the following:

OpenJDK 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/ which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
14/03/13 15:27:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Try the following in /usr/local/hadoop/etc/hadoop/

export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

(Although this should work in ~/.bashrc it appears not to)

Using files in HDFS

#Create a directory and copy a file to and fro
hadoop fs -mkdir -p /user/hduser
hadoop fs -copyFromLocal someFile.txt someFile.txt

hadoop fs -copyToLocal /user/hduser/someFile.txt someFile2.txt

#Get a directory listing of the user’s home directory in HDFS

hadoop fs -ls

#Display the contents of the HDFS file /user/hduser/someFile.txt
hadoop fs -cat /user/hduser/someFile.txt

#Delete the file
hadoop fs -rm someFile.txt

Doing something

Context is everything so what am I trying to do?

I am working with VCF (Variant Call Format) files which are used to hold genetic information – I won’t go into details as it’s not very relevant here.

VCF is a text file format. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.

Hadoop itself is written in Java so the natural choice for interacting with it is to use a Java client and while there is a VCF reader in GATK (see it is more common to use python.

Tutorials in Data-Intensive Computing gives some great, if incomplete at this time, advice on using Hadoop Streaming together with pyvcf (there’s some nice stuff on using Hadoop on a more traditional cluster as well which is an alternative to the methods described above)

Pydoop provides an alternative to Streaming via hadoop pipes but seems not to have quite caught up with the current state of play.

Another possibility is to use Jython to translate the python into java see here

One nice thing about using Streaming is that it’s fairly easy to do a comparison between a Hadoop implementation and a traditional implementation.

So here are some numbers (using the from the Data-Intensive Computing tutorial)

Create the header file -b data.vcf > header.txt


date;$(which python) $PWD/ -m $PWD/header.txt,0.30 < data.vcf |
$(which python) $PWD/ -r > out;date

Hadoop (Single node on the same computer)

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -mapper "$(which python) $PWD/ -m $PWD/header.txt,0.30" -reducer "$(which python) $PWD/ -r" -input $PWD/vcfparse/data.vcf -output $PWD/vcfparse/output

The output files contain the same data however the rows are held in a different order.

When running on a cluster we’ll need to use the -combiner option and -file to ship the scripts to the cluster.

The MapReduce framework orders the keys in the output; if you’re doing this in Java you will get an iterator for each key but, obviously, not when you’re streaming.
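Since streaming only guarantees that identical keys arrive next to each other, the per-key iterator is easy to recreate yourself; a sketch using itertools.groupby (the sample lines are made up, in the key<TAB>value shape used by the streaming scripts):

```python
import itertools

# Made-up mapper output: the framework sorts by key, so identical keys are adjacent
lines = [
    "chr1-500\t1-True",
    "chr1-500\t1-False",
    "chr1-1000\t1-True",
]

def key_of(line):
    return line.split("\t", 1)[0]

# groupby recreates the per-key iterator that the Java API provides
counts = {}
for key, group in itertools.groupby(lines, key=key_of):
    counts[key] = len(list(group))

print(counts)  # {'chr1-500': 2, 'chr1-1000': 1}
```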

Running locally with a 1020M test input file seems to indicate a good speed up (~2 mins vs ~6 mins). Now that I’ve tried it with a relatively small file it’s time to scale up a bit, to a 12G file, and to move to an 8 processor VM (slower disk) – not an ideal test machine, but it’s what I’ve got easily to hand and better than using my desktop, where there are other things going on.


You can look at some basic statistics via http://localhost:8088/

Note that it does take a while to copy the file to/from the Hadoop file system which is not included here

Number of splits: 93

method Map Jobs Reduce Jobs Time
pipes N/A N/A 2 hours 46 mins 17 secs
Single Node Default Default 49 mins 29 secs
Single Node 4 4 1 hr 14 mins 51 secs
Single Node 6 2 1 hr 6 secs
Single Node 2 6 1 hr 13 mins 25 secs

An example using streaming, Map/Reduce with a tab based input file

Assuming you’ve got everything set up

Start your engines

If necessary

dfs is the file system

yarn is the job scheduler

Copy your input file to the dfs

hadoop fs -mkdir -p /user/hduser
hadoop fs -copyFromLocal someFile.txt data
hadoop fs -ls -h

The task

The aim is to calculate the variant density using a particular window on the genome.

This is a slightly more complex version of the classic “hello world” of hadoop – the word count.

Input data

The input file is a tab delimited file containing one line for each variant – 28G, over 95,000,000 lines.

We are interested in the chromosome, position and whether the PASS filter has been applied.

The program

First we need to work out which column contains the PASS filter – awk is quite helpful to check this

head -1 data | awk -F'\t' '{print $18}'

(Remember awk counts from 1 not 0)
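The same check can be done in Python, remembering it counts from 0, so awk’s $18 is index 17 (the header below is invented just to show the indexing):

```python
# An invented 18-column tab-delimited header, just to demonstrate the indexing
header = "CHROM\tPOS\t" + "\t".join("COL%d" % i for i in range(3, 18)) + "\tPASS"
cells = header.split("\t")
print(len(cells))  # 18
print(cells[17])   # PASS  (awk's $18)
```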

The mapper

For the mapper we will build a key/value pair for each line – the key is a combination of the chromosome and bucket (1kb window) and the value a count and whether it passes/fails (we don’t really need the count…)

#!/usr/bin/env python

import sys

for line in sys.stdin:
    cells = line.rstrip('\n').split('\t')
    chrom = cells[0]
    pos = cells[1]
    pass_filter = None
    if cells[17] == "True":
        pass_filter = True
    if cells[17] == "False":
        pass_filter = False
    if pass_filter is not None:
        bucket = int(int(pos) / 1000)
        point = (bucket + 1) * (1000 / 2)
        print("%s-%d\t1-%s" % (chrom, point, str(pass_filter)))


You can easily test this on the command line using pipes e.g.

head -5 data | python

The reducer

The reducer takes the output from the mapper and merges it according to the key

Test again using pipes


#!/usr/bin/env python

import sys

last_key = None
running_total = 0
passes = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    variant, pass_filter = value.split('-')
    if last_key == this_key:
        running_total += int(variant)
        if pass_filter == "True":
            passes = passes + 1
    else:
        if last_key:
            chrom, pos = last_key.split('-')
            print("%s\t%s\t%d\t%d" % (chrom, pos, running_total, passes))
        running_total = int(variant)
        if pass_filter == "True":
            passes = 1
        else:
            passes = 0
        last_key = this_key

# flush the final key
if last_key:
    chrom, pos = last_key.split('-')
    print("%s\t%s\t%d\t%d" % (chrom, pos, running_total, passes))



head -5 data | python | python



Note the mapper and reducer scripts are on the local file system and the input and output files on HDFS

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -mapper "$(which python) $PWD/" -reducer "$(which python) $PWD/" -input data -output output

Copy the output back to the local file system

hadoop fs -copyToLocal output

If you want to sort the output then the following command does a nice job

sort -V output/part-00000

Don’t forget to clean up after yourself

hadoop fs -rm -r output
hadoop fs -rm data

Now we’ve got the job running we can start to make it go faster. The first thing to try is to increase the number of tasks – I’m using an 8 processor VM so I’ll try 4 of each to start with (the property names for doing this have changed)

hadoop fs -rm -r output
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -D mapreduce.job.reduces=4 -D mapreduce.job.maps=4 -mapper "$(which python) $PWD/" -reducer "$(which python) $PWD/" -input data -output output

Looking at the output I can see

INFO mapreduce.JobSubmitter: number of splits:224

This seems to indicate that I could usefully go up to 224(*2) jobs if I had enough cores free


method Map Jobs Reduce Jobs Time
pipes N/A N/A 31 mins 31 secs
Single Node Default Default 39 mins 6 secs
Single Node 4 4 28 mins 18 secs
Single Node 5 3 33 mins 1 secs
Single Node 3 5 31 mins 8 secs
Single Node 5 5 28 mins 40 secs
Single Node 6 2 49 mins 55 secs


From these brief experiments it looks like there is no point using the Map/Reduce framework for trivial tasks even on large files.

A more positive result is that it looks like there may well be some advantage for more complex tasks and this merits some further investigation as I’ve only scratched the surface here.

Some things to look at are:

The output won’t be in the same order as the input, so if this is important Hadoop streaming has Comparator and Partitioner options to help sort results from the map to the reduce
You can decide to split the map outputs based on certain key fields, not the whole keys – see the Hadoop Partitioner class
See the docs for 1.2.1 here

How do I generate output files with gzip format?

Instead of plain text files, you can generate gzip files as your generated output. Pass '-D mapred.output.compress=true -D' as option to your streaming job.

How to use a compressed input file

Alfresco as Extranet

In a couple of projects I’ve worked on we’ve been using Alfresco as an extranet – that’s to say we’ve given external people access to our Alfresco instance so that we can collaborate by sharing documents and using the other site functions like discussion lists and wikis.

We’ve also had these Alfresco instances integrated into a wider single sign on system.

We want people to be able to self register into the SSO system, for a number of reasons.

This has led to a couple of problems.

Firstly we don’t want somebody to be able to self register and then log into Alfresco and collect our user list by doing a people search.
Secondly we’d like to be able to restrict who can log into Alfresco but give a helpful message if they’ve authenticated successfully.
Thirdly we want to restrict site creation.

Restricting site creation

I’ll cover this first because it’s quite straightforward and documented elsewhere.

There are two parts to the problem:
1) Blocking access to the API
2) Removing the menu option in the UI.

Part 1 can be done by modifying the appropriate bean from public-services-security-context.xml
Part 2 will depend on your version of Alfresco and is adequately covered elsewhere.
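For reference, the change to public-services-security-context.xml amounts to tightening the ACL entry for createSite in the SiteService security bean – something like the fragment below. This is a sketch: the bean id and entry names are from memory of the stock file and may differ between Alfresco versions, so check your own copy before editing.

```xml
<!-- public-services-security-context.xml: restrict site creation so that
     only members of ALFRESCO_ADMINISTRATORS may create sites, instead of
     the default ACL_ALLOW. -->
<bean id="SiteService_security" ...>
  <property name="objectDefinitionSource">
    <value>
      org.alfresco.service.cmr.site.SiteService.createSite=ACL_METHOD.GROUP_ALFRESCO_ADMINISTRATORS
      org.alfresco.service.cmr.site.SiteService.*=ACL_ALLOW
    </value>
  </property>
</bean>
```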

Restricting access to the user list

This has come up a few times

It’s even in the To Do section on the wiki

Simple approach

The simplest approach is to change the permissions on /sys:system/sys:people.
You can do this by finding the nodeRef using the Node Browser and then going to: share/page/manage-permissions?nodeRef=xxxx

You’ll need to create a group of all your users and give them read permission, replacing the EVERYONE permission.

You could get carried away with this by changing the permissions on individual users but that’s not a great idea.

More complex approach

A more complex approach is to use ACLs, in a similar fashion to the approach used to block site creation; however, this does require some custom code and still isn’t perfect.

There are some changes required to make this work nicely, above and beyond creating the custom ACL code.

In org.alfresco.repo.jscript.People.getPeopleImpl, if getPeopleImplSearch is used then:
1) it’s not using PersonService.getPeople
2) if it’s using FTS and PersonService.getPerson subsequently throws an AccessDeniedException then it will cause an error (in the case of an exception it will fall through, giving the desired result, but not in a good way, as the more complex search capabilities will be lost)
This, I think, would be a relatively simple change, although I’m not sure whether to catch the exception or use the getPersonOrNull method and ignore the null – I’m going with the latter:

// FTS
List<NodeRef> personRefs = getPeopleImplSearch(filter, pagingRequest, sortBy, sortAsc);

if (personRefs != null)
{
    persons = new ArrayList<>(personRefs.size());
    for (NodeRef personRef : personRefs)
    {
        Person p = personService.getPersonOrNull(personRef);
        if (p != null)
        {
            persons.add(p);
        }
    }
}
The usernamePropertiesDecorator bean will throw an exception if access to the person node is denied – this has a major impact, so we need to replace it with a custom implementation that swallows the exception and outputs something sensible instead.

I’ve logged an issue to get these fixes made.

Oddities that don’t appear to break things

The user profile page /share/page/user/xxxx/profile will show your own profile if you try to access a profile that you don’t have access to – strange, but relatively harmless.
The relevant exceptions are:

There are numerous places where the user name will be shown instead of the person’s actual name if permission to access the person record is denied, i.e. code that isn’t using the usernamePropertiesDecorator – this appears to be done via Alfresco.util.userProfileLink. While far from ideal this isn’t too bad, as the information is only shown where you have access to a node but not to its creator/modifier information, e.g. a shared document.

Other approaches

It looks like there are a few ways to go about doing this…

The forum posts listed discuss (sketchily!) modifying the client-side JavaScript and the webscripts.

At the lowest level you could modify the PersonService and change the way that the database is queried but that seems too low level

config/alfresco/ibatis/alfresco-SqlMapConfig.xml defines queries
config/alfresco/ibatis/org.hibernate.dialect.Dialect/query-people-common-SqlMap.xml defines alfresco.query.people
which is used in
which in turn is used by

Restricting access

As this seems to have come up a few times as well…

I’m trying to work out if it’s possible to disable some external users.

My scenario is that I have SSO and LDAP enabled but I only want users who are members of a site to be able to access Share – ideally I’d like to be able to send other users to a static page where they’d be shown some information. At the moment, if you attempt to access a page for which you don’t have access, e.g. share/page/console/admin-console, you will go to the Share log-in page (which you wouldn’t otherwise see).

I still want all the users sync’ed so using a filter to restrict the LDAP sync isn’t an option.
I only want to restrict access to Alfresco so fully disabling the account isn’t an option.

It’s relatively easy to identify the users and apply the cm:personDisabled aspect but this doesn’t appear to do anything.
See this issue.

I think the reason the aspect doesn’t work is that isAuthenticationMutable returns false and therefore the aspect is never checked.

I can see the idea of not changing the sync’ed users – otherwise a full resync would lose changes.
I can also see not wanting to allow updates to LDAP, although the case for that is perhaps weaker.

However, given that it’s possible to edit profiles under these circumstances (e.g. to change a telephone number), wouldn’t it make more sense for cm:personDisabled to be treated along with the Alfresco-specific attributes, and therefore editable, rather than with the LDAP-specific attributes, and therefore not editable?
Actually “applicable” is probably a better word than “editable”, as it’s possible to apply the aspect programmatically – it just doesn’t do anything.

I did think about checking some field in LDAP but I don’t think that would work without getting into custom schemas (not a terrible idea but not a great one either)

So going back to my earlier requirement to show a page to users who don’t have the requisite permission I came up with the following approach:

  • Use a cron based action to add all site users to a group all_site_users
  • Use evaluators to check if user is a member of all_site_users and if not then:
    1. Hide the title bar and top menu
    2. Hide dashboard dashlets
    3. Show a page of text

Further adventures with CAS and Alfresco (and LDAP)

Like Alfresco in the cloud and myriad other systems we’ve decided to use the email address as the user name for logging in. This works fine until you want to allow the user to be able to change their email.

The problem here is that Alfresco doesn’t support changing user names (I believe it can be done with some database hacking, but it’s not recommended).

My solution here is to allow the CAS login to use the mail attribute as the user name, but to pass the uid to Alfresco to use as the Alfresco user name. While this means that the Alfresco user name is not the same as the one they used to log in, it does allow the mail attribute to change, and as the user name isn’t often visible this works quite well. In fact it’s not too bad to set the uid to the mail address, especially if the rate of change is low, although there are some situations where that is potentially confusing.

So how to do it…

First configure CAS (I’m using 4.0_RC2 at the moment)

In your deployerConfigContext.xml find your registeredServices and add

 <property name="usernameAttribute" value="uid"/>

so you end up with something like this:

<bean class="" p:id="0"
	p:name="HTTP and IMAP" p:description="Allows HTTP(S) and IMAP(S) protocols"
	p:serviceId="^(https?|imaps?)://*" p:evaluationOrder="10000001">
    <property name="usernameAttribute" value="uid"/>
</bean>

For 4.1 you’ll need:

Note that you need the allowedAttributes to contain the usernameAttribute otherwise the value of the usernameAttribute will be ignored.

<bean class="" p:id="0" ...>
    <property name="usernameAttributeProvider">
        <bean class="" c:usernameAttribute="uid" />
    </property>
    <property name="attributeReleasePolicy">
        <bean class="">
            <property name="allowedAttributes">
                ...
            </property>
        </bean>
    </property>
</bean>

Now to configure Share and Alfresco (see previous posts)

If you are using CAS 4.0_RC2 then make sure that you are using the CAS 2 protocol (or SAML, but I’d go with CAS 2). If you are using the Java client, then in web.xml your CAS Validation Filter will be:

   <filter-name>CAS Validation Filter</filter-name>

(This will work for CAS 1 in later versions)
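For completeness, a typical CAS 2 validation filter declaration for the Jasig Java client looks like the fragment below. The hostnames are placeholders, and the filter class name should be checked against the version of the client you are using:

```xml
<filter>
  <filter-name>CAS Validation Filter</filter-name>
  <filter-class>org.jasig.cas.client.validation.Cas20ProxyReceivingTicketValidationFilter</filter-class>
  <init-param>
    <param-name>casServerUrlPrefix</param-name>
    <param-value>https://cas.example.org/cas</param-value>
  </init-param>
  <init-param>
    <param-name>serverName</param-name>
    <param-value>https://alfresco.example.org</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>CAS Validation Filter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
```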