Getting started with Hadoop 2.3.0

Googling will mostly get you instructions for the old version, so here are some notes for 2.3.0.

Note that version 2 is quite different from version 1, although it is supposed to be mostly compatible.

You should read the whole post before charging off and trying any of this, as you might not want to start at the beginning!

The blog I followed has a script. The script is good but needs changes around the downloading of the Hadoop file; be careful if you run it more than once.

Changes from the blog (not necessary if using the script)

in ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

sudo ssh hduser@localhost -i /home/hduser/.ssh/id_rsa

If this gives errors, check which namenodes are configured:

hdfs getconf -namenodes

If you see the following:

OpenJDK 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/ which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
14/03/13 15:27:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Try the following in /usr/local/hadoop/etc/hadoop/

export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

(Although this should work in ~/.bashrc it appears not to)

Using files in HDFS

#Create a directory and copy a file to and fro
hadoop fs -mkdir -p /user/hduser
hadoop fs -copyFromLocal someFile.txt someFile.txt

hadoop fs -copyToLocal /user/hduser/someFile.txt someFile2.txt

#Get a directory listing of the user’s home directory in HDFS

hadoop fs -ls

#Display the contents of the HDFS file /user/hduser/someFile.txt
hadoop fs -cat /user/hduser/someFile.txt

#Delete the file
hadoop fs -rm someFile.txt

Doing something

Context is everything so what am I trying to do?

I am working with VCF (Variant Call Format) files which are used to hold genetic information – I won’t go into details as it’s not very relevant here.

VCF is a text file format. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.

Hadoop itself is written in Java, so the natural choice for interacting with it is a Java client; however, while there is a VCF reader in GATK, it is more common to use Python.

Tutorials in Data-Intensive Computing gives some great, if currently incomplete, advice on using Hadoop Streaming together with pyvcf (there’s also some nice material on using Hadoop on a more traditional cluster, which is an alternative to the methods described above).

Pydoop provides an alternative to Streaming via hadoop pipes but seems not to have quite caught up with the current state of play.

Another possibility is to use Jython to translate the Python into Java (see here).

One nice thing about using Streaming is that it’s fairly easy to do a comparison between a Hadoop implementation and a traditional implementation.

So here are some numbers (using the scripts from the Data-Intensive Computing tutorial).

Create the header file:

 -b data.vcf > header.txt


Pipes (traditional command line):

date;$(which python) $PWD/ -m $PWD/header.txt,0.30 < data.vcf |
$(which python) $PWD/ -r > out;date

Hadoop (Single node on the same computer)

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -mapper "$(which python) $PWD/ -m $PWD/header.txt,0.30" -reducer "$(which python) $PWD/ -r" -input $PWD/vcfparse/data.vcf -output $PWD/vcfparse/output

The output files contain the same data; however, the rows are held in a different order.

When running on a cluster we’ll need to use the -combiner option and -file to ship the scripts to the cluster.

The MapReduce framework orders the keys in the output; if you’re doing this in Java you will get an iterator for each key but, obviously, not when you’re streaming.
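Because the framework sorts the mapper output, equal keys arrive at a streaming reducer consecutively, so you can rebuild the per-key iterator yourself with itertools.groupby. A minimal sketch (the sample lines and the simple count-only value are mine, for illustration):

```python
import itertools


def parse(lines):
    """Yield (key, value) pairs from tab-separated mapper output."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        yield key, value


def reduce_counts(lines):
    """Group runs of equal keys, like the Java iterator-per-key API."""
    for key, group in itertools.groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(1 for _ in group)


# sorted mapper output: equal keys are adjacent, so groupby sees each key once
mapper_output = ["1-500\t1-True\n", "1-500\t1-False\n", "2-500\t1-True\n"]
for key, count in reduce_counts(mapper_output):
    print("%s\t%d" % (key, count))
```

Note this only works because the input is sorted; groupby merges adjacent runs, not all occurrences of a key.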

Running locally with a 1020M test input file seems to indicate a good speed up (~2 mins vs ~6 mins). Now that I’ve tried it with a relatively small file, it’s time to scale up a bit: a 12G file and an 8 processor VM (with slower disk). Not an ideal test machine, but it’s what I’ve got easily to hand, and it’s better than using my desktop where there are other things going on.


You can look at some basic statistics via http://localhost:8088/

Note that it does take a while to copy the file to/from the Hadoop file system, which is not included in these timings.

Number of splits: 93

method      | Map Jobs | Reduce Jobs | Time
pipes       | N/A      | N/A         | 2 hours 46 mins 17 secs
Single Node | Default  | Default     | 49 mins 29 secs
Single Node | 4        | 4           | 1 hr 14 mins 51 secs
Single Node | 6        | 2           | 1 hr 6 secs
Single Node | 2        | 6           | 1 hr 13 mins 25 secs

An example using streaming, Map/Reduce with a tab based input file

Assuming you’ve got everything set up

Start your engines

If necessary

dfs is the file system

yarn is the job scheduler

Copy your input file to the dfs

hadoop fs -mkdir -p /user/hduser
hadoop fs -copyFromLocal someFile.txt data
hadoop fs -ls -h

The task

The aim is to calculate the variant density using a particular window on the genome.

This is a slightly more complex version of the classic “hello world” of Hadoop: the word count.

Input data

The input file is a tab delimited file containing one line for each variant: 28G, over 95,000,000 lines.

We are interested in the chromosome, position and whether the PASS filter has been applied.

The program

First we need to work out which column contains the PASS filter – awk is quite helpful to check this

head -1 data | awk -F'\t' '{print $18}'

(Remember awk counts from 1 not 0)
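The same check can be done in Python, remembering the off-by-one (this small helper is mine, just for illustration):

```python
def column(line, awk_index):
    """Return the field awk would print as $awk_index (awk counts from 1)."""
    return line.rstrip("\n").split("\t")[awk_index - 1]


# e.g. with the first line of the data file read into `header`,
# column(header, 18) is the same field as awk's $18 (cells[17] in Python)
```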

The mapper

For the mapper we will build a key/value pair for each line – the key is a combination of the chromosome and bucket (1kb window) and the value a count and whether it passes/fails (we don’t really need the count…)

#!/usr/bin/env python

import sys

for line in sys.stdin:
    cells = line.rstrip('\n').split('\t')
    chrom = cells[0]
    pos = cells[1]
    pass_filter = None
    if cells[17] == "True":
        pass_filter = True
    if cells[17] == "False":
        pass_filter = False
    if pass_filter is not None:
        bucket = int(int(pos) / 1000)
        point = (bucket + 1) * (1000 / 2)
        print("%s-%d\t1-%s" % (chrom, point, str(pass_filter)))


You can easily test this on the command line using pipes e.g.

head -5 data | python

The reducer

The reducer takes the output from the mapper and merges it according to the key

Test again using pipes


#!/usr/bin/env python

import sys

last_key = None
running_total = 0
passes = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    variant, pass_filter = value.split('-')
    if last_key == this_key:
        running_total += int(variant)
        if pass_filter == "True":
            passes += 1
    else:
        if last_key:
            chrom, pos = last_key.split('-')
            print("%s\t%s\t%d\t%d" % (chrom, pos, running_total, passes))
        running_total = int(variant)
        if pass_filter == "True":
            passes = 1
        else:
            passes = 0
        last_key = this_key

if last_key:
    chrom, pos = last_key.split('-')
    print("%s\t%s\t%d\t%d" % (chrom, pos, running_total, passes))



head -5 data | python  | sort | python 
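The reducer only compares each key with the previous one, so its input must be sorted; Hadoop does that in the shuffle, which is why a plain shell test needs a sort between the two scripts. A self-contained sketch of the same map/sort/reduce cycle (a simplified restatement of the scripts above, with made-up input rows):

```python
def map_line(line, pass_col=17, window=1000):
    """Emit (chrom-point, passed) for one tab-separated input row."""
    cells = line.rstrip("\n").split("\t")
    flag = cells[pass_col]
    if flag not in ("True", "False"):
        return None
    bucket = int(cells[1]) // window
    point = (bucket + 1) * (window // 2)
    return ("%s-%d" % (cells[0], point), flag == "True")


def reduce_sorted(pairs):
    """Mirror the streaming reducer: assumes equal keys are adjacent."""
    results = []
    last_key, total, passes = None, 0, 0
    for key, passed in pairs:
        if key != last_key:
            if last_key is not None:
                results.append((last_key, total, passes))
            last_key, total, passes = key, 0, 0
        total += 1
        passes += 1 if passed else 0
    if last_key is not None:
        results.append((last_key, total, passes))
    return results


# made-up rows: 18 tab-separated columns with the PASS flag in column 18
rows = [
    "\t".join(["1", "150"] + [""] * 15 + ["True"]),
    "\t".join(["1", "900"] + [""] * 15 + ["False"]),
    "\t".join(["2", "10"] + [""] * 15 + ["True"]),
]
pairs = sorted(p for p in (map_line(r) for r in rows) if p is not None)
for key, total, passes in reduce_sorted(pairs):
    print("%s\t%d\t%d" % (key, total, passes))
```

If you drop the sorted() call the per-key counts come out wrong, which is exactly the failure mode of testing mapper | reducer without a sort in between.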



Note the mapper and reducer scripts are on the local file system, while the input and output files are in HDFS.

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -mapper "$(which python) $PWD/" -reducer "$(which python) $PWD/" -input data -output output

Copy the output back to the local file system

hadoop fs -copyToLocal output

If you want to sort the output then the following command does a nice job

sort -V output/part-00000

Don’t forget to clean up after yourself

hadoop fs -rm -r output
hadoop fs -rm data

Now we’ve got the job running we can start to look at making it go faster. The first thing to try is increasing the number of tasks; I’m using an 8 processor VM so I’ll try 4 of each to start with (note that the property names for doing this have changed in Hadoop 2).

hadoop fs -rm -r output
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.3.0.jar -D mapreduce.job.reduces=4 -D mapreduce.job.maps=4 -mapper "$(which python) $PWD/" -reducer "$(which python) $PWD/" -input data -output output

Looking at the output I can see

INFO mapreduce.JobSubmitter: number of splits:224

This seems to indicate that I could usefully go up to 224(*2) jobs if I had enough cores free
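The split count lines up with the HDFS block size: assuming the Hadoop 2.x default of 128 MB, the 28G input divides into exactly 224 blocks, one map task per block. A quick sanity check:

```python
import math

block_mb = 128          # dfs.blocksize default in Hadoop 2.x (assumption)
input_mb = 28 * 1024    # the 28G input file

print(math.ceil(input_mb / block_mb))  # 224 splits, one map task per block
```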


method      | Map Jobs | Reduce Jobs | Time
pipes       | N/A      | N/A         | 31 mins 31 secs
Single Node | Default  | Default     | 39 mins 6 secs
Single Node | 4        | 4           | 28 mins 18 secs
Single Node | 5        | 3           | 33 mins 1 sec
Single Node | 3        | 5           | 31 mins 8 secs
Single Node | 5        | 5           | 28 mins 40 secs
Single Node | 6        | 2           | 49 mins 55 secs


From these brief experiments it looks like there is no point using the Map/Reduce framework for trivial tasks, even on large files.

A more positive result is that it looks like there may well be some advantage for more complex tasks and this merits some further investigation as I’ve only scratched the surface here.

Some things to look at are:

  • The output won’t be in the same order as the input; if this is important, Hadoop Streaming has Comparator and Partitioner options to help sort results from the map to the reduce
  • You can decide to split the map outputs based on certain key fields, not the whole keys – see the Hadoop Partitioner class
  • See the docs for 1.2.1 here

How do I generate output files with gzip format?

Instead of plain text files, you can generate gzip files as your generated output. Pass ‘-D mapred.output.compress=true -D’ as option to your streaming job.

How to use a compressed input file

Alfresco as Extranet

In a couple of projects I’ve worked on we’ve been using Alfresco as an extranet – that’s to say we’ve given external people access to our Alfresco instance so that we can collaborate by sharing documents and using the other site functions like discussion lists and wikis.

We’ve also had these Alfresco instances integrated into a wider single sign on system.

We want people to be able to self register into the SSO system, for a number of reasons.

This has led to a couple of problems.

Firstly we don’t want somebody to be able to self register and then log into Alfresco and collect our user list by doing a people search.
Secondly we’d like to be able to restrict who can log into Alfresco but give a helpful message if they’ve authenticated successfully.
Thirdly we want to restrict site creation.

Restricting site creation

I’ll cover this first because it’s quite straightforward and documented elsewhere.

There are two parts to the problem:
1) Blocking access to the api
2) Removing the menu option in the UI.

Part 1 can be done by modifying the appropriate bean in public-services-security-context.xml
Part 2 will depend on your version of Alfresco and is adequately covered elsewhere.

Restricting access to the user list

This has come up a few times

It’s even in the To Do section on the wiki

Simple approach

The simplest approach is to change the permissions on /sys:system/sys:people
You can do this by finding the nodeRef using the Node Browser and going to: share/page/manage-permissions?nodeRef=xxxx

You’ll need to create a group of all your users and give them read permission, replacing the EVERYONE permission.

You could get carried away with this by changing the permissions on individual users but that’s not a great idea.

More complex approach

A more complex approach is to use ACLs in a similar fashion to the approach used to block site creation; however, this does require some custom code and still isn’t perfect.

There are some changes required to make this work nicely above and beyond creating the custom ACL code

In org.alfresco.repo.jscript.People.getPeopleImpl, if getPeopleImplSearch is used then:
1) it’s not using PersonService.getPeople
2) if it’s using FTS and PersonService.getPerson subsequently throws an AccessDeniedException then it will cause an error (in the case of an exception this will fall through, giving the desired result, but not in a good way, as the more complex search capabilities will be lost)

This, I think, would be a relatively simple change, although I’m not sure whether to catch an exception or use the getPersonOrNull method and ignore the null – I’m going with the latter:

// FTS
  List personRefs = getPeopleImplSearch(filter, pagingRequest, sortBy, sortAsc);

  if (personRefs != null) {
    persons = new ArrayList(personRefs.size());
    for (NodeRef personRef : personRefs) {
      Person p = personService.getPersonOrNull(personRef);
      if (p != null) {

The usernamePropertiesDecorator bean will throw an exception if access to the person node is denied – this has a major impact, so we need to replace it with a custom implementation that swallows the exception and outputs something sensible instead.

I’ve logged an issue to get these fixes made.

Oddities that don’t appear to break things

The user profile page /share/page/user/xxxx/profile will show your own profile if you try to access a profile that you don’t have access to – strange but relatively harmless
The relevant exceptions are:

There are numerous places where the user name will be shown instead of the actual name if permission to access the person record is denied, i.e. it’s not using the usernamePropertiesDecorator; this appears to be done via Alfresco.util.userProfileLink. While far from ideal, this isn’t too bad, as the information is only shown when you have access to a node but not to its creator/modifier information, e.g. a shared document.

Other approaches

It looks like there are a few ways to go about doing this…

The forum posts listed discuss (sketchily!) modifying the client side JavaScript and the webscripts.

At the lowest level you could modify the PersonService and change the way that the database is queried, but that seems too low level.

config/alfresco/ibatis/alfresco-SqlMapConfig.xml defines queries
config/alfresco/ibatis/org.hibernate.dialect.Dialect/query-people-common-SqlMap.xml defines alfresco.query.people
which is used in
which in turn is used by

Restricting access

As this seems to have come up a few times as well…

I’m trying to work out if it’s possible to disable some external users.

My scenario is that I have SSO and LDAP enabled but I only want users who are members of a site to be able to access Share – ideally I’d like to send other users to a static page where they are shown some information. At the moment, if you attempt to access a page for which you don’t have access, e.g. share/page/console/admin-console, you will go to the Share log in page (which you wouldn’t otherwise see).

I still want all the users sync’ed so using a filter to restrict the LDAP sync isn’t an option.
I only want to restrict access to Alfresco so fully disabling the account isn’t an option.

It’s relatively easy to identify the users and apply the cm:personDisabled aspect but this doesn’t appear to do anything.
See this issue.

I think the reason the aspect doesn’t work is that isAuthenticationMutable returns false and therefore the aspect is not checked.

I can see the idea of not changing the sync’ed users – otherwise a full resync will lose changes
I can also see not wanting to allow updates to LDAP although the case for that is perhaps weaker

However given that it’s possible to edit profiles under these circumstances, e.g. for telephone number, wouldn’t it make more sense for the cm:personDisabled to be treated along with the Alfresco specific attributes and therefore editable rather than with the LDAP specific attributes and therefore not editable?
Actually applicable is probably a better word rather than editable as it’s possible to apply the aspect programmatically – it just doesn’t do anything.

I did think about checking some field in LDAP but I don’t think that would work without getting into custom schemas (not a terrible idea but not a great one either)

So going back to my earlier requirement to show a page to users who don’t have the requisite permission I came up with the following approach:

  • Use a cron based action to add all site users to a group all_site_users
  • Use evaluators to check if user is a member of all_site_users and if not then:
    1. Hide the title bar and top menu
    2. Hide dashboard dashlets
    3. Show a page of text

Further adventures with CAS and Alfresco (and LDAP)

Like Alfresco in the cloud and myriad other systems we’ve decided to use the email address as the user name for logging in. This works fine until you want to allow the user to be able to change their email.

The problem here is that Alfresco doesn’t support changing user names (I believe it can be done with some database hacking, but it’s not recommended).

My solution here is to allow logging in via CAS using the mail attribute as the user name, but to pass the uid to Alfresco to use as the Alfresco user name. While this means that the Alfresco user name is not the same as the one used to log in, it does allow you to change the mail attribute, and as the user name isn’t often visible this works quite well. Actually it’s not too bad to set the uid to the mail address, especially if the rate of change is low, although there are some situations where this is potentially confusing.

So how to do it…

First configure CAS (I’m using 4.0_RC2 at the moment)

In your deployerConfigContext.xml find your registeredServices and add

 <property name="usernameAttribute" value="uid"/>

so you end up with something like this:

<bean class="" p:id="0"
	p:name="HTTP and IMAP" p:description="Allows HTTP(S) and IMAP(S) protocols"
	p:serviceId="^(https?|imaps?)://*" p:evaluationOrder="10000001">
    <property name="usernameAttribute" value="uid"/>

For 4.1 you’ll need:

Note that you need the allowedAttributes to contain the usernameAttribute otherwise the value of the usernameAttribute will be ignored.

<bean class="" p:id="0"
<property name="usernameAttributeProvider">
c:usernameAttribute="uid" />
<property name="attributeReleasePolicy">
<bean class="">
<property name="allowedAttributes">

Now to configure Share and Alfresco (see previous posts)

If you are using CAS 4.0_RC2 then make sure that you are using the CAS 2 protocol (or SAML, but I’d go with CAS 2), so if you are using the Java client then in the web.xml your CAS Validation Filter will be:

   <filter-name>CAS Validation Filter</filter-name>

(This will work for CAS 1 in later versions)

Adding files to your amp

When you’re writing an Alfresco extension there’s a good chance that you’ll want to do some configuration or add some files along with your code.

One option is to, via a documented process, add everything by hand but it’s neater and more reliable if you can do it as part of your amp.

The trick here is to use acp files.

These files are created by exporting from the Alfresco client (see here) – there is a good chance you’ll want to edit the acp files after you’ve created them, e.g. to remove system files.

If you want to include the acp file directly in your amp then you should include them as part of the bootstrap process.

This is a two part operation.

Copy the acp file to the amp e.g. in /src/main/resources/alfresco/module/org_wrighting_module_cms/bootstrap

Add the following bean definition to /src/main/resources/alfresco/module/org_wrighting_module_cms/context/bootstrap-context.xml

  <bean id="org_wrighting_module_cms_bootstrapSpaces" class="org.alfresco.repo.module.ImporterModuleComponent">
        <property name="moduleId" value="org.wrighting.module.cms" />
        <property name="name" value="importScripts" />
        <property name="description" value="additional Data Dictionary scripts" />
        <property name="sinceVersion" value="1.0.0" />
        <property name="appliesFromVersion" value="1.0.0" />

        <property name="importer" ref="spacesBootstrap"/>
        <property name="bootstrapViews">
            <list>
                <props>
                    <prop key="path">/${spaces.company_home.childname}/${spaces.dictionary.childname}/app:scripts</prop>
                    <prop key="location">alfresco/module/org_wrighting_module_cms/bootstrap/wrighting_scripts.acp</prop>
                </props>
            </list>
        </property>
  </bean>
This will then import your scripts to the Data Dictionary ready for use.

The acp file itself is a zip file containing an XML file describing the enclosed files – it’s a good idea to use the export action to create this as there is a fair amount of meta information involved.

If you want to expand the acp and then copy it into place then the following in your pom.xml will do the job

                     <zip basedir="${basedir}/tools/export/wrighting/wrighting_scripts.acp"
                          destfile="${}/${project.artifactId}-${project.version}/config/alfresco/module/org_wrighting_module_cms/bootstrap/wrighting_scripts.acp" />

CAS for Alfresco 4.2 on Ubuntu

Lots of confusion around on this subject, so I’m going to attempt to distill some wisdom into this post and tweak it for Ubuntu.

Two good blogs: Nick’s using mod_auth_cas and Martin’s using the CAS client; plus the Alfresco docs.

I’m not going to talk about setting up CAS here as this post is complex enough already – I’ll just say be careful if using self signed certs.

I’ve used Martin’s method before with Alfresco 3.4

It’s a tricky decision as to which approach to use:

  • the mod_auth_cas approach is the one supported by Alfresco, but it introduces the Apache plug-in, which isn’t as well supported by CAS, and you have problems with mod_auth_cas cookie management, caching etc.
  • the java client is a bit more involved and intrusive but seems to work quite well in the end
  • I haven’t tried container managed auth but it looks promising

Using mod_auth_cas

For a more detailed explanation look at Nick’s blog – this entry is more about how rather than why and is specific to using apt-get packages on Ubuntu.

First set up your mod_auth_cas

Next tell Tomcat to trust the Apache authentication by setting the following attribute tomcatAuthentication="false" on the AJP Connector (port 8009)

Now you need to set up the Apache Tomcat Connectors module – mod-jk

apt-get install libapache2-mod-jk

Edit the properties file defined in /etc/apache2/mods-enabled/jk.conf – /etc/libapache2-mod-jk/ – to set the following values


Add to your sites file e.g. /etc/apache2/sites-enabled/000-default

JkMount /alfresco ajp13_worker
JkMount /alfresco/* ajp13_worker
JkMount /share ajp13_worker
JkMount /share/* ajp13_worker

And don’t forget to tell Apache which URLs to check

<Location />
AuthType CAS
require valid-user
</Location>

A more complex example in the wiki here

Add the following to tomcat/shared/classes/


Finally add the following section to tomcat/shared/classes/alfresco/web-extension/share-config-custom.xml

Note that if you have customizations you may need this in the share-config-custom.xml in your jar

 	<config evaluator="string-compare" condition="Remote">
				<name>Alfresco - unauthenticated access</name>
				<description>Access to Alfresco Repository WebScripts that do not
					require authentication</description>

				<name>Alfresco - user access</name>
				<description>Access to Alfresco Repository WebScripts that require
					user authentication</description>

				<name>Alfresco Feed</name>
				<description>Alfresco Feed - supports basic HTTP authentication via
					the EndPointProxyServlet</description>

				<name>Activiti Admin UI - user access</name>
				<description>Access to Activiti Admin UI, that requires user
					authentication</description>
This gets you logged in, but you still need to log out! See Share CAS logout.
One thing to be careful about when using mod_auth_cas here is its caching – if you are not careful you’ll log out but mod_auth_cas will still think that you are logged in. There are some options: set the cache timeout to be low (inefficient), or use single sign out (experimental).

Using CAS java client

Martin’s blog works for Alfresco 3.4 and here are some notes I made for 4.2.d

Note that making changes to the web.xml is not supported.

Make the following jars available:

cas-client-core-3.2.1.jar, commons-logging-1.1.1.jar, commons-logging-api-1.1.1.jar

You can do this by including them in the wars or by copying the following jars into <<alfresco home>>/tomcat/lib
N.B. If you place them into the endorsed directory then you will get error messages like this:
SEVERE: Exception starting filter CAS java.lang.NoClassDefFoundError: javax/servlet/Filter

You need to make the same changes to tomcat/shared/classes/ and share-config-custom.xml as for the mod_auth_cas method

Now add the following to share/WEB-INF/web.xml and alfresco/WEB-INF/web.xml

There’s some fine tuning to do on the url-pattern; probably the best way is to copy the filter mappings for the existing authentication filter and add /page for Share and /faces for Alfresco.

Using the values below works but is a little crude (shown here to be concise)

    <filter-name>CAS Authentication Filter</filter-name>
    <filter-name>CAS Validation Filter</filter-name>
    <filter-name>CAS HttpServletRequest Wrapper Filter</filter-name>
    <filter-name>CAS Authentication Filter</filter-name>
    <filter-name>CAS Validation Filter</filter-name>
    <filter-name>CAS HttpServletRequest Wrapper Filter</filter-name>

Next add the following to the session-config section of the web.xml. This relates to this issue, which may be solved by removing the jsessionid from the URL (this may cause problems with the Flash uploader if you’re still using it – see here).


There’s also a case for using web-fragments to avoid changing the main web.xml

You will need to redirect the change password link in the header (how to depends on version)

Container managed auth

This looks quite interesting CAS Tomcat container auth as it allows the use of the CAS java client within tomcat so being closer to the mod_auth_cas approach but without needing to configure Apache.

This issue referenced above gives some details of how somebody tried it – I think it should work if the session tracking mode is set to COOKIE but haven’t tried it.

More complex configurations

This is beyond what I’m trying to do, but if you’ve got a load balanced configuration you may need to think about session management – the easiest approach may be to use sticky sessions, e.g.

ProxyRequests Off
ProxyPassReverse /share balancer://app
ProxyPass /share balancer://app stickysession=JSESSIONID|jsessionid nofailover=On

BalancerMember ajp://localhost:8019/share route=tomcat3
BalancerMember ajp://localhost:8024/share route=tomcat4


mod_auth_cas for CAS 3.5.2 on Ubuntu

This is not as straightforward as it should be, as mod_auth_cas has not yet been brought up to date with the latest SAML 1.1 schema and the XML parsing doesn’t support the changes. In addition, the pull request with the changes on GitHub is out of date with the main branch, so that’s not much help either.

That being said, if you don’t use SAML validation for attribute release you can still go ahead.

apt-get install libapache2-mod-auth-cas
a2enmod auth_cas

Add the CAS configuration, which you can do in /etc/apache2/mods-enabled/auth_cas.conf

CASCookiePath /var/cache/apache2/mod_auth_cas/
CASDebug Off
CASValidateServer On
CASVersion 2
#Only if using SAML
#CASValidateSAML Off
#CASAttributeDelimiter ;
#Experimental sign out
CASSSOEnabled On


Configure the protected directories probably somewhere in /etc/apache2/sites-enabled
N.B. You also need to ensure that the ServerName is set, otherwise the service parameter on the call to CAS will contain the wrong hostname

    AuthType CAS
    CASAuthNHeader On
    require valid-user
    #Only works if you are using attribute release, which requires SAML validation
    #require cas-attribute memberOf:cn=helpDesk,ou=groups,dc=wrighting,dc=org
    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
    Order allow,deny
    Allow from all

    AuthType CAS
    require valid-user


Don’t forget to reload Apache: service apache2 reload

Update: mod_auth_cas is now being maintained again. To build it from source:

mkdir /var/cache/apache2/mod_auth_cas
chown www-data:www-data /var/cache/apache2/mod_auth_cas
apt-get install make apache2-prefork-dev libcurl4-gnutls-dev
git clone
make install


Editing a dojox DataGrid connected to a JsonRest store

A little bit of a problem with using a dojox.grid.DataGrid connected to a data store.

Note that dojox.grid is officially abandoned, so it’s not really a good idea to use it anyway – you should be using dgrid or GridX instead.

If you edit a cell value then timing issues mean that although the change is held (and will be saved when you save the store) the displayed value reverts to the original.

One way around this is to make use of onApplyCellEdit.

You can also use this function to trigger a save if you want to save after every change.

There also appears to be a problem with added rows not displaying.

function saveCellEdit(inValue, inRowIndex, inAttrName) {
	var data = this.getItem(inRowIndex);
	data[inAttrName] = inValue;
}

/* create a new grid */
var grid = new DataGrid({
	id : 'grid',
	store : store,
	structure : layout,
	rowSelector : '20px',
	onApplyCellEdit : saveCellEdit
});

Some relevant links:

Using the LDAP Password Modify extended operation with Spring LDAP

If you want to change the password for a given user in an LDAP repository then you need to worry about the format in which it is being stored otherwise you will end up with the password held in plain text (although base64 encoded)

Using the password modify extended operation (rfc3062) allows OpenLDAP, in this case, to manage the hashing of the new password.

If you don’t use the extension then you have to hash the value yourself.
This code stores the new password as plaintext and treats the password as if it is any other attribute.
You can implement hashing yourself e.g. by prepending {MD5} and using the base64 encoded md5 hash of the new password – see this forum entry
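For illustration only, here is what that {MD5} scheme looks like in Python (the helper name is mine; OpenLDAP’s default salted SSHA, via the extended operation, is a much better choice):

```python
import base64
import hashlib


def md5_userpassword(plaintext):
    """Build an RFC 2307-style {MD5} userPassword value:
    base64 of the raw MD5 digest, prefixed with the scheme name."""
    digest = hashlib.md5(plaintext.encode("utf-8")).digest()
    return "{MD5}" + base64.b64encode(digest).decode("ascii")


print(md5_userpassword("secret"))  # {MD5}Xr4ilOzQ4PCOq3aQ0qbuaQ==
```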

Don’t use this!

	DistinguishedName dn = new DistinguishedName(dn_string);
	Attribute passwordAttribute = new BasicAttribute(passwordAttr, new_password);
	ModificationItem[] modificationItems = new ModificationItem[2];
	modificationItems[0] = new ModificationItem(
			DirContext.REPLACE_ATTRIBUTE, passwordAttribute);
	Attribute userPasswordChangedAttribute = new BasicAttribute(
			LDAP_PASSWORD_CHANGE_DATE, format.format(convertToUtc(null)
					.getTime()) + "Z");
	ModificationItem newPasswordChanged = new ModificationItem(
			DirContext.REPLACE_ATTRIBUTE, userPasswordChangedAttribute);
	modificationItems[1] = newPasswordChanged;
	getLdapTemplate().modifyAttributes(dn, modificationItems);


This example uses the extended operation which means that password will be stored according to the OpenLDAP settings i.e. SSHA by default.

The ldap template here is an instance of org.springframework.ldap.core.LdapTemplate

 ldapTemplate.executeReadOnly(new ContextExecutor() {
   public Object executeWithContext(DirContext ctx) throws NamingException {
      if (!(ctx instanceof LdapContext)) {
            throw new IllegalArgumentException(
               "Extended operations require LDAPv3 - "
               + "Context must be of type LdapContext");
      }
      LdapContext ldapContext = (LdapContext) ctx;
      ExtendedRequest er = new ModifyPasswordRequest(dn_string, new_password);
      return ldapContext.extendedOperation(er);
   }
 });
This thread gives an idea of what is required however the ModifyPasswordRequest class available from here actually has all the right details implemented.

You will find that other LDAP libraries e.g. ldapChai use the same ModifyPasswordRequest class

CAS, OpenLDAP and groups

This is actually fairly straightforward if you know what you’re doing; unfortunately it takes a while, for me at least, to get to that level of understanding.

Probably the most important thing missing from the pages I’ve seen describing this is that you need to configure OpenLDAP first.


What you want is to enable the memberOf overlay.

For Ubuntu 12.04 the steps are as follows.
Create two files – first module.ldif:

dn: cn=module,cn=config
objectClass: olcModuleList
cn: module
olcModulePath: /usr/lib/ldap
olcModuleLoad: memberof

then overlay.ldif:

dn: olcOverlay=memberof,olcDatabase={1}hdb,cn=config
objectClass: olcMemberOf
objectClass: olcOverlayConfig
objectClass: olcConfig
objectClass: top
olcOverlay: memberof
olcMemberOfDangling: ignore
olcMemberOfRefInt: TRUE
olcMemberOfGroupOC: groupOfNames
olcMemberOfMemberAD: member
olcMemberOfMemberOfAD: memberOf

Then configure OpenLDAP as follows:

ldapadd -Y EXTERNAL -H ldapi:/// -f module.ldif
ldapadd -Y EXTERNAL -H ldapi:/// -f overlay.ldif

You should probably read up on this a bit more – in particular, note that adding this retrospectively won’t achieve what you want without extra steps to reload the groups.


The CAS documentation is actually reasonably good once you understand that you are after the memberOf attribute, but by way of example I’ll show some config here.


<bean id="attributeRepository">
    <property name="contextSource" ref="contextSource" />
    <property name="baseDN" value="ou=people,dc=wrighting,dc=org" />
    <property name="requireAllQueryAttributes" value="true" />

    <!-- Attribute mapping between principal (key) and LDAP (value) names used 
        to perform the LDAP search. By default, multiple search criteria are ANDed 
        together. Set the queryType property to change to OR. -->
    <property name="queryAttributeMapping">
        <map>
            <entry key="username" value="uid" />
        </map>
    </property>

    <property name="resultAttributeMapping">
        <map>
            <!-- Mapping between LDAP entry attributes (key) and Principal's (value) -->
            <entry value="Name" key="cn" />
            <entry value="Telephone" key="telephoneNumber" />
            <entry value="Fax" key="facsimileTelephoneNumber" />
            <entry value="memberOf" key="memberOf" />
        </map>
    </property>
</bean>


After that you can set up your CAS to use SAML 1.1 or modify view/jsp/protocol/2.0/casServiceValidationSuccess.jsp according to your preferences.

Don’t forget to allow the attributes for the registered services as well.

<bean id="serviceRegistryDao" class="">
		<property name="registeredServices">
			<list>
				<bean class="">
					<property name="id" value="0" />
					<property name="name" value="HTTP and IMAP" />
					<property name="description" value="Allows HTTP(S) and IMAP(S) protocols" />
					<property name="serviceId" value="^(https?|imaps?)://.*" />
					<property name="evaluationOrder" value="10000001" />
					<property name="allowedAttributes">
						<list>
							<value>memberOf</value>
						</list>
					</property>
				</bean>
			</list>
		</property>
</bean>


A first R project

To start with I’m using the ProjectTemplate library, which creates a nice project structure.

I’m going to be attempting to analyze some census data so I’ll call the project ‘census’

I’m interested in Eynsham but it’s quite hard to work out which files to use – in the end I’ve stumbled across the parish of Eynsham at
this page and the ward of Eynsham from 2001 here which seem roughly comparable.

This blog entry is also quite interesting although I found it rather late in the process


Now we can copy some census data files into the data directory then load it all up.
(I’m not going to cover downloading the data files and creating an index of categories – it’s more painful than it should be but not that hard – I’ve used the category number as part of the file name in the downloaded files)




This doesn’t work so some experimentation is called for…

I don’t think we can munge the data until after it’s loaded, so switch off data_loading in global.dcf.

So let’s create a cache of the data in a more generic fashion

With the parish data for 2011, load up the categories:

datatypes <- read.csv("data/datatypes.parish.2011.csv", sep="\t")
datadef <- t(datatypes[,1])
colnames(datadef) <- t(datatypes[,2])

parishId <- "11123312"

for (d in 1:length(datadef)){
  datasetName <- paste(parishId, datadef[d], sep = ".")
  filename <- paste("data/", datasetName, ".csv", sep = "")
  input.raw <- read.csv(filename, header=TRUE, sep=",", skip=2)
  input.t <- t(input.raw[2:(nrow(input.raw)-4),])
  colnames(input.t) <- input.t[1,]
  input.clipped <- input.t[5:nrow(input.t),]
  input.num <- apply(input.clipped, c(1,2), as.numeric)
  dsName <- paste("parish2011_", datadef[d], sep = "")
  assign(dsName, input.num)
}



Then do the same for 2001

Now, needless to say, the data from 2001 and 2011 is represented in different ways, so there’s a bit of data munging required to standardize and merge similar datasets. I’ve given an example here for population by age: in 2001 the value is given for each year, whereas 2011 uses age ranges, so it’s necessary to merge columns using sum.


ages <- rbind(c(sum(ward2001_91[1,2:6]),sum(ward2001_91[1,7:9]),sum(ward2001_91[1,10:11]),sum(ward2001_91[1,12:16]),
ages <- ages[1:2,]
rownames(ages) <- c("Ward 2001", "Parish 2011")
colnames(ages) <- sub("Age ","",colnames(ages))
colnames(ages) <- sub(" to ","-",colnames(ages))
colnames(ages) <- sub(" and Over","+",colnames(ages))
barplot(ages, beside = TRUE, col = c("blue", "red"), legend=c("2001","2011"), main="Population", width=0.03, space=c(0,0.35), xlim=c(0,1), cex.names=0.6)


The result can be seen here

            0-4 5-7 8-9 10-14 15 16-17 18-19 20-24 25-29 30-44 45-59 60-64 65-74 75-84 85-89 90+
Ward 2001   226 160 119   335 51   112    85   202   250  1072  1003   266   427   241    61  29
Parish 2011 250 139  91   258 59   113   106   227   188   848  1005   314   584   342    80  44

The most interesting feature looks like the big drop in the 25-44 age range – due to house prices, or a lack of available housing as the population ages? This is reflected in the rise in people of retirement age and in the numbers of children. Given the shortage of pre-school places, the increase in the 0-4 range is also interesting.

Lots more analysis could be done but that’s outside the scope of this blog entry!