A first R project

To start with I’m using the ProjectTemplate library – this creates a nice project structure

I’m going to be attempting to analyze some census data so I’ll call the project ‘census’

I’m interested in Eynsham but it’s quite hard to work out which files to use – in the end I’ve stumbled across the parish of Eynsham at
this page and the ward of Eynsham from 2001 here which seem roughly comparable.

This blog entry is also quite interesting although I found it rather late in the process

install.packages("ProjectTemplate")
library('ProjectTemplate')
setwd("census")

Now we can copy some census data files into the data directory then load it all up.
(I’m not going to cover downloading the data files and creating an index of categories – it’s more painful than it should be but not that hard – I’ve used the category number as part of the file name in the downloaded files)

 

load.project()

 

This doesn’t work so some experimentation is called for…

I don’t think we can munge the data until after it’s loaded so switch off the data_loading in global.dcf

So let’s create a cache of the data in a more generic fashion

With the parish data for 2011 load up the categories

datatypes = read.csv("data/datatypes.parish.2011.csv", sep="t")
datadef = t(datatypes[,1])
colnames(datadef) <- t(datatypes[,2])

parishId = "11123312"

for (d in 1:length(datadef)){
  datasetName <- paste(parishId,datadef[d], sep = ".")
  filename <- paste("data/",datasetName,".csv", sep = "")
  input.raw = read.csv(filename,header=TRUE,sep=",", skip=2)
  input.t <- t(input.raw[2:(nrow(input.raw)-4),])
  colnames(input.t) <- input.t[1,]
  input.clipped <- input.t[5:nrow(input.t),]
  input.num <- apply(input.clipped,c(1,2),as.numeric)
  dsName <- paste("parish2011_",datadef[d], sep = "")
  assign(dsName, input.num)
  cache(dsName)

}

 

Then do the same for 2001

Now needless to say the data from 2001 and 2011 is represented in different ways so there’s a bit of data munging required to standardize and merge similar datasets – I’ve given an example here for population by age where in 2001 the value is given for each year whereas 2011 uses age ranges so it’s necessary to merge columns using sum

 

png(file="graphs/populationByAge.png",width=659,height=555)
ages <- rbind(c(sum(ward2001_91[1,2:6]),sum(ward2001_91[1,7:9]),sum(ward2001_91[1,10:11]),sum(ward2001_91[1,12:16]),
                ward2001_91[1,17],sum(ward2001_91[1,18:19]),sum(ward2001_91[1,20:21]),
                sum(ward2001_91[1,22:26]),sum(ward2001_91[1,27:31]),sum(ward2001_91[1,32:46]),
                sum(ward2001_91[1,47:61]),sum(ward2001_91[1,62:66]),sum(ward2001_91[1,67:76]),
                sum(ward2001_91[1,77:78]),sum(ward2001_91[1,79]),sum(ward2001_91[1,80:82])
                ),parish2011_2474[,c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)])
ages <- ages[1:2,]
rownames(ages) <- c("Ward 2001", "Parish 2011")
colnames(ages) <- sub("Age ","",colnames(ages))
colnames(ages) <- sub(" to ","-",colnames(ages))
colnames(ages) <- sub(" and Over","+",colnames(ages))
barplot(ages, beside = TRUE, col = c("blue", "red"), legend=c("2001","2011"), main="Population", width=0.03, space=c(0,0.35), xlim=c(0,1), cex.names=0.6)
dev.off()

 

The result can be seen here

            0-4 5-7 8-9 10-14 15 16-17 18-19 20-24 25-29 30-44 45-59 60-64 65-74 75-84 85-89 90+
Ward 2001   226 160 119   335 51   112    85   202   250  1072  1003   266   427   241    61  29
Parish 2011 250 139  91   258 59   113   106   227   188   848  1005   314   584   342    80  44

Most interesting looks like a big drop in the 25-44 age range – due to house prices or lack of availability of housing as the population ages? – which is reflected in the rise in people of retirement age and numbers of children although, given the shortage of pre-school places the increase in the 0-4 range is also interesting.

Lots more analyis could be done but that’s outside the scope of this blog entry!