Ben Rifkind bio photo

Ben Rifkind

I recently completed my Ph.D in probability at the University of Toronto. Now I am messing around with R, python, and statistical analysis.

Github

In this post I will reorganize the data that I have downloaded from the Stats Canada website so that I can start to work with it. Right now I have all census division files from Ontario in the folder “StatsCanadaData/Ontario” and in each file is a csv of all the statistics for that division. I want to extract some interesting subset of each census division data frame (say country of origin of immigrants) and then merge all these divisions together so that I have one big data frame which I can then use to map and/or visualize the data.

First I need to clean up the csv files. They have a lot of footnotes and header information that I want to remove. The next few lines fix this. I read in the files from a particular province skipping the preamble and the footer, rename the columns, and cleanUp by removing the footnote numbering.

# Read files from a particular province
# Output is a list of data frames
# Skip first 4 rows and look at only 830 total rows of interest since end of file is notes
getProvinceFiles <- function(province){
  province.dir = paste0("StatsCanadaData/", province) 
  file.names = list.files(province.dir, full.name = T)
  names(file.names) = gsub(".csv", "", list.files(province.dir, full.name = F))
  temp = sapply(file.names, read.csv, skip = 5, nrows = 830, 
                col.names = c("Characteristics", "Total", "English", "French", "English and French"), 
                header = F,
                simplify = FALSE)
  # add name of census division to data frame
  for (k in seq(length(names(temp)))){
    temp[[k]][['Division']] = names(temp)[k]
    }
  # clean up Characteristics column
  temp = sapply(temp, cleanUp, simplify = FALSE) 
  temp
  
  }

# gets rid of all the footnote numbering, $ signs I don't like, and spacing issues at 
# the end of the word. 
 cleanUp <- function(df){
   column = "Characteristics"
   df[[column]] = gsub("\\[.+\\]", "", df[[column]])
   df[[column]] = gsub("(\\$$)|(\\$\\s$)", "in CDN", df[[column]])
   df[[column]] = gsub("\\s+$", "", df[[column]])
   df
}

Now I have a ‘nice’ list of data frames for each census division

ontario.data = getProvinceFiles("Ontario")
names(ontario.data)
##  [1] "Algoma"                         "Brant"                         
##  [3] "Bruce"                          "Chatham-Kent"                  
##  [5] "Cochrane"                       "Dufferin"                      
##  [7] "Durham"                         "Elgin"                         
##  [9] "Essex"                          "Frontenac"                     
## [11] "Greater Sudbury"                "Grey"                          
## [13] "Haldimand-Norfolk"              "Haliburton"                    
## [15] "Halton"                         "Hamilton"                      
## [17] "Hastings"                       "Huron"                         
## [19] "Kawartha Lakes"                 "Kenora"                        
## [21] "Lambton"                        "Lanark"                        
## [23] "Leeds and Grenville"            "Lennox and Addington"          
## [25] "Manitoulin"                     "Middlesex"                     
## [27] "Muskoka"                        "Niagara"                       
## [29] "Nipissing"                      "Northumberland"                
## [31] "Ottawa"                         "Oxford"                        
## [33] "Parry Sound"                    "Peel"                          
## [35] "Perth"                          "Peterborough"                  
## [37] "Prescott and Russell"           "Prince Edward"                 
## [39] "Rainy River"                    "Renfrew"                       
## [41] "Simcoe"                         "Stormont, Dundas and Glengarry"
## [43] "Sudbury"                        "Thunder Bay"                   
## [45] "Timiskaming"                    "Toronto"                       
## [47] "Waterloo"                       "Wellington"                    
## [49] "York"
head(ontario.data[["Algoma"]])
##                  Characteristics  Total English French English.and.French
## 1 Total population by age groups 116075   95735   8025                585
## 2                   0 to 4 years   4890    4560    145                 50
## 3                   5 to 9 years   5640    5220    225                  0
## 4                 10 to 14 years   7285    6850    235                 10
## 5                 15 to 19 years   8005    7435    340                 45
## 6                 20 to 24 years   6695    6195    265                 75
##   Division
## 1   Algoma
## 2   Algoma
## 3   Algoma
## 4   Algoma
## 5   Algoma
## 6   Algoma

A word as to how the data frame is organized. Basically there are a bunch of categories in the census for the statistic (income, education, labour) and then there are subcategories and sub-sub-categories. The way that this is organized in the data frame is via the number of spaces before the entry. 0 spaces for head of the category, 2 spaces for the subcategory, 4 spaces for the sub-subcategory, etc. I wrote functions in order to extract the top categories and also the statistics for a given category and its subcategories.

The first function just gets all the top level characteristics of these stats.

# get the top level characteristics of the data frame
# characteristics that includes the word "Total"
# output is a list of the top level characteristics in the data frame
getCharacters <- function(df){
 df[grepl("^Total", df[["Characteristics"]]), ][["Characteristics"]] 
}

characters = getCharacters(ontario.data[[1]])
characters[1:5]
## [1] "Total population by age groups"                            
## [2] "Total population 15 years and over by legal marital status"
## [3] "Total population 15 years and over by common-law status"   
## [4] "Total population by knowledge of official languages"       
## [5] "Total population by first official language spoken"

These functions extract the particuar subset and the specified number of subcategories.

# helper function that extracts a particular characteristic of the data and all subcategories
# top category has no spaces at start of the word, sub category has 2 spaces, sub sub has 4 
# input is the data frame and characteristic we want
# value is subset of the data frame
getFullCategory <- function(df, characteristic){
 begin = grep(paste0("\\w*", characteristic), df[["Characteristics"]])
 end = begin
 bool = TRUE
 while (bool){ 
   bool = grepl("^\\s", df[["Characteristics"]][end+1])
   end = end + 1
 }
 return(df[begin:(end-1),])
}

## helper function that extracts a particular characteristic of the data and specified subcat
# input is the data frame, characteristics, and number of subcategories
# value is subset of the data frame
getSubCategory <- function(df, characteristic, depth){
 depth = 2*depth +1
 temp = getFullCategory(df, characteristic)
 temp[!(grepl(paste0("^\\s{", depth,"}"), temp[["Characteristics"]])),]
}

Here I construct a function that takes in a list of data frames along with a characteristic that we want and outputs one big data frame of that characteristic and subcharacteristics.

# Construct data frame of characteristic over all census divisions
# input is list of data frames, characteristic
# output is data frame
getCharDataFrame <- function(df.list, characteristic, depth = 2){
  temp = sapply(df.list, getFullCategory, 
                characteristic = characteristic,
                simplify = F)
  total = do.call(rbind, temp) #put all data frames together
  row.names(total) = NULL
  #reorder the data frame according to Characteristics
  characteristics = temp[[1]][[1]]
  total[['Characteristics']] = factor(total[['Characteristics']], 
                                      levels = characteristics)
  total = total[order(total[['Characteristics']]),]
  
  total[, c(1,6,2,3,4,5)]
} 

“Total population by immigrant status and place of birth” looks like an interesting statistic. I construct a big data frame of all the census divisions of “Ontario” with that stat and write it to a csv file.

immigrant.status = getCharDataFrame(ontario.data, "Total population by immigrant status and place of birth", characters[23])
head(immigrant.status)
##                                             Characteristics     Division
## 1   Total population by immigrant status and place of birth       Algoma
## 39  Total population by immigrant status and place of birth        Brant
## 77  Total population by immigrant status and place of birth        Bruce
## 115 Total population by immigrant status and place of birth Chatham-Kent
## 153 Total population by immigrant status and place of birth     Cochrane
## 191 Total population by immigrant status and place of birth     Dufferin
##      Total English French English.and.French
## 1   116080   95730   8025                585
## 39  123320  108755   1330                120
## 77   64555   59580    545                 55
## 115 107150   93405   2885                345
## 153  81465   36660  38115               1130
## 191  53920   49360    595                 90
write.csv(immigrant.status, "ImmigrantStatusOntario.csv")

Next up, mapping this data frame onto a geographical map of Ontario.