Learning Objectives

This tutorial introduces some basic ways to implement machine learning (ML) in R.

  1. Learn how to use the R caret package to train a supervised ML model.

  2. Learn how to use the R keras package to train a basic neural network model.

  3. Use SSA US baby names data to train a model to predict gender.

Use the caret package to train a classic ML model

Max Kuhn maintains detailed documentation on how to use the caret package to train and tune models. You can check here for more details: https://topepo.github.io/caret/.

You can also get his book, Applied Predictive Modeling: http://appliedpredictivemodeling.com/

In this tutorial, we are going to rely on Max Kuhn’s caret package to train a simple classifier to predict gender.

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for: data splitting; pre-processing; feature selection; model tuning using resampling; and variable importance estimation.
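
As a quick orientation, here is a minimal sketch on R's built-in iris data mapping each of these tools to the caret function this tutorial relies on (an assumption: the rpart package is available for the example model):

# A minimal, illustrative caret workflow on the built-in iris data
library(caret)
idx <- createDataPartition(iris$Species, p = .75, list = FALSE)   # data splitting
tr  <- iris[idx, ]
pp  <- preProcess(tr[, -5], method = c("center", "scale"))        # pre-processing
nzv <- nearZeroVar(tr[, -5], saveMetrics = TRUE)                  # feature screening
fit <- train(Species ~ ., data = tr, method = "rpart",
             trControl = trainControl(method = "cv", number = 3)) # tuning via resampling
varImp(fit)                                                       # variable importance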

Of course, there are other packages that you can use to train your own models, but I prefer using caret as it is one of the most popular in our community.

Here our goal is to use data from SSA to train a gender classifier using name features.

Data Preparation

You can download the SSA baby names data here: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data

I have also used the following code to preprocess the data. I generated aggregated name counts by gender and then defined our target outcome as “female if over 50% of the people with that name are female.” I also created four variables: first letter, first two letters, last letter, and last two letters. You can directly use the RData file: https://yongjunzhang.com/files/ssa_baby_names.RData.

Of course, you can take a more rigorous approach to defining the gender of names.
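
For example, a stricter (hypothetical) rule could require a name to be at least 90% one gender and drop ambiguous names. A sketch, assuming dat_all from the RData file above with per-gender counts F and M:

# A stricter, hypothetical labeling rule: keep only names that are
# at least 90% one gender; treat the rest as ambiguous and drop them.
library(tidyverse)
dat_strict <- dat_all %>%
  mutate(p_female = F/(F + M)) %>%
  mutate(female = case_when(
    p_female >= 0.9 ~ 1,
    p_female <= 0.1 ~ 0,
    TRUE ~ NA_real_
  )) %>%
  filter(!is.na(female))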

files <- list.files(path = "./names/",pattern = ".txt",full.names = TRUE)
files
##   [1] "./names//yob1880.txt" "./names//yob1881.txt" "./names//yob1882.txt"
##   [4] "./names//yob1883.txt" "./names//yob1884.txt" "./names//yob1885.txt"
##   [7] "./names//yob1886.txt" "./names//yob1887.txt" "./names//yob1888.txt"
##  [10] "./names//yob1889.txt" "./names//yob1890.txt" "./names//yob1891.txt"
##  [13] "./names//yob1892.txt" "./names//yob1893.txt" "./names//yob1894.txt"
##  [16] "./names//yob1895.txt" "./names//yob1896.txt" "./names//yob1897.txt"
##  [19] "./names//yob1898.txt" "./names//yob1899.txt" "./names//yob1900.txt"
##  [22] "./names//yob1901.txt" "./names//yob1902.txt" "./names//yob1903.txt"
##  [25] "./names//yob1904.txt" "./names//yob1905.txt" "./names//yob1906.txt"
##  [28] "./names//yob1907.txt" "./names//yob1908.txt" "./names//yob1909.txt"
##  [31] "./names//yob1910.txt" "./names//yob1911.txt" "./names//yob1912.txt"
##  [34] "./names//yob1913.txt" "./names//yob1914.txt" "./names//yob1915.txt"
##  [37] "./names//yob1916.txt" "./names//yob1917.txt" "./names//yob1918.txt"
##  [40] "./names//yob1919.txt" "./names//yob1920.txt" "./names//yob1921.txt"
##  [43] "./names//yob1922.txt" "./names//yob1923.txt" "./names//yob1924.txt"
##  [46] "./names//yob1925.txt" "./names//yob1926.txt" "./names//yob1927.txt"
##  [49] "./names//yob1928.txt" "./names//yob1929.txt" "./names//yob1930.txt"
##  [52] "./names//yob1931.txt" "./names//yob1932.txt" "./names//yob1933.txt"
##  [55] "./names//yob1934.txt" "./names//yob1935.txt" "./names//yob1936.txt"
##  [58] "./names//yob1937.txt" "./names//yob1938.txt" "./names//yob1939.txt"
##  [61] "./names//yob1940.txt" "./names//yob1941.txt" "./names//yob1942.txt"
##  [64] "./names//yob1943.txt" "./names//yob1944.txt" "./names//yob1945.txt"
##  [67] "./names//yob1946.txt" "./names//yob1947.txt" "./names//yob1948.txt"
##  [70] "./names//yob1949.txt" "./names//yob1950.txt" "./names//yob1951.txt"
##  [73] "./names//yob1952.txt" "./names//yob1953.txt" "./names//yob1954.txt"
##  [76] "./names//yob1955.txt" "./names//yob1956.txt" "./names//yob1957.txt"
##  [79] "./names//yob1958.txt" "./names//yob1959.txt" "./names//yob1960.txt"
##  [82] "./names//yob1961.txt" "./names//yob1962.txt" "./names//yob1963.txt"
##  [85] "./names//yob1964.txt" "./names//yob1965.txt" "./names//yob1966.txt"
##  [88] "./names//yob1967.txt" "./names//yob1968.txt" "./names//yob1969.txt"
##  [91] "./names//yob1970.txt" "./names//yob1971.txt" "./names//yob1972.txt"
##  [94] "./names//yob1973.txt" "./names//yob1974.txt" "./names//yob1975.txt"
##  [97] "./names//yob1976.txt" "./names//yob1977.txt" "./names//yob1978.txt"
## [100] "./names//yob1979.txt" "./names//yob1980.txt" "./names//yob1981.txt"
## [103] "./names//yob1982.txt" "./names//yob1983.txt" "./names//yob1984.txt"
## [106] "./names//yob1985.txt" "./names//yob1986.txt" "./names//yob1987.txt"
## [109] "./names//yob1988.txt" "./names//yob1989.txt" "./names//yob1990.txt"
## [112] "./names//yob1991.txt" "./names//yob1992.txt" "./names//yob1993.txt"
## [115] "./names//yob1994.txt" "./names//yob1995.txt" "./names//yob1996.txt"
## [118] "./names//yob1997.txt" "./names//yob1998.txt" "./names//yob1999.txt"
## [121] "./names//yob2000.txt" "./names//yob2001.txt" "./names//yob2002.txt"
## [124] "./names//yob2003.txt" "./names//yob2004.txt" "./names//yob2005.txt"
## [127] "./names//yob2006.txt" "./names//yob2007.txt" "./names//yob2008.txt"
## [130] "./names//yob2009.txt" "./names//yob2010.txt" "./names//yob2011.txt"
## [133] "./names//yob2012.txt" "./names//yob2013.txt" "./names//yob2014.txt"
## [136] "./names//yob2015.txt" "./names//yob2016.txt" "./names//yob2017.txt"
## [139] "./names//yob2018.txt" "./names//yob2019.txt" "./names//yob2020.txt"
# let us read the first file
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# let us define a function to read a single SSA baby names file
readSsaBabyNames <- function(file,...){
  require(tidyverse)
  data <- read_csv(file,col_names = FALSE,show_col_types = FALSE) %>% 
    # extract the four-digit year from the file name, e.g., "yob1880.txt" -> 1880
    mutate(year=as.integer(str_replace_all(file,"[^0-9]","")))
  colnames(data) <- c("names","sex","count","year")
  return(data)
}

dat_year <- map_dfr(files, readSsaBabyNames)

# we only use names after 1970 that appear more than 10 times
dat_all <- dat_year %>%
  filter(year>1970) %>% 
  group_by(names,sex) %>% 
  summarise(count=sum(count)) %>% 
  # keep only names used more than 10 times
  filter(count>10) %>% 
  pivot_wider(names_from = "sex",values_from="count") %>% 
  replace_na(list(F=0,M=0)) %>% 
  # label a name as female if more than half of its recorded uses are female
  mutate(female=(F/(F+M)>.5)*1) %>% 
  mutate(flt1= str_extract(names,"^.")%>% tolower,
         flt2= str_extract(names,"^.{2}") %>% tolower,
         llt1= str_extract(names,".$")%>% tolower,
         llt2= str_extract(names,".{2}$")%>% tolower
         )
## `summarise()` has grouped output by 'names'. You can override using the `.groups` argument.
save(dat_year,dat_all,file="./ssa_baby_names.RData")

Split our data into training and test sets

Before we split our data further, let us take a look at it first.

top_letters <- dat_all %>% 
  group_by(flt1) %>% 
  summarise(fl1=n()) %>% 
  top_n(n=5) %>% 
  bind_cols(
    dat_all %>% 
      group_by(flt2) %>% 
      summarise(fl2=n()) %>% 
      top_n(n=5)
  ) %>% 
  bind_cols(
    dat_all %>% 
      group_by(llt1) %>% 
      summarise(ll1=n()) %>% 
      top_n(n=5)
  ) %>% 
  bind_cols(
    dat_all %>% 
      group_by(llt2) %>% 
      summarise(ll2=n()) %>% 
      top_n(n=5)
  )
## Selecting by fl1
## Selecting by fl2
## Selecting by ll1
## Selecting by ll2
knitr::kable(top_letters)
flt1    fl1    flt2    fl2    llt1    ll1      llt2    ll2
a       7685   da      1701   a       19482    ah      3136
j       5755   ja      2904   e       9336     an      2552
k       5694   ka      2183   h       4387     ia      3242
m       5173   ma      2884   i       4014     na      3915
s       5659   sh      2352   n       11613    on      2741

We are going to create a series of features based on the first/first-two and last/last-two letters to predict the gender of names. But in your own training, you should do some feature engineering and choose the most informative features.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# create dummies for the last letter, last two letters, first letter, and first two letters of names
fllt_d=predict(dummyVars(~llt1+llt2+flt1+flt2,data=dat_all),newdata=dat_all)%>%as.data.frame()
# This creates over 900 variables, and some of them have near-zero variance, so we need to remove them.
# identify near-zero-variance variables
nzv_test <- nearZeroVar(fllt_d,freqCut = 95/5, uniqueCut = 10, saveMetrics= TRUE,allowParallel = TRUE)
nzv <- nearZeroVar(fllt_d,freqCut = 95/5, uniqueCut = 10,allowParallel = TRUE)
fllt_nnzv<- fllt_d[, -nzv]
# create the full dataset with dummies
df=dat_all %>% 
  ungroup %>% 
  select(names,female) %>% 
  bind_cols(fllt_nnzv) %>% 
  filter(!is.na(female)) %>% 
  mutate(female=ifelse(female==1,"Y","N") %>% as.factor())

Let us finally split our data into training and test sets.

inTrain <- createDataPartition(
  y = df$female,
  ## the outcome data are needed
  p = .75,
  ## The percentage of data in the
  ## training set
  list = FALSE
)

train <- df[ inTrain,]
test  <- df[-inTrain,]

nrow(train)
## [1] 50610
nrow(test)
## [1] 16869
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
## 
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
## 
## Attaching package: 'gdata'
## The following objects are masked from 'package:dplyr':
## 
##     combine, first, last
## The following object is masked from 'package:purrr':
## 
##     keep
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
## The following object is masked from 'package:base':
## 
##     startsWith
# keep only the objects we still need and drop everything else (gdata::keep)
keep(dat_all,df,train,test,sure=TRUE)

Let us train an xgboost model to predict gender

You can check the caret package’s boosting methods here: https://topepo.github.io/caret/train-models-by-tag.html#boosting. Of course, you can fine-tune the hyper-parameters (see the sketch below), but here we use the default ones.
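
If you do want to tune, a hedged sketch of a small search grid looks like this (the grid values are illustrative, not recommendations):

# Illustrative tuning grid for method = "xgbTree"; pass it to caret::train()
# via tuneGrid, and caret picks the best cross-validated combination.
grid_tune <- expand.grid(
  nrounds          = c(100, 200),
  max_depth        = c(3, 6),
  eta              = c(0.1, 0.3),
  gamma            = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample        = c(0.8, 1)
)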

# K folds cross validation
# try parallel computing
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
cl <- makePSOCKcluster(3)
registerDoParallel(cl)

grid_default <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

train_control <- caret::trainControl(
  method = "cv",
  number = 3,
  verboseIter = FALSE, 
  allowParallel = TRUE 
)

xgb_base <- caret::train(
  female~., 
  data=train %>%  dplyr::select(-c(names)),
  trControl = train_control,
  tuneGrid = grid_default,
  method = "xgbTree",
  verbose = TRUE
)

stopCluster(cl)

save(xgb_base,file="./xgb_base.RData")

Let us check model performance, starting with the confusion matrix

load("./xgb_base.RData")
# check cf matrix
xgb_base
## eXtreme Gradient Boosting 
## 
## 50610 samples
##    16 predictor
##     2 classes: 'N', 'Y' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 33740, 33740, 33740 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8029243  0.5822704
## 
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning
##  held constant at a value of 1
## Tuning parameter 'subsample' was held
##  constant at a value of 1
confusionMatrix(xgb_base)
## Cross-Validated (3 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    N    Y
##          N 28.3 10.2
##          Y  9.5 52.0
##                             
##  Accuracy (average) : 0.8029

Let us check the most informative features

xgbImp <- varImp(xgb_base,scale = TRUE)
plot(xgbImp, top = 5)
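
If you want the numbers behind the plot, the scaled scores are stored in the importance element of the varImp object; a small sketch:

# Extract the top features as a data frame instead of a plot
xgbImp$importance %>%
  tibble::rownames_to_column("feature") %>%
  dplyr::arrange(dplyr::desc(Overall)) %>%
  head(5)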

Test model performance on the test set

# predict gender on the test set (drop the name and label columns first)
test_df <- test %>% dplyr::select(-c(names,female))
test_pred <- predict(xgb_base,newdata=test_df) %>% as.data.frame()
colnames(test_pred) <- "xgb_female"

data <- test %>% select(names,female) %>% bind_cols(test_pred)

data %>% 
  write_csv("./test_pred.csv",na="")

# check confusion matrix

confusionMatrix(test_pred$xgb_female,test$female)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    Y
##          N 4808 1683
##          Y 1570 8808
##                                           
##                Accuracy : 0.8072          
##                  95% CI : (0.8011, 0.8131)
##     No Information Rate : 0.6219          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5914          
##                                           
##  Mcnemar's Test P-Value : 0.04956         
##                                           
##             Sensitivity : 0.7538          
##             Specificity : 0.8396          
##          Pos Pred Value : 0.7407          
##          Neg Pred Value : 0.8487          
##              Prevalence : 0.3781          
##          Detection Rate : 0.2850          
##    Detection Prevalence : 0.3848          
##       Balanced Accuracy : 0.7967          
##                                           
##        'Positive' Class : N               
## 
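
Beyond accuracy, you may also want an ROC curve and AUC on the test set. A minimal sketch using the pROC package (an assumption: pROC is installed):

# Predicted class probabilities on the test set, then ROC/AUC via pROC
library(pROC)
test_prob <- predict(xgb_base, newdata = test_df, type = "prob")
roc_obj <- roc(response = test$female, predictor = test_prob$Y, levels = c("N","Y"))
auc(roc_obj)
plot(roc_obj)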

Code Challenge 1

Replicate the ML pipeline predicting baby names’ gender using only the top 5 most frequent first/last letters and first-two/last-two letters instead of all dummies.

Try other models, including a support vector machine, a penalized logit model, and naive Bayes. You can also try a random forest, but it takes a while to finish.

Generate your final prediction based on the SVM, logit, and NB methods using a majority vote (see the sketch at the end of this section).

Save a CSV file and report the confusion matrix.

Your code should be written in an RMarkdown file and generate a PDF file.

Email it to me before next Monday’s class.
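
As a hint for the majority-vote step above, with three “Y”/“N” factor predictions you can count the “Y” votes per name. A sketch assuming hypothetical prediction vectors svm_pred, logit_pred, and nb_pred on the test set:

# Majority vote across three hypothetical prediction vectors
votes <- (svm_pred == "Y") + (logit_pred == "Y") + (nb_pred == "Y")
final_pred <- factor(ifelse(votes >= 2, "Y", "N"), levels = c("N","Y"))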