Our Goal

This tutorial aims to help students build some basic knowledge of using R, including how to use Rstudio to create a project, version control management using git, how to access R via different ways (terminal, RGui, Rstudio, VScode, etc.), how to read and save files in R, how to do basic data wrangling, and understand basic regular expression.

This tutorial is built using Rmarkdown and the raw file is here:

Recap: Install R and Rstudio

If you still have not installed R and Rstudio, you shall follow the steps below. We are using Rstudio in this tutorial. Rstudio not only support R language, but also python, etc.

Before installing Rstudio, you need to Install R first. You can download R and other necessary software by clicking here https://cran.r-project.org/. You can choose the appropriate version for your system (e.g., windows, Mac). Be careful to follow its installing instruction, especially regarding those necessary software like xquartz if you are a mac user.

After this step, please go to RStudio website to download and install RStudio desktop. You can click here https://rstudio.com/products/rstudio/download/ and choose the free version.

Then, you can use install.packages function to install necessary R packages. But I suggest that you copy and paste following code to install some common packages for data processing and visualization. You can add any packages you want to install by defining the “packages” variable.

Packages are developed by the R user community to achieve different goals. For instance, tidyverse is a wrapper for a lot of different packages designed for data science, including dplyr, tidyr, readr,purr, tibble, etc. Click here for more details https://www.tidyverse.org/packages/

# pacman is a package help you manage packages: load installed packages; or install and load packages if you have not installed them
# you can use install.packages("package_name") to install necessary packages
# in this tutorial, we need the following packages:
if (!requireNamespace("pacman"))
  install.packages('pacman')
library(pacman)
packages<-c("tidyverse","haven","xlsx")
p_load(packages,character.only = TRUE)

Some Basics about R

Access R

To use R, there are multiple ways.

  • You can use RGui, like this:

RGui

  • You can use terminal or cmdline to access R, like this:

Terminal,cmdline

  • Of course, we can use Rstudio to access R, obviously.

Rstudio

When we use Rstudio, you can use the right upper panel (console) to use R, but you can also use the left panel by openning a R script or R markdown to use R. I strongly suggest that you should write R script or Rmarkdown files to execute your codes instead of using R console directly.

Manage Your Codes via Rstudio Project

We often develop our own habits and systems to organize our project and files. But in this course, I suggest that we use Rstudio project to organize our codes. We clike File/New Project/New Directory, like this:

Rstudio Project

The great thing with Rstudio project is you can use it with git version control system to track your codes. Another advantage of using Rstudio project is that it is super convenient to share your codes. You can share the entire project folder with your collaborators and they don’t need to change anything. Just click the proj file to start Rstudio and run your codes on their end to replicate what you have done.

Next we will use Rstudio to access R and run our demos.

Manage your libraries

R is a great open source software for statistical computing and graphic visualization. I use R for data wrangling and data visualization. The advantage of R is that it has a great community with users contributing to R community. A lot of user-written packages like tidyverse and ggplot2 which allow users to conveniently achieve their goals.

To use R on top of its base, we need to install packages for specific tasks. There are different ways to install R packages. But let us figure out where your library is (storing your packages) and see what packages have been loaded into your environment.

Check your library

.libPaths() # get library location
## [1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library"
library()   # see all packages installed
search()    # see packages currently loaded
##  [1] ".GlobalEnv"        "package:xlsx"      "package:haven"    
##  [4] "package:forcats"   "package:stringr"   "package:dplyr"    
##  [7] "package:purrr"     "package:readr"     "package:tidyr"    
## [10] "package:tibble"    "package:ggplot2"   "package:tidyverse"
## [13] "package:pacman"    "package:knitr"     "package:stats"    
## [16] "package:graphics"  "package:grDevices" "package:utils"    
## [19] "package:datasets"  "package:methods"   "Autoloads"        
## [22] "package:base"

Install your packages

We can use the function install.packages() to install your target package.

Let us say, we want our tidyverse… It takes a while, since it will install all these dependencies.

install.packages("tidyverse")

You can also use devtools to install packages from github, etc.

devtools::install_github("Your target package path")

Of course, you can use pacman package as shown before.

check your current working directory

If you are not using Rstudio project to organize your codes, then typically what you have to do is to check your working directory. Now your working directory should be under your R project folder.

getwd()
## [1] "/Volumes/GoogleDrive/My Drive/SBU/Teaching/Spring2022/CSS/CSS-Tutorial/Lab2"

You can use setwd() to change your working directory

# not running the code
wdir <- "Your Dir"
setwd(wdir)

Now you have some basic understanding about how to start your R. Next let us move to introduce some basic R functions.

Long time ago, I wrote some a post on how to write good r code. You can take a look here:

https://yongjunzhang.com/posts/2018/02/google-r-style-guide/

Assign Values to Variables

In R, you can use the assignment operator <- to create new variables. I also see people use the equal sign (“=”) to do that. But for code consistency, we should stick to the <- sign.

If you are using windows system, you can press “alt” and “-” simultaneously to type <- In mac, I believe it is opt and -

# One example of assigning values to variables

# let us create a variable "a" and assign a value of 5 to it.
a <- 5
a
## [1] 5
# let us create a variable b and assign a value 100 to it

b <- 100
b
## [1] 100
# you can also assign a string

c <- "hello world"
c
## [1] "hello world"

Arithmetic and Logical Operators

Arithmetic Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
Logical Operator Description
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to

Let us see some examples

# we can do some operation in R
# let us create a new variable d that stores a and b
d <- a + b
d
## [1] 105
# let do other arithmetic operations
e <- a - b
e
## [1] -95
f <- a * b
f
## [1] 500
g <- a ** b
g
## [1] 7.888609e+69
# but you cannot do this for different data types
# h <- a + c
# h

# let us try some logical operation
# Note that a = 5
a == 5
## [1] TRUE
a <= 100
## [1] TRUE
a !=100
## [1] TRUE
# as you see here, it returns TRUE OR FALSE

Data Types

R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.

Types Examples
scalars a <- 1; b<-“hello word”; c <- (a ==1)
vectors a <- c(1, 2, 3, 4); b <- c (“a”, “b”, “c”)
matrices

let us see a 2 by 3 matrix:

a <- matrix(c(2,4,3,1,5,7),#the data elements
nrow=2,#number of rows
ncol=3,#number of columns
byrow=TRUE) #fill matrix by rows

data frames

A data frame is used for storing data tables. It is a list of vectors of equal length. For example, df is a data frame containing three vectors n, s,b.

n=c(2,3,5)
s=c(“aa”,“bb”,“cc”)
b=c(TRUE,FALSE,TRUE)
df=data.frame(n,s,b)

lists

Lists contain elements of different types like − numbers, strings, vectors and another list inside it. A list can also contain a matrix or a function as its elements. List is created using list() function.

list_example<- list(“Red”, “Green”, c(1,2,3), TRUE, 520, 10000)

let us see some examples:

# scalars
a <-  2
class(a)
## [1] "numeric"
# vectors
b <- c("a","b","c")
b
## [1] "a" "b" "c"
class(b)
## [1] "character"
# matrix
c <- matrix(c(2,4,3,1,5,7),nrow=2, ncol=3,byrow=TRUE)
c
##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7
class(c)
## [1] "matrix" "array"
# data frame
n <- c(2,3,5)
s <- c("aa","bb","cc")
b <- c(TRUE,FALSE,TRUE)
df <- data.frame(n,s,b)
df
##   n  s     b
## 1 2 aa  TRUE
## 2 3 bb FALSE
## 3 5 cc  TRUE
class(df)
## [1] "data.frame"
# lists
list_example <- list("Red", "Green", c(1,2,3), TRUE, 520, 10000)
list_example
## [[1]]
## [1] "Red"
## 
## [[2]]
## [1] "Green"
## 
## [[3]]
## [1] 1 2 3
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] 520
## 
## [[6]]
## [1] 10000
class(list_example)
## [1] "list"

Functions

In R you can define a function to achieve certain goals.

# let us define a function that prints out the texts you input...

print_text <- function(x){
  print(x)
}

x <- "Hello Word, R."
print_text(x)
## [1] "Hello Word, R."

One rule to beautify your codes is to abstract your codes using R functions. R has many built-in functions (R base), but you can also get a lot more functions by installing new packages.

Let us write a function of computing covariance of two variables.

CalculateSampleCovariance <- function(x, y, verbose = TRUE) {
  # Computes the sample covariance between two vectors.
  #
  # Args:
  #   x: One of two vectors whose sample covariance is to be calculated.
  #   y: The other vector. x and y must have the same length, greater than one,
  #      with no missing values.
  #   verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE.
  #
  # Returns:
  #   The sample covariance between x and y.
  n <- length(x)
  # Error handling
  if (n <= 1 || n != length(y)) {
    stop("Arguments x and y have different lengths: ",
         length(x), " and ", length(y), ".")
  }
  if (TRUE %in% is.na(x) || TRUE %in% is.na(y)) {
    stop(" Arguments x and y must not have missing values.")
  }
  covariance <- var(x, y)
  if (verbose)
    cat("Covariance = ", round(covariance, 4), ".\n", sep = "")
  return(covariance)
}

getAnywhere(CalculateSampleCovariance)
## A single object matching 'CalculateSampleCovariance' was found
## It was found in the following places
##   .GlobalEnv
## with value
## 
## function(x, y, verbose = TRUE) {
##   # Computes the sample covariance between two vectors.
##   #
##   # Args:
##   #   x: One of two vectors whose sample covariance is to be calculated.
##   #   y: The other vector. x and y must have the same length, greater than one,
##   #      with no missing values.
##   #   verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE.
##   #
##   # Returns:
##   #   The sample covariance between x and y.
##   n <- length(x)
##   # Error handling
##   if (n <= 1 || n != length(y)) {
##     stop("Arguments x and y have different lengths: ",
##          length(x), " and ", length(y), ".")
##   }
##   if (TRUE %in% is.na(x) || TRUE %in% is.na(y)) {
##     stop(" Arguments x and y must not have missing values.")
##   }
##   covariance <- var(x, y)
##   if (verbose)
##     cat("Covariance = ", round(covariance, 4), ".\n", sep = "")
##   return(covariance)
## }
(x <- seq(1,10,1))
##  [1]  1  2  3  4  5  6  7  8  9 10
(y <- seq(21,40,2))
##  [1] 21 23 25 27 29 31 33 35 37 39
CalculateSampleCovariance(x,y,verbose = TRUE)
## Covariance = 18.3333.
## [1] 18.33333

So a typical function consits of the following elements:

FunctionName <- function(arg1,arg2,arg3=TRUE,...){
  # MAIN FUNCTION BODY
  # You Do SOMETHING HERE
  # Return Your Value or Values
}
  • Functions should contain a comments section immediately below the function definition line. These comments should consist of a one-sentence description of the function; a list of the function’s arguments, denoted by Args:, with a description of each (including the data type); and a description of the return value, denoted by Returns:. The comments should be descriptive enough that a caller can use the function without reading any of the function’s code.

  • FunctionName: You have to assign a function name to your function. You can reference my post to see how to write consistent R codes.

  • arg1, arg2, arg3: these are your function’s arguments. Could be any R objects such as numbers, strings, arrays, data frames, and other valid arguments.

  • Some arguments have default values, for instance, arg. You have to supply a value for arguments without a default when you run.

  • The ‘…’ argument: We can also use …, or ellipsis, in the function, which allows us pass arguments to other funsions you are using.

  • Main Function body: The codes specified in {} are called everytime when you running it. It could be very long or short. Also you can use other functions in your function.

  • Return value: The last line of the code is the value that will be returned by the function.

Use tryCatch to manage errors in expression

tryCatch() evaluates an expression to catch exceptions. The tryCatch() function allows R users to handle errors like this: if(error), then(do this).

The basic syntax is:

tryCatch({
# Parameters
# expression: The expression to be evaluated.
# …: A catch list of named expressions. The expression with the same name as the class of the Exception thrown when evaluating an expression.
# finally: An expression that is guaranteed to be called even if the expression generates an exception.
# envir: The environment in which the caught expression is to be evaluated.
 expression # you do something here, e.g., your function
}, warning = function(w) {
 warning-handler-code
}, error = function(e) {
 error-handler-code
}, finally = {
 cleanup-code
}
)

Let us use our specified function as an example:

(x <- seq(1,10,1))
(y <- seq(100,200,2))
CalculateSampleCovariance(x,y,verbose = TRUE)
#After we run this code, it reports errors. We have a line of code to handle the length inconsistency and stop the function. But if you are running a large chunk of the codes and you don't want your codes to break... 
(x <- seq(1,10,1))
##  [1]  1  2  3  4  5  6  7  8  9 10
(y <- seq(100,200,2))
##  [1] 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136
## [20] 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174
## [39] 176 178 180 182 184 186 188 190 192 194 196 198 200
tryCatch({CalculateSampleCovariance(x,y,verbose = TRUE)},
 error = function(e) print("You can't calculate covariance with two different length vectors"))
## [1] "You can't calculate covariance with two different length vectors"
# the code does not break

Loops

for loop

A for loop is used to apply the same function calls to a collection of objects. R has a family of loops. Usually we should avoid for loop when you are coding. But in this tutorial, we will use for-loop as an example.

# let us say you have a list of texts that you want to print out.
texts <- rep("this is a test",10)
# now you want to use print function to print out the content..
# one way you can do is to mannually specify all texts
# like print(texts[1])... but it is tedious...
# we can do a for loop to print out all texts easily

for(text in texts){
  print(text)
}
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"
## [1] "this is a test"

we should avoid loop functions if possible as it is every expensive. Of course, you can also use apply family to avoid loops. But we are not going to discuss it. I prefer using map family functions.

map family

Let us use reading files as a demo. If you need to read a list of files into R from a folder, so here is what you want to do:

  • get a list of files that needs to be read
  • read the first file
  • read the second file
  • read the nth file

But this is very redundant and time consuming.

Here is what you are actually doing.

  • get a list of files that needs to be read
  • define a function of reading the file
  • apply the function to all the files.

Let us read U.S. SSA Baby Names

US Baby Names Example

files <- list.files(path = "./names/",pattern = ".txt",full.names = TRUE)
files
##   [1] "./names//yob1880.txt" "./names//yob1881.txt" "./names//yob1882.txt"
##   [4] "./names//yob1883.txt" "./names//yob1884.txt" "./names//yob1885.txt"
##   [7] "./names//yob1886.txt" "./names//yob1887.txt" "./names//yob1888.txt"
##  [10] "./names//yob1889.txt" "./names//yob1890.txt" "./names//yob1891.txt"
##  [13] "./names//yob1892.txt" "./names//yob1893.txt" "./names//yob1894.txt"
##  [16] "./names//yob1895.txt" "./names//yob1896.txt" "./names//yob1897.txt"
##  [19] "./names//yob1898.txt" "./names//yob1899.txt" "./names//yob1900.txt"
##  [22] "./names//yob1901.txt" "./names//yob1902.txt" "./names//yob1903.txt"
##  [25] "./names//yob1904.txt" "./names//yob1905.txt" "./names//yob1906.txt"
##  [28] "./names//yob1907.txt" "./names//yob1908.txt" "./names//yob1909.txt"
##  [31] "./names//yob1910.txt" "./names//yob1911.txt" "./names//yob1912.txt"
##  [34] "./names//yob1913.txt" "./names//yob1914.txt" "./names//yob1915.txt"
##  [37] "./names//yob1916.txt" "./names//yob1917.txt" "./names//yob1918.txt"
##  [40] "./names//yob1919.txt" "./names//yob1920.txt" "./names//yob1921.txt"
##  [43] "./names//yob1922.txt" "./names//yob1923.txt" "./names//yob1924.txt"
##  [46] "./names//yob1925.txt" "./names//yob1926.txt" "./names//yob1927.txt"
##  [49] "./names//yob1928.txt" "./names//yob1929.txt" "./names//yob1930.txt"
##  [52] "./names//yob1931.txt" "./names//yob1932.txt" "./names//yob1933.txt"
##  [55] "./names//yob1934.txt" "./names//yob1935.txt" "./names//yob1936.txt"
##  [58] "./names//yob1937.txt" "./names//yob1938.txt" "./names//yob1939.txt"
##  [61] "./names//yob1940.txt" "./names//yob1941.txt" "./names//yob1942.txt"
##  [64] "./names//yob1943.txt" "./names//yob1944.txt" "./names//yob1945.txt"
##  [67] "./names//yob1946.txt" "./names//yob1947.txt" "./names//yob1948.txt"
##  [70] "./names//yob1949.txt" "./names//yob1950.txt" "./names//yob1951.txt"
##  [73] "./names//yob1952.txt" "./names//yob1953.txt" "./names//yob1954.txt"
##  [76] "./names//yob1955.txt" "./names//yob1956.txt" "./names//yob1957.txt"
##  [79] "./names//yob1958.txt" "./names//yob1959.txt" "./names//yob1960.txt"
##  [82] "./names//yob1961.txt" "./names//yob1962.txt" "./names//yob1963.txt"
##  [85] "./names//yob1964.txt" "./names//yob1965.txt" "./names//yob1966.txt"
##  [88] "./names//yob1967.txt" "./names//yob1968.txt" "./names//yob1969.txt"
##  [91] "./names//yob1970.txt" "./names//yob1971.txt" "./names//yob1972.txt"
##  [94] "./names//yob1973.txt" "./names//yob1974.txt" "./names//yob1975.txt"
##  [97] "./names//yob1976.txt" "./names//yob1977.txt" "./names//yob1978.txt"
## [100] "./names//yob1979.txt" "./names//yob1980.txt" "./names//yob1981.txt"
## [103] "./names//yob1982.txt" "./names//yob1983.txt" "./names//yob1984.txt"
## [106] "./names//yob1985.txt" "./names//yob1986.txt" "./names//yob1987.txt"
## [109] "./names//yob1988.txt" "./names//yob1989.txt" "./names//yob1990.txt"
## [112] "./names//yob1991.txt" "./names//yob1992.txt" "./names//yob1993.txt"
## [115] "./names//yob1994.txt" "./names//yob1995.txt" "./names//yob1996.txt"
## [118] "./names//yob1997.txt" "./names//yob1998.txt" "./names//yob1999.txt"
## [121] "./names//yob2000.txt" "./names//yob2001.txt" "./names//yob2002.txt"
## [124] "./names//yob2003.txt" "./names//yob2004.txt" "./names//yob2005.txt"
## [127] "./names//yob2006.txt" "./names//yob2007.txt" "./names//yob2008.txt"
## [130] "./names//yob2009.txt" "./names//yob2010.txt" "./names//yob2011.txt"
## [133] "./names//yob2012.txt" "./names//yob2013.txt" "./names//yob2014.txt"
## [136] "./names//yob2015.txt" "./names//yob2016.txt" "./names//yob2017.txt"
## [139] "./names//yob2018.txt" "./names//yob2019.txt" "./names//yob2020.txt"
# let us read the first file
library(tidyverse)
baby1 <- read_csv(files[1],col_names = FALSE)
head(baby1)
## # A tibble: 6 × 3
##   X1        X2       X3
##   <chr>     <chr> <dbl>
## 1 Mary      F      7065
## 2 Anna      F      2604
## 3 Emma      F      2003
## 4 Elizabeth F      1939
## 5 Minnie    F      1746
## 6 Margaret  F      1578
colnames(baby1) <- c("names","sex","count")

# let us say we want to define a read_us_baby function

readSsaBabyNames <- function(file,...){
  require(tidyverse)
  data <- read_csv(file,col_names = FALSE,show_col_types = FALSE) %>% 
    mutate(year=str_replace_all(file,"[^0-9]",""))
  colnames(data) <- c("names","sex","count","year")
  return(data)
}
readSsaBabyNames(files[1])
## # A tibble: 2,000 × 4
##    names     sex   count year 
##    <chr>     <chr> <dbl> <chr>
##  1 Mary      F      7065 1880 
##  2 Anna      F      2604 1880 
##  3 Emma      F      2003 1880 
##  4 Elizabeth F      1939 1880 
##  5 Minnie    F      1746 1880 
##  6 Margaret  F      1578 1880 
##  7 Ida       F      1472 1880 
##  8 Alice     F      1414 1880 
##  9 Bertha    F      1320 1880 
## 10 Sarah     F      1288 1880 
## # … with 1,990 more rows
start_time <- Sys.time()
dat_baby <- map_dfr(files, readSsaBabyNames)
end_time <- Sys.time()
(end_time - start_time)
## Time difference of 4.17626 secs

You can check here to see a more detailed description of map function: https://purrr.tidyverse.org/reference/map.html

Parallelising your code

require(parallel)
no_cores <- detectCores() - 1
start_time <- Sys.time()
dat_baby <- mclapply(files,readSsaBabyNames,mc.cores = no_cores) #same thing using all the cores in your machine
end_time <- Sys.time()
(end_time - start_time)
## Time difference of 2.002082 secs
head(dat_baby)
## [[1]]
## # A tibble: 2,000 × 4
##    names     sex   count year 
##    <chr>     <chr> <dbl> <chr>
##  1 Mary      F      7065 1880 
##  2 Anna      F      2604 1880 
##  3 Emma      F      2003 1880 
##  4 Elizabeth F      1939 1880 
##  5 Minnie    F      1746 1880 
##  6 Margaret  F      1578 1880 
##  7 Ida       F      1472 1880 
##  8 Alice     F      1414 1880 
##  9 Bertha    F      1320 1880 
## 10 Sarah     F      1288 1880 
## # … with 1,990 more rows
## 
## [[2]]
## # A tibble: 1,934 × 4
##    names     sex   count year 
##    <chr>     <chr> <dbl> <chr>
##  1 Mary      F      6919 1881 
##  2 Anna      F      2698 1881 
##  3 Emma      F      2034 1881 
##  4 Elizabeth F      1852 1881 
##  5 Margaret  F      1658 1881 
##  6 Minnie    F      1653 1881 
##  7 Ida       F      1439 1881 
##  8 Annie     F      1326 1881 
##  9 Bertha    F      1324 1881 
## 10 Alice     F      1308 1881 
## # … with 1,924 more rows
## 
## [[3]]
## # A tibble: 2,127 × 4
##    names     sex   count year 
##    <chr>     <chr> <dbl> <chr>
##  1 Mary      F      8148 1882 
##  2 Anna      F      3143 1882 
##  3 Emma      F      2303 1882 
##  4 Elizabeth F      2186 1882 
##  5 Minnie    F      2004 1882 
##  6 Margaret  F      1821 1882 
##  7 Ida       F      1673 1882 
##  8 Alice     F      1542 1882 
##  9 Bertha    F      1508 1882 
## 10 Annie     F      1492 1882 
## # … with 2,117 more rows
## 
## [[4]]
## # A tibble: 2,084 × 4
##    names     sex   count year 
##    <chr>     <chr> <dbl> <chr>
##  1 Mary      F      8012 1883 
##  2 Anna      F      3306 1883 
##  3 Emma      F      2367 1883 
##  4 Elizabeth F      2255 1883 
##  5 Minnie    F      2035 1883 
##  6 Margaret  F      1881 1883 
##  7 Bertha    F      1681 1883 
##  8 Ida       F      1634 1883 
##  9 Annie     F      1589 1883 
## 10 Clara     F      1548 1883 
## # … with 2,074 more rows
## 
## [[5]]
## # A tibble: 2,297 × 4
##    names     sex   count year 
##    <chr>     <chr> <dbl> <chr>
##  1 Mary      F      9217 1884 
##  2 Anna      F      3860 1884 
##  3 Emma      F      2587 1884 
##  4 Elizabeth F      2549 1884 
##  5 Minnie    F      2243 1884 
##  6 Margaret  F      2142 1884 
##  7 Ida       F      1882 1884 
##  8 Clara     F      1852 1884 
##  9 Bertha    F      1789 1884 
## 10 Annie     F      1739 1884 
## # … with 2,287 more rows
## 
## [[6]]
## # A tibble: 2,294 × 4
##    names     sex   count year 
##    <chr>     <chr> <dbl> <chr>
##  1 Mary      F      9128 1885 
##  2 Anna      F      3994 1885 
##  3 Emma      F      2728 1885 
##  4 Elizabeth F      2582 1885 
##  5 Margaret  F      2204 1885 
##  6 Minnie    F      2178 1885 
##  7 Clara     F      1910 1885 
##  8 Bertha    F      1860 1885 
##  9 Ida       F      1854 1885 
## 10 Annie     F      1703 1885 
## # … with 2,284 more rows

Pipe coding

Next we are going to discuss pipe coding in R. The pipe operator, %>%, has been a longstanding feature of the magrittr package for R. You can check here:https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html The following codes are from magrittr readme.rmd. ### Basic piping

  • x %>% f is equivalent to f(x)
  • x %>% f(y) is equivalent to f(x, y)
  • x %>% f %>% g %>% h is equivalent to h(g(f(x)))

Here, “equivalent” is not technically exact: evaluation is non-standard, and the left-hand side is evaluated before passed on to the right-hand side expression. However, in most cases this has no practical implication.

The argument placeholder

  • x %>% f(y, .) is equivalent to f(y, x)
  • x %>% f(y, z = .) is equivalent to f(y, z = x)

Re-using the placeholder for attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in a nested expressions magrittr will still apply the first-argument rule. The reason is that in most cases this results more clean code.

x %>% f(y = nrow(.), z = ncol(.)) is equivalent to f(x, y = nrow(x), z = ncol(x))

The behavior can be overruled by enclosing the right-hand side in braces:

x %>% {f(y = nrow(.), z = ncol(.))} is equivalent to f(y = nrow(x), z = ncol(x))

Building (unary) functions

Any pipeline starting with the . will return a function which can later be used to apply the pipeline to values. Building functions in magrittr is therefore similar to building other values.

f <- . %>% cos %>% sin 
# is equivalent to 
f <- function(.) sin(cos(.)) 

A short demo to process data using piping

# Let us use baby names as an example
# here is our codes:
# we first read all txt files into R
# we want to lowercase all names
# we want to get the overall counts for each name
# we sort names from largest to smallest
# we assign the results to baby_name_counts ( a data frame)

baby_name_counts <- map_dfr(files, readSsaBabyNames) %>% 
  mutate(names = tolower(names)) %>% 
  group_by(names) %>% 
  summarise(counts=sum(count)) %>% 
  arrange(desc(counts))
head(baby_name_counts)
## # A tibble: 6 × 2
##   names    counts
##   <chr>     <dbl>
## 1 james   5213689
## 2 john    5163958
## 3 robert  4849738
## 4 michael 4405274
## 5 william 4159868
## 6 mary    4145486
# your codes are more readable???

Read and Save Files in R

Now let us talk about how to read and save files in R, though we have covered some,like using read_csv function from tidyverse to read comma separated files. Tidyverse has a series of functions to read files, such as read_tsv, read_delim, etc; R base has a sereis of functions as well, like read.csv, read.table; foreign package has functions like read.spss, read.data. Haven packages has also some useful functions. Let us go through some functions

You can click here for more details: https://readr.tidyverse.org/reference/read_delim.html

Load R objectives

You can use load() function to read r data files

# I saved ssa baby names in a RData file

load("./ssa_baby.RData") # the input is dat_baby with four variables, including names, sex, count, and year.

Read Delimited Files

Tidyverse has read_delim and read_csv functions.

# Arguments
# file: Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded. Remote gz files can also be automatically downloaded and decompressed.
# delim: Single character used to separate fields within a record.
# quote: Single character used to quote strings.
read_delim(
  file,
  delim = NULL,
  quote = "\"",
  escape_backslash = FALSE,
  escape_double = TRUE,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  comment = "",
  trim_ws = FALSE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

read_csv(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

Read execl spreadsheets

You can use read_excel to read xlsx files.

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
  col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = readxl_progress(), .name_repair = "unique")

read_xls(path, sheet = NULL, range = NULL, col_names = TRUE,
  col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = readxl_progress(), .name_repair = "unique")

read_xlsx(path, sheet = NULL, range = NULL, col_names = TRUE,
  col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = readxl_progress(), .name_repair = "unique")

Here is the cheatsheet to import data in R: https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf

Basic Regular Expression

Play with Regular Expression

Next we learn about the basic skills to understand, recognize, and write your own regular expression codes.

  1. What is regular expression?
  2. Why should we learn RegEx?
  3. How to write your own RegEx? If you’re interested in text analysis or broad NLP, RegEX would be one element of your necessary toolkit.

What is REGULAR EXPRESSION?

Regular expressions, or RegEXes, are an abstract way to express patterns that match strings. It is widely used in natural language processing, for instance, to find and replace certain strings with certain patterns.

One simple example is, if you are a data scientist, to create a variable that contains all email address from a large amount of texual data. You can use certain email name pattern to capture all potential email addresses.

require(glue)
# We use stringr package's some functions for demo
# Here is an email example
email_text = "Josh Zhang's email is example@email.com"
# RegEx for email address
email_regex = "[a-z]{7}@.*$"
# extract the address
email_search = str_extract_all(email_text,email_regex)
print(glue("we have a match: ", email_search[[1]]))
## we have a match: example@email.com

Here, we used str_extract_all function to search pattern within the email_text. The method returns a match object if the search is successful. You can access the matched text like email_search[[1]]

Why should we use or learn RegEX?

If you a data scientist or quantitative scholar, most of your time is to construct some databases based on a large amount of raw data. Obviously, it is not plausible to do manual coding. Even if you could, it would be very expensive and time costly in the big data era. To improve our efficiency, it is better for us to become a master of RegEX.

There are couple of ways to generate RegEX automatically using third party algorithms. For instance, you can use RegEx tester tools such as regex101https://regex101.com/.

But before doing that, you have to be able to understand the meaning of the RegEX pattern.

How to write your own RegEX?

There are certain basic RegEXes we need to familier with, including anchors, quantifiers, operators, character classes,boundaries,flags, groupings,back-referencing, look forward and backward,etc.We will go through these topics one by one.

Let us first see some Sepcial Characters

Regular expressions can contain both special and ordinary characters.

Most ordinary characters, like ‘A’, ‘a’, or ‘0’, are the simplest regular expressions; they simply match themselves.

Special characters are characters that are interpreted in a special way by a RegEx engine.

The special characters are:

[] . ^ $ * + ? {} () \ |

[ ] - Square brackets

Square brackets specifies a set of characters you want to match.

For instance, [ABCabc] matches if the string contains any of the A, B, C, a, b, or c.

You can also specify a range of characters using - inside square brackets.

[A-Ca-c] is the same as [ABCabc].

[1-9] is the same as [123456789].

[0-38] is the same as [01238].

You can use caret ^ symbol at the start of a square-bracket to exclude certain characters (equate to saying “NOT SOME CHARS”).

[^ABCabc] : any character except A OR B OR C OR a OR b OR c.

[^0-9] : any non-digit character.

.- Dot

The dot symbol . matches any single character (except newline ‘\n’).

^ - Caret

The caret symbol ^ checks if a string starts with a certain character.

$ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

* - Star

The star symbol * matches zero or more occurrences of the pattern left to it.

+ - Plus

The plus symbol + matches one or more occurrences of the pattern left to it.

? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern left to it

{} - Braces

Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern left to it.

| - Alternation

Vertical bar | is used for alternation (or operator).

() - Group

Parentheses () is used to group sub-patterns. For example, (a|b|c)xz match any string that matches either a or b or c followed by xz

\ - Backslash

Backlash \ is used to escape various characters including all metacharacters. For example,

Here, \$ is not interpreted by a RegEx engine in a special way.

If you are unsure if a character has special meaning or not, you can put \ in front of it. This makes sure the character is not treated in a special way.

let check some special sequence

Special sequences make commonly used patterns easier to write. Here’s a list of special sequences:

\A - Matches if the specified characters are at the start of a string.

\b - Matches if the specified characters are at the beginning or end of a word.

\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

\d - Matches any decimal digit. Equivalent to [0-9]

\D - Matches any non-decimal digit. Equivalent to [^0-9]

\s - Matches where a string contains any whitespace character. Equivalent to [\t\n\r\f\v].

\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

\w - Matches any alphanumeric character (digits and alphabets). Equivalent to [a-zA-Z0-9_]. By the way, underscore _ is also considered an alphanumeric character.

\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

\Z - Matches if the specified characters are at the end of a string.

Anchors: ^, $, \b, \B

All the patterns youe’ve seen so far will find a match anywhere within a string, which is usually - but not always - what you want. This is the purpose of an anchor; to make sure that you are at a certain boundary before you continue the match.

The caret symbol ^ matches the beginning of a line, and the dollar sign $ matches the end of a line.

The other two anchors are \b and \B, which stand for a “word boundary” and “non-word boundary”. Here is an exmaple:

\bame will capture “i am american”, but not “i am an blamer”.

\bnational will capture “national flag is abcde” but not “internatonal organizations are abbbb”.

**Quantifiers: {}, +, *, ?

These symbols are quantifiers used in regEX to indicate the repetition or the frequency of certain characters

{a,b} a times at least, b times at most

+ at least once

* zero times or more

? zero times or once

Greedy Match or Lazy Match

We use * or ? to do greedy (longest search) or lazy match (shortest search).

For instance, pattern “i.*nat” will capture i am a national representative in an internat” in the string “i am a national representative in an international organization”.

But “i.*?nat” will only capture “i am a nat”.

Let us do a quick test

r”(\d|[a-c]){2}.\? .\$“

# A sample text string where regular expression is searched. 
string = "STONY BROOK UNIVERSITY ZIPCODE IS 11794, MY ID IS 12398774, YOU ID IS 8966AGVGG"
# A sample regular expression to find digits. 
pattern = '\\d+'            
match = str_extract_all(string,pattern)
print(match) 
## [[1]]
## [1] "11794"    "12398774" "8966"

The End