Data types

The basic data types in R are logical, integer, double, and character.
- Use the typeof() function to check the type of a variable.

typeof(1.1)

## [1] "double"

typeof("a")

## [1] "character"

typeof(TRUE)

## [1] "logical"

typeof(4L)

## [1] "integer"

typeof(4)

## [1] "double"

logical

TRUE/T
FALSE/F

if(T){
  print(TRUE)
}

## [1] TRUE

typeof(FALSE)

## [1] "logical"

is.logical(T)

## [1] TRUE

integer

1:10 represents integer sequence 1 to 10.

aseq <- 1:10 ## <- is the assignment operator
typeof(aseq)

## [1] "integer"

The L suffix marks a number as an integer: 6L is the integer 6.

aint = 6L
is.integer(aint)

## [1] TRUE

In R, a number written without the L suffix, such as 6, is a double rather than an integer.

is.integer(6)

## [1] FALSE

Assign values

a = 1.6; print(a) ## assign 1.6 to a

## [1] 1.6

b <- 1.6; print(b) ## assign 1.6 to b

## [1] 1.6

1.6 -> d; print(d) ## assign 1.6 to d

## [1] 1.6

Difference between = and <-

<- can only be used for value assignment.
= can be used both for value assignment and for naming function arguments.
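
A small sketch of the difference, assuming a fresh session. The variable name tmp_vals is hypothetical and used only for illustration. Inside a function call, = names an argument, while <- assigns first and then passes the value along.

```r
# `=` inside a call names the argument x; no variable is created.
median(x = c(4, 1, 7))
exists("tmp_vals")               # FALSE in a fresh session

# `<-` inside a call assigns tmp_vals, then passes its value to median().
median(tmp_vals <- c(4, 1, 7))
tmp_vals                         # the vector now exists in the workspace
```

Because of this side effect, <- is usually kept for assignment statements and = for naming arguments.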

double

typeof(1.4142)

## [1] "double"

is.double(pi)

## [1] TRUE

is.double(0L)

## [1] FALSE

is.double(0L + 1.5)

## [1] TRUE

character

You can create a character string using either single quotes or double quotes.

acharacter <- "I like Biostatistical computing"
typeof(acharacter)

## [1] "character"

bcharacter <- 'You like Biostatistical computing'
is.character(bcharacter)

## [1] TRUE

When a single quote is part of your string, the simplest option is to enclose the string in double quotes.

ccharacter <- "He doesn't like Biostatistical computing"
print(ccharacter)

## [1] "He doesn't like Biostatistical computing"
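
As an alternative sketch, the inner single quote can also be escaped with a backslash when the string itself is written in single quotes. The name dcharacter is hypothetical.

```r
# Escape the inner single quote with a backslash inside a single-quoted string.
dcharacter <- 'He doesn\'t like Biostatistical computing'

# Both spellings produce the same string.
identical(dcharacter, "He doesn't like Biostatistical computing")
```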

1d Vector

A vector is the basic data structure in R. There are two types of vectors:
- Atomic vector: all elements of an atomic vector must be of the same type.
- List: elements of a list can be of different types.
Atomic vectors are usually created with c(), short for combine:
- dbl_var <- c(1, 23.1, 4.2)
- int_var <- c(1L, 11L, 6L)
- log_var <- c(TRUE, FALSE, T, F)
- chr_var <- c("I", "like Biostatistical computing")

Commonly used vector functions

Functions	Meaning
length(x)	Number of elements in x
unique(x)	Unique elements of x
sort(x)	Sort the elements of x
rev(x)	Reverse the order of x
names(x)	Name the elements of x
which(x)	Indices of x that are TRUE
which.max(x)	Index of the maximum element of x
which.min(x)	Index of the minimum element of x
append(x, values)	Append values to the vector x
match(x, table)	Index of the first match of each element of x in table
union(x, y)	Union of x and y
intersect(x, y)	Intersection of x and y
setdiff(x, y)	Elements of x that are not in y
setequal(x, y)	Do x and y contain the same elements

Example of Vector Functions

avec <- c(5,2,9,3)
length(avec)

## [1] 4

sort(avec)

## [1] 2 3 5 9

rev(avec)

## [1] 3 9 2 5
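
A few more functions from the table, sketched on the same avec. The second vector bvec is hypothetical and added only so the set functions have something to compare against.

```r
avec <- c(5, 2, 9, 3)
bvec <- c(3, 9, 10)            # hypothetical second vector

which(avec > 4)                # positions where the condition is TRUE: 1 3
match(9, avec)                 # first position of 9 in avec: 3
append(avec, 7, after = 2)     # insert 7 after the 2nd element: 5 2 7 9 3
union(avec, bvec)              # 5 2 9 3 10
intersect(avec, bvec)          # 9 3
setdiff(avec, bvec)            # 5 2
setequal(avec, rev(avec))      # TRUE: order does not matter
```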

Statistical Vector Functions

Functions	Meaning
sum(x)	Sum of x
prod(x)	Product of x
cumsum(x)	Cumulative sum of x
cumprod(x)	Cumulative product of x
min(x)	Minimum element of x
max(x)	Maximum element of x
pmin(x, y)	Pairwise minimum of x and y
pmax(x, y)	Pairwise maximum of x and y
mean(x)	Mean of x
median(x)	Median of x
var(x)	Variance of x
sd(x)	Standard deviation of x
cov(x, y)	Covariance of x and y
cor(x, y)	Correlation of x and y
range(x)	Range of x
quantile(x)	Quantiles of x for given probabilities
summary(x)	Numerical summary of x

Example of Statistical Vector Functions

avec <- c(5,2,9,3)
max(avec)

## [1] 9

which.max(avec)

## [1] 3

mean(avec)

## [1] 4.75

range(avec)

## [1] 2 9
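
Some of the remaining functions in the table, as a sketch on the same avec. The constant vector bvec is hypothetical and is there only to illustrate the pairwise functions.

```r
avec <- c(5, 2, 9, 3)
bvec <- c(4, 4, 4, 4)          # hypothetical comparison vector

cumsum(avec)                   # running total: 5 7 16 19
cumprod(avec)                  # running product: 5 10 90 270
pmin(avec, bvec)               # element-by-element minimum: 4 2 4 3
pmax(avec, bvec)               # element-by-element maximum: 5 4 9 4
median(avec)                   # 4
var(avec)                      # sample variance (n - 1 denominator)
quantile(avec, probs = c(0.25, 0.75))
```

Note that min() and max() return a single number, while pmin() and pmax() return a vector as long as their inputs.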

Coercion

All elements of an atomic vector must be the same type. Otherwise they will be coerced to the most flexible type.
Types from least to most flexible are: logical, integer, double and character.

typeof(c("a", 1))

## [1] "character"

x <- c(FALSE, FALSE, TRUE)
as.numeric(x)

## [1] 0 0 1

as.character(x)

## [1] "FALSE" "FALSE" "TRUE"

typeof(c(1.2,1L))

## [1] "double"
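
Coercion also happens implicitly in arithmetic. A brief sketch (the vector x and the name big are illustrative): a logical vector passed to sum() or mean() is treated as 0/1, which gives counts and proportions.

```r
x <- c(2.1, 4.2, 3.3, 5.4)
big <- x > 3                   # FALSE TRUE TRUE TRUE

sum(big)                       # number of TRUE values: 3
mean(big)                      # proportion of TRUE values: 0.75
typeof(c(TRUE, 2L))            # logical is coerced to "integer"
```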

How to get help

How to get help on an R function:
- ?sum
- help("sum")
Or check with your best friend: Google.

Missing value

Missing values are denoted by NA, which is a logical vector of length 1.
NA will always be coerced to the correct type if used inside c().
(Optional) You can create an NA of a specific type with NA_real_, NA_integer_, NA_character_.

typeof(NA)

## [1] "logical"

typeof(NA_integer_)

## [1] "integer"

typeof(NA_real_)

## [1] "double"

typeof(NA_character_)

## [1] "character"
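
A short sketch of how NA behaves in practice. The vector scores is hypothetical.

```r
typeof(c(NA, 1.5))             # NA is coerced to "double"
typeof(c(NA, "a"))             # NA is coerced to "character"

scores <- c(90, NA, 70)        # hypothetical data with one missing value
is.na(scores)                  # FALSE TRUE FALSE
mean(scores)                   # NA: missing values propagate
mean(scores, na.rm = TRUE)     # 80: drop missing values first
NA == NA                       # NA, so use is.na() to test for missingness
```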

list

Lists are different from atomic vectors because their elements can be of any type, including lists.
Construct lists by using list().
The str() function (short for structure) gives a compact description of any R data structure.

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), list(2.3, 5.9))
str(x)

## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ :List of 2
##   ..$ : num 2.3
##   ..$ : num 5.9

Data structure

Data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they are homogeneous or heterogeneous.

	Homogeneous	Heterogeneous
1d	Atomic vector	List
2d	Matrix	Data frame
nd	Array

Note that R has no 0-dimensional, or scalar, types. Individual numbers or strings are actually vectors of length one.
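
A quick sketch to check these claims. The object names m and arr are hypothetical.

```r
length(3.14)                          # 1: a single number is a length-one vector
length("Biostatistical computing")    # 1: so is a single string
is.vector(3.14)                       # TRUE

dim(1:6)                              # NULL: a plain vector has no dim attribute
m <- matrix(1:6, nrow = 2)
dim(m)                                # 2 3
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)                              # 2 3 4
```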

More on characters

Some special characters:
- " " for space
- "\n" for newline
- "\t" for tab

sentenses <- "R is a great statistical software.\n\nWe use R in Biostatistical computing class!"
sentenses

## [1] "R is a great statistical software.\n\nWe use R in Biostatistical computing class!"

The cat() function interprets these special characters when printing to the console:

cat(sentenses)

## R is a great statistical software.
## 
## We use R in Biostatistical computing class!

convert to upper case or lower case

achar <- "this is a dog."
print(achar)

## [1] "this is a dog."

print(toupper(achar))

## [1] "THIS IS A DOG."

print(tolower("WWW.UFL.EDU"))

## [1] "www.ufl.edu"

length of a string

Use nchar(), not length(), to count the characters in a string.

achar <- "this is a dog."
nchar(achar)

## [1] 14

length(achar)

## [1] 1

vectorized nchar

We can pass a character vector to nchar().

chars <- c("a dog", "a cat", "a gator")
nchar(chars)

## [1] 5 5 7

length(chars)

## [1] 3

obtaining a substring

To take a sub-sequence of characters, use substr() or substring() (short for substring).

chars <- "this is a dog"
substring(chars,1,1)

## [1] "t"

substring(chars,11,13)

## [1] "dog"

Replace part of a string by assigning to a substring.

substring(chars,11,13) <- "cat"
print(chars)

## [1] "this is a cat"
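
One caveat worth sketching: assigning to a substring never changes the length of the string, so a longer replacement is cut off. The name chars2 and the paste0() workaround are illustrative, not part of the original notes.

```r
chars <- "this is a dog"
substring(chars, 11, 13) <- "gator"    # only 3 characters fit in positions 11 to 13
chars                                  # "this is a gat"

# To change the length, rebuild the string instead.
chars2 <- paste0(substring("this is a dog", 1, 10), "gator")
chars2                                 # "this is a gator"
```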

strsplit

You can split a string on a given pattern using the strsplit() function.

strsplit("this is a dog", split=" ")

## [[1]]
## [1] "this" "is"   "a"    "dog"

strsplit("this is a dog", split="")

## [[1]]
##  [1] "t" "h" "i" "s" " " "i" "s" " " "a" " " "d" "o" "g"

Note that the return value is a list, here with only one element. strsplit() is also vectorized.

strsplit(c("this is a dog", "this is a cat", "this is a gator"), split=" ")

## [[1]]
## [1] "this" "is"   "a"    "dog" 
## 
## [[2]]
## [1] "this" "is"   "a"    "cat" 
## 
## [[3]]
## [1] "this"  "is"    "a"     "gator"

paste

paste multiple strings

paste('this','is','a','dog', sep=" ")

## [1] "this is a dog"

paste0('this','is','a','dog')

## [1] "thisisadog"

avec <- c('this','is','a','dog')
nchar(avec)

## [1] 4 2 1 3

paste(c('this','is','a','dog'), collapse = " ")

## [1] "this is a dog"
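
paste() is also vectorized over its arguments. A minimal sketch with made-up labels:

```r
# Shorter arguments are recycled to the length of the longest one.
paste("sample", 1:3, sep = "_")        # "sample_1" "sample_2" "sample_3"

# sep joins the arguments element by element; collapse then joins the results.
paste0("x", 1:3, collapse = "+")       # "x1+x2+x3"
```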

substitution

Use gsub() to replace every occurrence of a pattern within a string.

achar <- "this is a dog"
gsub("dog","cat",achar) ## pattern, replacement, x

## [1] "this is a cat"

gsub(pattern = "dog",replacement="cat",x=achar) ## pattern, replacement, x

## [1] "this is a cat"

vectorize

chars <- c("this is a dog", "this is a cat", "this is a gator")
gsub("this","that",chars) ## pattern, replacement, x

## [1] "that is a dog"   "that is a cat"   "that is a gator"
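
For comparison, a sketch of sub(), which replaces only the first match, and of fixed = TRUE, which treats the pattern as plain text rather than a regular expression. The example strings are made up.

```r
fruit <- "banana"
sub("a", "A", fruit)                   # first match only: "bAnana"
gsub("a", "A", fruit)                  # every match: "bAnAnA"

# "." is a regex metacharacter; fixed = TRUE matches a literal period.
gsub(".", "-", "a.b", fixed = TRUE)    # "a-b"
gsub(".", "-", "a.b")                  # "---"
```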

Regular expression

A regular expression, or regex, is a structured string used to match specific patterns in text.
The grep() function scans a character vector and returns the indices of the elements that match a regex.

chars <- c("this is a dog", "this is a cat", "this is a gator")
grep("gator", chars)

## [1] 3

grep("this", chars)

## [1] 1 2 3
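
Two related forms, sketched on the same chars: grepl() returns a logical vector instead of indices, and value = TRUE returns the matching strings themselves.

```r
chars <- c("this is a dog", "this is a cat", "this is a gator")

grepl("gator", chars)                  # FALSE FALSE TRUE
grep("gator", chars, value = TRUE)     # "this is a gator"
chars[grepl("dog|cat", chars)]         # logical subsetting with the match result
```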

Regular expression 2

match dog or cat

chars <- c("this is a dog", "this is a cat", "this is a gator")
grep("dog|cat", chars)

## [1] 1 2

Metacharacters

Metacharacters are characters with a special meaning in a regex.
Square brackets match any one of the characters inside them.

chars <- c("this is a dog", "this is a cat", "this is a gator")
grep("[bced]", chars)

## [1] 1 2

A dash inside square brackets indicates a range.

chars <- c("this is a dog", "this is a cat", "this is a gator")
grep("[b-d]", chars)

## [1] 1 2

grep("[0-9]", chars)

## integer(0)

Metacharacters 2

"[[:alnum:]]" matches any alphanumeric character, typically the same as "[a-zA-Z0-9]".
"[[:punct:]]" matches any punctuation mark.
"[[:space:]]" matches any white space character (space, tab, or line break).
A caret at the start of a bracket expression matches any character except those listed.
- "[^0-9]" matches any character that is not a digit from 0 to 9.
- "[^aeiou]" matches any character that is not a lower case vowel.
A period "." matches any single character.
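
A sketch of these patterns in use. The test strings are made up. Note that in R the named classes are written inside an extra pair of brackets, e.g. "[[:punct:]]".

```r
chars <- c("this is a dog", "Dogs: 2, cats: 3!", "")   # hypothetical strings

grepl("[[:punct:]]", chars)            # FALSE TRUE FALSE
grepl("[[:space:]]", chars)            # TRUE TRUE FALSE
grepl("[0-9]", chars)                  # FALSE TRUE FALSE
grepl("[[:alnum:]]", c("abc", "!!!"))  # TRUE FALSE

gsub("[^aeiou]", "", "biostatistics")  # keep only lower case vowels: "ioaii"
gsub(".", "x", "dog")                  # every character matches ".": "xxx"
```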

Loop

for loop

for(i in 1:10){
  cat(i," ")
}

## 1  2  3  4  5  6  7  8  9  10

while loop

i <- 1
while(i <= 10){
  cat(i," ")
  i <- i + 1
}

## 1  2  3  4  5  6  7  8  9  10
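
When looping over the elements of a vector, seq_along() is a common alternative to 1:length(x). A minimal sketch with hypothetical vectors chars and empty:

```r
chars <- c("a dog", "a cat", "a gator")
for (i in seq_along(chars)) {
  cat(i, ":", chars[i], "\n")          # index and element on each line
}

# For an empty vector, 1:length(x) would run over 1 and 0;
# seq_along() gives an empty sequence, so the body never runs.
empty <- character(0)
for (i in seq_along(empty)) {
  print(i)
}
```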

Attributes

Attributes are used to store metadata about an object.
They can be accessed individually with attr() or all at once with attributes().
Construct a new object with attributes using the structure() function, e.g.
- structure(1:10, myAttribute = "this is a vector")

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")

## [1] "This is a vector"

attributes(y)

## $my_attribute
## [1] "This is a vector"
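
The structure() form, sketched with a hypothetical object z. It also shows that custom attributes are generally dropped when the vector is subset or summarised.

```r
z <- structure(1:10, my_attribute = "This is a vector")
attr(z, "my_attribute")        # "This is a vector"

attributes(z[1:3])             # NULL: the custom attribute is lost on subsetting
attributes(sum(z))             # NULL
```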

Attributes 2

Three special attributes have a specific accessor function to get and set values.
- names(): a character vector giving each element a name.
- dim(): used to turn vectors into matrices and arrays.
- class(): used to implement the S3 object system.

y <- c(a=1,2:10)
names(y)

##  [1] "a" ""  ""  ""  ""  ""  ""  ""  ""  ""

names(y)[2] <- 'b'
dim(y)

## NULL

dim(y) <- c(2,5)
print(y)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

class(y)

## [1] "matrix"

(In R 4.0.0 and later, this prints "matrix" "array".)

factor

A factor is a vector that can contain only predefined values.
Factors are used to store categorical data.
Factors are built on top of integer vectors using two attributes:
- class(), "factor", which makes them behave differently from regular integer vectors.
- levels(), which defines the set of allowed values.

x <- factor(c("a", "b", "b", 'a'))
x

## [1] a b b a
## Levels: a b

class(x)

## [1] "factor"

levels(x)

## [1] "a" "b"
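
A sketch of what the two attributes imply in practice, reusing the same factor x: the underlying storage is integer codes, and a value outside the levels cannot be stored.

```r
x <- factor(c("a", "b", "b", "a"))
typeof(x)                      # "integer": the underlying storage
as.integer(x)                  # the level codes: 1 2 2 1

# "c" is not an allowed level: R gives a warning and stores NA instead.
x[2] <- "c"
is.na(x[2])                    # TRUE
```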

factors 2

Factors are very useful when some classes are missing from the data: unobserved levels still appear in summaries.

sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels=c("m", "f"))
table(sex_char)

## sex_char
## m 
## 3

table(sex_factor)

## sex_factor
## m f 
## 3 0

Matrices

Create a matrix with colnames and rownames

a <- matrix(1:6, ncol=3, nrow=2, dimnames = list(c("row1", "row2"),
                               c("C.1", "C.2", "C.3")))
a

##      C.1 C.2 C.3
## row1   1   3   5
## row2   2   4   6

colnames(a)

## [1] "C.1" "C.2" "C.3"

rownames(a)

## [1] "row1" "row2"

ncol(a)

## [1] 3

nrow(a)

## [1] 2

Matrices 2

A matrix can also be created by adding a dim() attribute to an atomic vector.

c <- 1:6
dim(c) <- c(3,2)
c

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

dim(c) <- c(2,3)
c

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Data frames

A data frame is a very popular way of storing data in R.
A data frame is a list of equal-length vectors, so it shares properties of both a matrix and a list.
- names() and colnames() return the same thing
- length() and ncol() return the same thing

df <- data.frame(x=1:3, y=c("a","b","c"),z=0)
str(df)

## 'data.frame':    3 obs. of  3 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ z: num  0 0 0

cat(names(df), "same as", colnames(df))

## x y z same as x y z

cat(length(df), "same as", ncol(df))

## 3 same as 3

Data frames 2

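In R versions before 4.0.0, data.frame()'s default behaviour turns strings into factors (the output shown here comes from such a version; since R 4.0.0 the default is stringsAsFactors = FALSE).
- Use stringsAsFactors = FALSE to suppress the conversion
- or globally set options(stringsAsFactors=FALSE)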

df1 <- data.frame(x=1:3, y=c("a","b","c"),z=0)
str(df1)

## 'data.frame':    3 obs. of  3 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ z: num  0 0 0

df2 <- data.frame(x=1:3, y=c("a","b","c"),z=0, stringsAsFactors=FALSE)
str(df2)

## 'data.frame':    3 obs. of  3 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
##  $ z: num  0 0 0

Subsetting – Atomic vectors

Example: x <- c(2.1, 4.2, 3.3, 5.4). How can we obtain a subset of this vector?
- Positive integers: return elements at the specified positions.
- Negative integers: omit elements at the specified positions.
- Logical vectors: select elements where the corresponding logical value is TRUE.
- Nothing: return the original vector.
- Zero: return a zero-length vector.
- Character vectors: return elements with matching names.

Subsetting – Atomic vectors 2

### subsetting
## atomic vectors
x <- c(2.1, 4.2, 3.3, 5.4)

# Positive integer
x[c(3,1)]

## [1] 3.3 2.1

order(x)

## [1] 1 3 2 4

x[order(x)]

## [1] 2.1 3.3 4.2 5.4

x[c(1,1,1)]

## [1] 2.1 2.1 2.1

x[c(2.1, 2.9)] # non-integer indices are truncated toward zero

## [1] 4.2 4.2

# negative integer
x[-c(1, 3)]

## [1] 4.2 5.4

# logical vector
x[c(TRUE, TRUE, FALSE, FALSE)]

## [1] 2.1 4.2

x > 3

## [1] FALSE  TRUE  TRUE  TRUE

x[x > 3]

## [1] 4.2 3.3 5.4

x[c(TRUE, TRUE, NA, FALSE)]

## [1] 2.1 4.2  NA

# nothing
x[]

## [1] 2.1 4.2 3.3 5.4

# zero
x[0]

## numeric(0)
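
The character-vector case needs a named vector. A minimal sketch, assuming a hypothetical named version y of the same values:

```r
# character vector: select by name
y <- c(a = 2.1, b = 4.2, c = 3.3, d = 5.4)   # hypothetical named vector
y[c("d", "a")]                 # elements named d and a, in that order
y[c("a", "a")]                 # names can be repeated, like positive integers
y["z"]                         # a name that does not exist gives NA
```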

Subsetting – Matrices and arrays

Supply a 1d index for each dimension, separated by a comma.

a <- matrix(1:9, nrow=3)
colnames(a) <- c("A","B","C")
a

##      A B C
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9

a[1:2,]

##      A B C
## [1,] 1 4 7
## [2,] 2 5 8

a[c(T,F,T), c("B","A")]

##      B A
## [1,] 4 1
## [2,] 6 3

a[,-2]

##      A C
## [1,] 1 7
## [2,] 2 8
## [3,] 3 9

Subsetting – Data frame

Data frames possess the characteristics of both lists and matrices.

options(stringsAsFactors = FALSE)
df <- data.frame(x=1:2, y=2:1, z=letters[1:2])
df[df$x==2,]

##   x y z
## 2 2 1 b

df[c("x","z")] # like a list

##   x z
## 1 1 a
## 2 2 b

df[,c("x","z")] # like a matrix

##   x z
## 1 1 a
## 2 2 b

Subsetting – simplifying vs preserving

There are two types of subsetting: simplifying and preserving.
- Simplifying subsetting returns the simplest possible data structure that can represent the output.
- Preserving subsetting keeps the structure of the output the same as the input.

Data type	Simplifying	Preserving
List	x[[1]]	x[1]
Vector	x[[1]]	x[1]
Factor	x[1:2, drop=T]	x[1:2]
Data frame	x[,1] or x[[1]]	x[, 1, drop=F] or x[1]
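
A sketch of a few rows of the table. The objects lst, df, and f are hypothetical.

```r
lst <- list(a = 1:3, b = "text")
lst[1]                         # preserving: still a list, of length 1
lst[[1]]                       # simplifying: the integer vector itself

df <- data.frame(x = 1:3, y = 3:1)
df[, 1]                        # simplifying: an integer vector
df[, 1, drop = FALSE]          # preserving: a one-column data frame

f <- factor(c("a", "b", "c"))
levels(f[1:2])                 # preserving: unused level "c" is kept
levels(f[1:2, drop = TRUE])    # simplifying: unused levels are dropped
```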

Matching

grades <- c(1,2,2,3,1)
info <- data.frame(grade=3:1, desc=c("Excellent", "Good", "Poor"), fail=c(F,F,T))
id <- match(grades, info$grade)
id

## [1] 3 2 2 1 3

info[id,]

##     grade      desc  fail
## 3       1      Poor  TRUE
## 2       2      Good FALSE
## 2.1     2      Good FALSE
## 1       3 Excellent FALSE
## 3.1     1      Poor  TRUE
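
When only one column is needed, the same lookup can be sketched with a named character vector instead of match(). The name lookup is hypothetical, and this is an alternative to the approach above rather than a replacement for it.

```r
grades <- c(1, 2, 2, 3, 1)

# Names are the grade values; elements are the descriptions.
lookup <- c("1" = "Poor", "2" = "Good", "3" = "Excellent")

# Subset by name, then drop the names from the result.
unname(lookup[as.character(grades)])
# "Poor" "Good" "Good" "Excellent" "Poor"
```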

Biostatistical Computing, PHC 6068

Basics about R
