Plankton Data Analysis

4 minute read

This is a homework assignment from Math 50 at Dartmouth College.

Question-1.

  • Is maximum-likelihood estimator σ̃2 of σ2 an unbiased estimator ? Verify your answer. Comment on the change of the value of E(σ̃2) − σ2 as n goes to infinity.

Question-2.

  • Consider the shear strength vs age relation using the propellant data.
prop<-read.table("https://math.dartmouth.edu/~m50f17/propellant.csv", header=T, sep=",")
shearS<-prop$ShearS
age<-prop$Age

plot(age, shearS, xlab = "Propellant Age (weeks)", ylab = "Shear S. (psi)", main = "Rocket Propellant")
fitted <- lm(shearS ~ age)

ageList <- seq(0,25,0.5)
cList <- predict(fitted, list(age = ageList), int = "c", level = 0.95)
pList <- predict(fitted, list(age = ageList), int = "p", level = 0.95)

matlines(ageList, pList, lty='solid' ,  col = "red")

shTest <- shearS^5 

library(MASS)
bc = boxcox(shTest ~ age)

lambda <- bc$x[which.max(bc$y)]

1 doesn’t seem to be in the approximate CI, therefore Box-Cox method suggests a transformation with λ = 0.3030303.

Part (a)

  • Recalculate the coefficients of the fitted linear regression model using the vector equations we obtained.
prop<-read.table("https://math.dartmouth.edu/~m50f17/propellant.csv", header=T, sep=",")
ShearS<-prop$ShearS
Age<-prop$Age

c=rep(1,length(Age))
x<-cbind(c,Age)
b_hat = solve((t(x) %*% x)) %*% t(x) %*% ShearS

β0: 2627.822359

β1: -37.1535909

Part (b)

  • Suppose that the expectation of the initial shear strength is known to be 2400. Write the corresponding model (should involve only one parameter β1). Calculate 95% CI on β1.
b1_hat_new = sum((ShearS-2400)*Age)/sum((Age)^2)
yhat = 2400 + b1_hat_new*Age
MSres = sum((ShearS - yhat)^2)/length(ShearS)
lower_lim = b1_hat_new - qt(0.975,length(ShearS)-1)*(MSres/sum(Age^2))^0.5
upper_lim = b1_hat_new + qt(0.975,length(ShearS)-1)*(MSres/sum(Age^2))^0.5

Confidence interval with initial shear strength set to 2400: (-28.5287729, -19.7460905)

Example : Phytoplankton Population

  • A scientist is trying to model the relation between phytoplankton population in the city public water supply and concentration of two substances. The sample data is at : https://math.dartmouth.edu/~m50f17/phytoplankton.csv

  • where headers are
    • pop : population of phytoplankton (y)
    • subs1 : concentration of substance-1 (x1)
    • subs2 : concentration of substance-2 (x2)
  • Lets consider a guessed model
    y = 200 + 10x1 − 15x2

  • Below is the corresponding code to plot the scatter diagram and the above plane.
# Note: Run the following in R console if you get errors in plotting or library loading : 
#  install.packages("scatterplot3d") 
#  install.packages("plot3D") 

library("plot3D")
library("scatterplot3d")

# Loading data 
pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",")
pop <- pData$pop
subs1 <- pData$subs1
subs2 <- pData$subs2

# Create a mesh
meshP <- mesh( seq(min(subs1),max(subs1),0.03)  , seq(min(subs2),max(subs2),0.03) )
x1Mesh <- meshP$x 
x2Mesh <- meshP$y 

myModel <- 200 + 10*x1Mesh  - 15 *x2Mesh   

# Below is the code to plot the scatter diagram with red markers and your model
# You need to set two variables before calling : 
# myModel : your model  
# PLOTTING # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 15 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

# You can also change the view angle to visually test the model 
# PLOTTING # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 125 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

Question-3

  • For the phytoplankton population data use regression model of the form y = β0 + β1x1 + β2x2 + ε.
  • Fit a multiple linear regression model (using (3.13) and hat matrix H). Plot scatter diagram and fitted plane. Print predicted coefficients.
  • Calculate R2 and adjusted R2.
pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",")
pop <- pData$pop
subs1 <- pData$subs1
subs2 <- pData$subs2

c = rep(1,length(subs1))
x = cbind(c,subs1,subs2)
H = solve((t(x) %*% x)) %*% t(x)

b_hat = (H %*% pop)

β0: 182.3173554

β1: 11.9236774

β2: -12.642746

meshP <- mesh( seq(min(subs1),max(subs1),0.03)  , seq(min(subs2),max(subs2),0.03) )
x1Mesh <- meshP$x 
x2Mesh <- meshP$y 
myModel <- b_hat[1] + b_hat[2]*x1Mesh + b_hat[3]*x2Mesh

sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 15 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

yhat = b_hat[1] + b_hat[2]*subs1 + b_hat[3]*subs2
SSr = sum((yhat - mean(pop))^2)
SSres = sum((pop - yhat)^2)
SSt = sum((pop - mean(pop))^2)
Rsq = SSr/SSt
Rsq_adj = 1 - (SSres/(length(pop)-length(b_hat)))/(SSt/(length(pop)-1))

R2: 0.3609532

R2adj: 0.3522587

Question-4

  • This time use a more general model for the phytoplankton population data ; y = β0 + β1x1 + β2x2 + β11x12 + β22x22 + β12x1x2 + ε.
  • Fit a multiple linear regression model (using (3.13) and hat matrix H). Plot scatter diagram and fitted surface. Print predicted coefficients.
  • Calculate R2 and adjusted R2. Compare with the previous model.
pData <- read.table("https://math.dartmouth.edu/~m50f17/phytoplankton.csv", header=T, sep=",")
pop <- pData$pop
subs1 <- pData$subs1
subs2 <- pData$subs2

c = rep(1,length(subs1))
x = cbind(c,subs1,subs2,subs1^2,subs2^2,subs1*subs2)
H = solve((t(x) %*% x)) %*% t(x)

b_hat = H %*% pop
cat("beta0: ", b_hat[1],"\nbeta1: ",b_hat[2],"\nbeta2: ",b_hat[3],"\nbeta3: ",b_hat[4],"\nbeta4: " ,b_hat[5],"\nbeta5: " ,b_hat[6],"\n")
## beta0:  -262.3596 
## beta1:  62.80095 
## beta2:  170.4656 
## beta3:  -30.65678 
## beta4:  -19.38499 
## beta5:  3.688058
meshP <- mesh( seq(min(subs1),max(subs1),0.03)  , seq(min(subs2),max(subs2),0.03) )
x1Mesh <- meshP$x
x2Mesh <- meshP$y
myModel <- b_hat[1] + b_hat[2]*x1Mesh + b_hat[3]*x2Mesh + b_hat[4]*x1Mesh^2 + b_hat[5]*x2Mesh^2 + b_hat[6]*x1Mesh*x2Mesh

sc1 <- scatterplot3d(subs2,subs1,pop, pch=17 , type = 'p', angle = 15 , highlight.3d = T ) 
sc1$points3d (x2Mesh,x1Mesh, myModel, cex=.02, col="blue")

yhat = b_hat[1] + b_hat[2]*subs1 + b_hat[3]*subs2 + b_hat[4]*subs1^2 + b_hat[5]*subs2^2 + b_hat[6]*subs1*subs2
SSr = sum((yhat - mean(pop))^2)
SSres = sum((pop - yhat)^2)
SSt = sum((pop - mean(pop))^2)
Rsq = SSr/SSt
Rsq_adj = 1 - (SSres/(length(pop)-length(b_hat)))/(SSt/(length(pop)-1))

R2: 0.9431728

R2adj: 0.9411997

The R2 found in Question-4 is far larger than the R2 found in Question-3. The same relation is true when comparing the respective R2-adjusted values. This means the model from Question-4 better approximates population than the model from Question-3 does.

Categories:

Updated: