BranchGLM Vignette

Fitting GLMs

  • BranchGLM() allows fitting of gaussian, binomial, gamma, and Poisson GLMs with a variety of links available.
  • Parallel computation can also be done to speed up the fitting process, but it is only useful for larger datasets.

Optimization methods

  • The optimization method can be specified, the default method is fisher scoring, but BFGS and L-BFGS are also available.
  • BFGS and L-BFGS typically perform better when there are many predictors in the model (at least 50 predictors), otherwise fisher scoring is typically faster.
  • The grads argument is for L-BFGS only and it is the number of gradients that are stored at a time and are used to approximate the inverse information. The default value for this is 10, but another common choice is 5.
  • The tol argument controls how strict the convergence criteria are, lower values of this will lead to more accurate results, but may also be slower.
  • The method argument is ignored for linear regression and the OLS solution is used.

Initial values

  • Initial values for the coefficient estimates may be specified via the init argument.
  • If no initial values are specified, then the initial values are estimated via linear regression with the response variable transformed by the link function.

Parallel computation

  • Parallel computation can be employed via OpenMP by setting the parallel argument to TRUE and setting the nthreads argument to the desired number of threads used.
  • For smaller datasets this can actually slow down the model fitting process, so parallel computation should only be used for larger datasets.

Families

Gaussian

  • Permissible links for the gaussian family are
    • identity, which results in linear regression
    • inverse
    • log
    • square root (sqrt)
  • The most commonly used link function for the gaussian family is the identity link.
  • The dispersion parameter for this family is estimated by using the mean square error.
# Loading in BranchGLM
library(BranchGLM)

# Fitting gaussian regression models for mtcars dataset
cars <- mtcars

## Identity link
BranchGLM(mpg ~ ., data = cars, family = "gaussian", link = "identity")
#> Results from gaussian regression with identity link function 
#> Using the formula mpg ~ .
#> 
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept) 12.30000   18.72000  0.6573  0.51810  
#> cyl         -0.11140    1.04500 -0.1066  0.91610  
#> disp         0.01334    0.01786  0.7468  0.46350  
#> hp          -0.02148    0.02177 -0.9868  0.33500  
#> drat         0.78710    1.63500  0.4813  0.63530  
#> wt          -3.71500    1.89400 -1.9610  0.06325 .
#> qsec         0.82100    0.73080  1.1230  0.27390  
#> vs           0.31780    2.10500  0.1510  0.88140  
#> am           2.52000    2.05700  1.2250  0.23400  
#> gear         0.65540    1.49300  0.4389  0.66520  
#> carb        -0.19940    0.82880 -0.2406  0.81220  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 4.609 for the log-likelihood
#> Dispersion parameter taken to be 7.024 for standard errors
#> 32 observations used to fit the model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 147.5 on 21 degrees of freedom
#> AIC: 163.7

Gamma

  • Permissible links for the gamma family are
    • identity
    • inverse, this is the canonical link for the gamma family
    • log
    • square root (sqrt)
  • The most commonly used link functions for the gamma family are inverse and log.
  • The dispersion parameter for this family is estimated via maximum likelihood,
    similar to the MASS::gamma.dispersion() function.
# Fitting gamma regression models for mtcars dataset
## Inverse link
GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "inverse")
GammaFit
#> Results from gamma regression with inverse link function 
#> Using the formula mpg ~ .
#> 
#>               Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)  6.794e-02  3.843e-02  1.7680  0.09164 .
#> cyl         -1.760e-03  2.424e-03 -0.7263  0.47570  
#> disp         7.947e-06  4.379e-05  0.1815  0.85770  
#> hp           6.724e-05  5.045e-05  1.3330  0.19680  
#> drat         4.283e-04  3.398e-03  0.1261  0.90090  
#> wt           9.224e-03  4.186e-03  2.2030  0.03886 *
#> qsec        -1.739e-03  1.452e-03 -1.1980  0.24420  
#> vs           3.086e-04  4.139e-03  0.0746  0.94130  
#> am          -6.305e-04  4.287e-03 -0.1471  0.88450  
#> gear        -4.882e-03  3.210e-03 -1.5210  0.14320  
#> carb         1.025e-03  1.844e-03  0.5560  0.58410  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.008682 for the log-likelihood
#> Dispersion parameter taken to be 0.01347 for standard errors
#> 32 observations used to fit the model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0.2782 on 21 degrees of freedom
#> AIC: 152.3
#> Algorithm converged in 3 iterations using Fisher's scoring

## Log link
GammaFit <- BranchGLM(mpg ~ ., data = cars, family = "gamma", link = "log")
GammaFit
#> Results from gamma regression with log link function 
#> Using the formula mpg ~ .
#> 
#>               Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)  2.754e+00  8.709e-01  3.1620 0.004696 **
#> cyl          7.509e-03  4.862e-02  0.1544 0.878700   
#> disp         6.521e-05  8.309e-04  0.0785 0.938200   
#> hp          -8.898e-04  1.013e-03 -0.8785 0.389600   
#> drat         2.366e-02  7.609e-02  0.3110 0.758900   
#> wt          -1.655e-01  8.815e-02 -1.8780 0.074380 . 
#> qsec         3.055e-02  3.401e-02  0.8984 0.379200   
#> vs          -1.141e-03  9.792e-02 -0.0117 0.990800   
#> am           5.421e-02  9.570e-02  0.5664 0.577100   
#> gear         6.014e-02  6.948e-02  0.8655 0.396500   
#> carb        -2.277e-02  3.856e-02 -0.5905 0.561100   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 0.009637 for the log-likelihood
#> Dispersion parameter taken to be 0.01521 for standard errors
#> 32 observations used to fit the model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 0.3089 on 21 degrees of freedom
#> AIC: 155.6
#> Algorithm converged in 2 iterations using Fisher's scoring

Poisson

  • Permissible links for the Poisson family are
    • identity
    • log, this is the canonical link for the Poisson family
    • square root (sqrt)
  • The most commonly used link function for the Poisson family is the log link.
  • The dispersion parameter for this family is always 1.
# Fitting poisson regression models for warpbreaks dataset
warp <- warpbreaks

## Log link
BranchGLM(breaks ~ ., data = warp, family = "poisson", link = "log")
#> Results from poisson regression with log link function 
#> Using the formula breaks ~ .
#> 
#>             Estimate Std. Error z value  Pr(>|z|)    
#> (Intercept)  3.69200    0.04541  81.300 < 2.2e-16 ***
#> woolB       -0.20600    0.05157  -3.994 6.490e-05 ***
#> tensionM    -0.32130    0.06027  -5.332 9.729e-08 ***
#> tensionH    -0.51850    0.06396  -8.107 5.209e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 1 for the log-likelihood
#> Dispersion parameter taken to be 1 for standard errors
#> 54 observations used to fit the model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 210.4 on 50 degrees of freedom
#> AIC: 493.1
#> Algorithm converged in 3 iterations using Fisher's scoring

Binomial

  • Permissible links for the binomial family are
    • cloglog
    • log
    • logit, this is the canonical link for the binomial family
    • probit
  • The most commonly used link functions for the binomial family are logit and probit.
  • The dispersion parameter for this family is always 1.
# Fitting binomial regression models for toothgrowth dataset
Data <- ToothGrowth

## Logit link
BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")
#> Results from binomial regression with logit link function 
#> Using the formula supp ~ .
#> 
#>             Estimate Std. Error z value Pr(>|z|)   
#> (Intercept)   1.5380     0.7860   1.956 0.050440 . 
#> len          -0.2127     0.0728  -2.921 0.003484 **
#> dose          2.0890     0.8497   2.458 0.013970 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 1 for the log-likelihood
#> Dispersion parameter taken to be 1 for standard errors
#> 60 observations used to fit the model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 72.33 on 57 degrees of freedom
#> AIC: 78.33
#> Algorithm converged in 4 iterations using Fisher's scoring

## Probit link
BranchGLM(supp ~ ., data = Data, family = "binomial", link = "probit")
#> Results from binomial regression with probit link function 
#> Using the formula supp ~ .
#> 
#>             Estimate Std. Error z value Pr(>|z|)   
#> (Intercept)  0.94020    0.46900   2.005 0.045000 * 
#> len         -0.13180    0.04195  -3.142 0.001678 **
#> dose         1.30800    0.49710   2.631 0.008502 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Dispersion parameter taken to be 1 for the log-likelihood
#> Dispersion parameter taken to be 1 for standard errors
#> 60 observations used to fit the model
#> (0 observations removed due to missingness)
#> 
#> Residual Deviance: 72.19 on 57 degrees of freedom
#> AIC: 78.19
#> Algorithm converged in 5 iterations using Fisher's scoring

Functions for binomial GLMs

  • BranchGLM has some utility functions for binomial GLMs
    • Table() creates a confusion matrix based on the predicted classes and observed classes
    • ROC() creates an ROC curve which can be plotted with plot()
    • AUC() and Cindex() calculate the area under the ROC curve
    • MultipleROCCurves() allows for the plotting of multiple ROC curves on the same plot

Table

# Fitting logistic regression model for toothgrowth dataset
catFit <- BranchGLM(supp ~ ., data = Data, family = "binomial", link = "logit")

Table(catFit)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

ROC

# Creating ROC curve
catROC <- ROC(catFit)

plot(catROC, main = "ROC Curve", col = "indianred")

Cindex/AUC

# Getting Cindex/AUC
Cindex(catFit)
#> [1] 0.7127778

AUC(catFit)
#> [1] 0.7127778

MultipleROCPlots

# Showing ROC plots for logit, probit, and cloglog
probitFit <- BranchGLM(supp ~ . ,data = Data, family = "binomial", 
                       link = "probit")

cloglogFit <- BranchGLM(supp ~ . ,data = Data, family = "binomial", 
                       link = "cloglog")

MultipleROCCurves(catROC, ROC(probitFit), ROC(cloglogFit), 
                  names = c("Logistic ROC", "Probit ROC", "Cloglog ROC"))

Using predictions

  • For each of the methods used in this section predicted probabilities and observed classes can also be supplied instead of the BranchGLM object.

preds <- predict(catFit)

Table(preds, Data$supp)
#> Confusion matrix:
#> ----------------------
#>             Predicted
#>              OJ   VC
#> 
#>          OJ  17   13
#> Observed
#>          VC  7    23
#> 
#> ----------------------
#> Measures:
#> ----------------------
#> Accuracy:  0.6667 
#> Sensitivity:  0.7667 
#> Specificity:  0.5667 
#> PPV:  0.6389

AUC(preds, Data$supp)
#> [1] 0.7127778

ROC(preds, Data$supp) |> plot(main = "ROC Curve", col = "deepskyblue")

Useful functions

  • BranchGLM has many utility functions for GLMs such as
    • coef() to extract the coefficients
    • logLik() to extract the log likelihood
    • AIC() to extract the AIC
    • BIC() to extract the BIC
    • predict() to obtain predictions from the fitted model
  • The coefficients, standard errors, Wald test statistics, and p-values are stored in the coefficients slot of the fitted model
# Predict method
predict(GammaFit)
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
#>            21.69333            21.15566            25.92103            20.27496 
#>   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
#>            17.15124            19.83386            14.55730            22.26169 
#>            Merc 230            Merc 280           Merc 280C          Merc 450SE 
#>            23.89767            18.75674            19.10374            15.10160 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
#>            16.07372            16.13726            12.20297            11.70443 
#>   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
#>            11.60706            27.90334            30.14896            30.14451 
#>       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
#>            23.41000            17.02230            17.63782            13.90043 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
#>            16.06923            28.65170            26.61054            28.59460 
#>      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
#>            17.89013            19.53099            14.11845            23.30350

# Accessing coefficients matrix
GammaFit$coefficients
#>                  Estimate   Std. Error     t value    Pr(>|t|)
#> (Intercept)  2.754211e+00 0.8709425717  3.16233309 0.004696035
#> cyl          7.509180e-03 0.0486249041  0.15443073 0.878744694
#> disp         6.521183e-05 0.0008309089  0.07848253 0.938187259
#> hp          -8.897889e-04 0.0010128913 -0.87846431 0.389632603
#> drat         2.366270e-02 0.0760938573  0.31096731 0.758891171
#> wt          -1.655137e-01 0.0881470376 -1.87769980 0.074381777
#> qsec         3.055119e-02 0.0340061853  0.89840094 0.379157557
#> vs          -1.140739e-03 0.0979227190 -0.01164938 0.990815316
#> am           5.420526e-02 0.0956958853  0.56643253 0.577103946
#> gear         6.013892e-02 0.0694813389  0.86554054 0.396522931
#> carb        -2.277255e-02 0.0385618276 -0.59054645 0.561126613