Introduction to ‘drglm’

The package ‘drglm’ allows users to fit GLMs to big data sets that can be loaded into memory. It uses the popular “Divide and Recombine” (D&R) approach: the data are divided into subsets, a GLM is fitted to each subset, and the subset results are recombined into a single set of estimates. Let’s generate a data set which is not that big, but serves our purpose.

Generating a Data Set

set.seed(123)
#Number of rows to be generated
n <- 1000000
#creating dataset
dataset <- data.frame(
  Var_1 = round(rnorm(n, mean = 50, sd = 10)),
  Var_2 = round(rnorm(n, mean = 7.5, sd = 2.1)),
  Var_3 = as.factor(sample(c("0", "1"), n, replace = TRUE)),
  Var_4 = as.factor(sample(c("0", "1", "2"), n, replace = TRUE)),
  Var_5 = sample(0:6, n, replace = TRUE),
  Var_6 = round(rnorm(n, mean = 60, sd = 5))
)

This data set contains six variables: three continuous variables generated from normal distributions (Var_1, Var_2 and Var_6), two categorical variables (Var_3 and Var_4) and one count variable (Var_5). Before fitting GLMs with ‘drglm’, a minimal hand-rolled sketch of the Divide and Recombine idea is shown below; after that, we shall fit different GLMs to this data set.
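
This sketch is an illustration only: it splits the rows into k parts, fits an ordinary glm() to each part, and naively averages the coefficient estimates. The exact recombination rule used by ‘drglm’ (which also recombines standard errors) may differ from this simple average.

# Hand-rolled Divide and Recombine sketch (illustration only)
k <- 10
parts <- split(dataset, rep_len(1:k, nrow(dataset)))
fits <- lapply(parts, function(d)
  coef(glm(Var_1 ~ Var_2 + Var_3 + Var_4 + Var_5 + Var_6,
           data = d, family = gaussian())))
Reduce(`+`, fits) / k  # naively recombined (averaged) coefficient estimates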

Fitting Multiple Linear Regression Model

Now, we shall fit a multiple linear regression model to the data set, taking Var_1 as the response variable and all other variables as explanatory variables.

nmodel <- drglm::drglm(Var_1 ~ Var_2 + Var_3 + Var_4 + Var_5 + Var_6,
                       data = dataset, family = "gaussian",
                       fitfunction = "speedglm", k = 10)
#Output
print(nmodel)
##                  Estimate standard error     t value  Pr(>|t|)
## (Intercept) 49.9317067180    0.127368889 392.0243567 0.0000000
## Var_2       -0.0045654674    0.004721297  -0.9669943 0.3335469
## Var_31       0.0141079507    0.020006611   0.7051644 0.4807079
## Var_41      -0.0071647241    0.024493999  -0.2925094 0.7698972
## Var_42       0.0029739494    0.024507130   0.1213504 0.9034135
## Var_5       -0.0005907645    0.005001232  -0.1181238 0.9059696
## Var_6        0.0015528677    0.001996831   0.7776662 0.4367658
##                        95% CI
## (Intercept) [ 49.68 , 50.18 ]
## Var_2           [ -0.01 , 0 ]
## Var_31       [ -0.03 , 0.05 ]
## Var_41       [ -0.06 , 0.04 ]
## Var_42       [ -0.05 , 0.05 ]
## Var_5        [ -0.01 , 0.01 ]
## Var_6            [ 0 , 0.01 ]
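
Because this example data set still fits comfortably in memory, one can, purely as a sanity check, also fit the full-data least-squares model and compare; its coefficients should be close to the divide-and-recombine estimates above (full_fit is just an illustrative object name).

# Full-data fit for comparison with the divide-and-recombine estimates
full_fit <- lm(Var_1 ~ Var_2 + Var_3 + Var_4 + Var_5 + Var_6, data = dataset)
coef(summary(full_fit))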

Fitting Binomial Regression (Logistic Regression) Model

Now, we shall fit a binomial (logistic) regression model to the data set, taking Var_3 as the response variable and all other variables as explanatory variables.

bmodel <- drglm::drglm(Var_3 ~ Var_1 + Var_2 + Var_4 + Var_5 + Var_6,
                       data = dataset, family = "binomial",
                       fitfunction = "speedglm", k = 10)
#Output

print(bmodel)
##                  Estimate Odds Ratio standard error    z value   Pr(>|z|)
## (Intercept) -0.0509893145  0.9502888   0.0272855408 -1.8687302 0.06166036
## Var_1        0.0001409687  1.0001410   0.0001999610  0.7049813 0.48082188
## Var_2       -0.0010358477  0.9989647   0.0009440378 -1.0972524 0.27253109
## Var_41      -0.0008665869  0.9991338   0.0048975644 -0.1769424 0.85955361
## Var_42       0.0008942254  1.0008946   0.0049002168  0.1824869 0.85520063
## Var_5       -0.0006510342  0.9993492   0.0010000010 -0.6510335 0.51502484
## Var_6        0.0008820860  1.0008825   0.0003992730  2.2092302 0.02715863
##                       95% CI
## (Intercept)     [ -0.1 , 0 ]
## Var_1              [ 0 , 0 ]
## Var_2              [ 0 , 0 ]
## Var_41      [ -0.01 , 0.01 ]
## Var_42      [ -0.01 , 0.01 ]
## Var_5              [ 0 , 0 ]
## Var_6              [ 0 , 0 ]
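
The confidence intervals printed above appear to be on the coefficient (log-odds) scale and are heavily rounded. To report, say, Var_6 on the odds-ratio scale, one can exponentiate the estimate and (assuming Wald-type limits, estimate +/- 1.96 * SE) its interval endpoints:

# Converting the Var_6 coefficient to the odds-ratio scale by hand
est <- 0.0008820860
se  <- 0.0003992730
exp(c(OR = est, lower = est - 1.96 * se, upper = est + 1.96 * se))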

Fitting Poisson Regression Model

Now, we shall fit a Poisson regression model to the data set, taking Var_5 as the response variable and all other variables as explanatory variables.

pmodel <- drglm::drglm(Var_5 ~ Var_1 + Var_2 + Var_3 + Var_4 + Var_6,
                       data = dataset, family = "poisson",
                       fitfunction = "speedglm", k = 10)

#Output
print(pmodel)
##                  Estimate Odds Ratio standard error     z value  Pr(>|z|)
## (Intercept)  1.111764e+00  3.0397171   7.844631e-03 141.7229717 0.0000000
## Var_1       -8.530443e-06  0.9999915   5.770457e-05  -0.1478296 0.8824773
## Var_2       -3.972801e-04  0.9996028   2.724303e-04  -1.4582817 0.1447629
## Var_31      -8.719392e-04  0.9991284   1.154426e-03  -0.7553012 0.4500683
## Var_41       1.501374e-04  1.0001501   1.413838e-03   0.1061914 0.9154305
## Var_42       2.088608e-03  1.0020908   1.413911e-03   1.4771853 0.1396260
## Var_6       -1.584737e-04  0.9998415   1.152213e-04  -1.3753856 0.1690119
##                     95% CI
## (Intercept) [ 1.1 , 1.13 ]
## Var_1            [ 0 , 0 ]
## Var_2            [ 0 , 0 ]
## Var_31           [ 0 , 0 ]
## Var_41           [ 0 , 0 ]
## Var_42           [ 0 , 0 ]
## Var_6            [ 0 , 0 ]
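
As a side note on interpretation, exponentiated Poisson coefficients are conventionally read as (incidence) rate ratios; the second column above is simply exp() of the corresponding estimate. For example, for Var_42:

# exp() of the Var_42 coefficient reproduces the printed second-column
# value (approximately 1.0020908)
exp(2.088608e-03)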

Fitting Multinomial Logistic Regression Model

Now, we shall fit a multinomial logistic regression model to the data set, taking Var_4 as the response variable and all other variables as explanatory variables.

mmodel <- drglm::drglm(Var_4 ~ Var_1 + Var_2 + Var_3 + Var_5 + Var_6,
                       data = dataset, family = "multinomial",
                       fitfunction = "multinom", k = 10)
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109859.924516
## final  value 109859.701793 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109858.025066
## final  value 109856.247559 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109857.168027
## final  value 109855.028006 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109857.411318
## final  value 109856.006772 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109857.575481
## final  value 109854.463544 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109856.817080
## final  value 109853.551812 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109858.042179
## final  value 109856.223538 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109856.773011
## final  value 109853.685139 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109858.213223
## final  value 109857.373232 
## converged
## # weights:  21 (12 variable)
## initial  value 109861.228867 
## iter  10 value 109855.898011
## final  value 109854.318130 
## converged
#Output
print(mmodel)
##                Estimate.1    Estimate.2 Odds Ratio.1 Odds Ratio.2
## (Intercept)  2.830473e-02  1.158303e-02    1.0287091    1.0116504
## Var_1       -7.176467e-05  2.793888e-05    0.9999282    1.0000279
## Var_2        1.360669e-03  2.468253e-04    1.0013616    1.0002469
## Var_31      -8.820202e-04  9.447668e-04    0.9991184    1.0009452
## Var_5        1.095235e-04  1.564296e-03    1.0001095    1.0015655
## Var_6       -5.798311e-04 -3.696137e-04    0.9994203    0.9996305
##             standard error.1 standard error.2   z value.1  z value.2 Pr(>|z|).1
## (Intercept)     0.0333081745     0.0333282475  0.84978341  0.3475438  0.3954455
## Var_1           0.0002448119     0.0002449394 -0.29314206  0.1140644  0.7694136
## Var_2           0.0011557794     0.0011563869  1.17727369  0.2134452  0.2390863
## Var_31          0.0048975648     0.0049002165 -0.18009363  0.1928010  0.8570791
## Var_5           0.0012242948     0.0012249480  0.08945842  1.2770303  0.9287176
## Var_6           0.0004888216     0.0004890920 -1.18618131 -0.7557140  0.2355507
##             Pr(>|z|).2 95% lower CI.1 95% lower CI.2 95% upper CI.1
## (Intercept)  0.7281828  -0.0369780883  -0.0537391374   0.0935875565
## Var_1        0.9091867  -0.0005515872  -0.0004521336   0.0004080579
## Var_2        0.8309797  -0.0009046173  -0.0020196513   0.0036259547
## Var_31       0.8471148  -0.0104810708  -0.0086594811   0.0087170304
## Var_5        0.2015915  -0.0022900502  -0.0008365582   0.0025090972
## Var_6        0.4498207  -0.0015379039  -0.0013282165   0.0003782417
##             95% upper CI.2
## (Intercept)   0.0769051921
## Var_1         0.0005080113
## Var_2         0.0025133019
## Var_31        0.0105490147
## Var_5         0.0039651499
## Var_6         0.0005889891
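
In the multinomial fit, the two estimate columns presumably correspond to the two non-reference levels of Var_4 (levels “1” and “2”, with “0” as the reference under the default level ordering), and the odds-ratio columns are their exponentiated coefficients. For example, for Var_5 in the second contrast:

# exp() of the Var_5 estimate for the second contrast reproduces the
# printed "Odds Ratio.2" value (approximately 1.0015655)
exp(1.564296e-03)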

In fitting the four models above, we used fitfunction = “speedglm” as the fitting function because of its shorter computation time. Alternatively, fitfunction = “glm” can be used, which provides the same results as fitfunction = “speedglm”.
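
For instance, the first model could be refitted as follows (output omitted here; nmodel_glm is just an illustrative object name):

nmodel_glm <- drglm::drglm(Var_1 ~ Var_2 + Var_3 + Var_4 + Var_5 + Var_6,
                           data = dataset, family = "gaussian",
                           fitfunction = "glm", k = 10)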

Note that the function ‘drglm’ is designed for fitting GLMs to data sets that can be fitted into memory. To fit a data set that is larger than the available memory, the function ‘big.drglm’ can be used. Users are requested to check the respective vignette.