Multivariate Regression Analysis
From ICE Primer: A Tobacco Control Research Methodology Primer
Multivariate regression analysis essentially involves fitting a straight line to data points represented by a dependent (y) variable and several independent (x) variables. In other words, a linear relationship is (usually) assumed between the dependent variable and one or more independent variables. This methodology reveals not only whether the relationship between y and a specific x variable is statistically significant, controlling for other factors, but also the magnitude of that relationship through a coefficient estimate. A coefficient estimate gives the marginal impact on the dependent variable of a 1 unit increase in the associated x variable, holding the effects of other possible determinants constant.
The benefit of using multivariate regression models is that they yield the specific impact of a change in an explanatory variable on the dependent variable while holding the effects of other potential factors constant.
For example, consider a simple model of the kind typically used in smoking-related studies that rely on smoking data:
$$\mathrm{SMOK}_{ijt} = b_0 + b_1\,\mathrm{TAXES}_{jt} + b_2\,\mathrm{GEN}_{ijt} + b_3\,\mathrm{FRNDS}_{ijt} + b_4\,\mathrm{MOM}_{ijt} + b_5\,\mathrm{DAD}_{ijt} + e_{ijt}$$
The above model is not intended to be comprehensive and is merely for the purpose of illustration. Nonetheless, we include right-hand-side independent variables that should intuitively be, and have been confirmed as, significant determinants of smoking participation.
This type of model is known as a 'levels' specification, in which all variables, dependent and independent, are retained in their original, untransformed state.
The dependent variable SMOKijt is a measure of smoking participation and is in most cases coded as a dichotomous or 'dummy' variable. SMOKijt takes a value of 1 if the survey respondent admits to some smoking in the past and is 0 otherwise. The subscripts ijt describe the structure of the data: each observation refers to a respondent i residing in province j at time t. TAXESjt varies only across provinces (j) and over time (t), and is measured in dollars per pack. GENijt is also a dummy variable which takes a value of 1 for a specific sex and is 0 otherwise; let us assume that GENijt is 1 if the participant is female and 0 if male. FRNDSijt represents peer effects and is treated as a continuous variable giving the number of friends the survey participant reports as smokers. MOMijt and DADijt are dummy variables which take a value of 1 if the respondent's mother and father, respectively, are smokers, and are 0 otherwise. Finally, eijt is an error term which captures the impacts of other determinants of smoking participation that have not been included in the model.
b0 is the intercept of the fitted line. Researchers are usually more interested in the coefficient estimates of b1 to b5, each of which gives the marginal impact of an increase in the associated independent variable or covariate. To illustrate how coefficient estimates should be interpreted, assume that we have used Ordinary Least Squares (OLS) to estimate the above multivariate regression model and obtained the results discussed below. Because the dependent variable is a 1-0 dummy, estimating the model by OLS makes it a linear probability specification. As a result, the coefficient estimate of an explanatory variable is the marginal change in the probability of being a smoker resulting from a 1 unit change in that independent variable. Note that each coefficient carries a hat to designate it as a coefficient estimate.
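As a rough illustration of the estimation step, the sketch below fits the model by OLS on simulated data using only the Python standard library. Everything in it is an assumption made for illustration: the "true" coefficients in TRUE_B are hypothetical (chosen so that every simulated probability stays between 0 and 1), taxes are drawn per respondent rather than per province and year, and the helper names solve_linear_system, ols and simulate are invented here. It is not the estimation routine behind any published study.

```python
import random

# Hypothetical "true" coefficients, picked only so that simulated
# probabilities stay in [0, 1]. Order: b0, TAXES, GEN, FRNDS, MOM, DAD.
TRUE_B = [0.30, -0.05, 0.05, 0.04, 0.10, 0.15]
NAMES = ["b0", "b1 TAXES", "b2 GEN", "b3 FRNDS", "b4 MOM", "b5 DAD"]


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def simulate(n, seed=0):
    """Draw n hypothetical respondents and their 1-0 smoking outcomes."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        x = [
            1.0,                          # constant term for the intercept
            rng.uniform(0.0, 4.0),        # TAXES, dollars per pack
            float(rng.random() < 0.5),    # GEN, 1 = female
            float(rng.randint(0, 5)),     # FRNDS, number of smoking friends
            float(rng.random() < 0.3),    # MOM, 1 = mother smokes
            float(rng.random() < 0.3),    # DAD, 1 = father smokes
        ]
        p = sum(b * xi for b, xi in zip(TRUE_B, x))  # linear probability
        rows.append(x)
        y.append(1.0 if rng.random() < p else 0.0)
    return rows, y


rows, y = simulate(20000)
b_hat = ols(rows, y)
for name, est, true in zip(NAMES, b_hat, TRUE_B):
    print(f"{name:10s} estimate {est: .3f}   (simulated true value {true: .2f})")
```

With a sample this large the estimates should land close to the simulated true values, which is the sense in which each estimate measures the marginal change in the probability of smoking.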
The coefficient estimate of b1 ($\hat{b}_1 = -0.15$) yields the average response of survey respondents to a 1 dollar change in taxes, controlling for other factors. Specifically, it implies that a 1 dollar increase in taxes results in a 0.15 reduction in the probability of smoking. The estimate $\hat{b}_2 = 0.3$ suggests that a survey respondent experiences a 0.3 increase in the probability of being a smoker if female. The estimate $\hat{b}_3 = 0.2$ demonstrates that reporting one more friend as a smoker results in a 0.2 increase in the probability of smoking participation. Finally, the coefficient estimates of DADijt and MOMijt ($\hat{b}_5 = 0.6$ and $\hat{b}_4 = 0.4$, respectively) demonstrate that reporting a father and mother as smokers increases a respondent's likelihood of smoking by 0.6 and 0.4, respectively, adjusting for the other variables in the model.
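The arithmetic behind these interpretations can be sketched as follows. The coefficient values are the ones quoted in the text; the function name change_in_probability and the scenarios are hypothetical. Because no intercept estimate is reported, the sketch computes only changes in the predicted probability, not its level.

```python
# Coefficient estimates quoted in the text. The intercept b0 is not
# reported, so only changes in probability can be computed.
B_HAT = {"TAXES": -0.15, "GEN": 0.3, "FRNDS": 0.2, "MOM": 0.4, "DAD": 0.6}


def change_in_probability(changes):
    """Predicted change in the probability of smoking.

    `changes` maps a variable name to the change in that variable,
    with every variable not listed held constant.
    """
    return sum(B_HAT[name] * delta for name, delta in changes.items())


# Illustrative scenarios (hypothetical, not taken from the text).
tax_rise = change_in_probability({"TAXES": 1})            # 1 dollar tax increase
two_more_friends = change_in_probability({"FRNDS": 2})    # two more smoking friends
tax_and_friend = change_in_probability({"TAXES": 1, "FRNDS": 1})

print(round(tax_rise, 2), round(two_more_friends, 2), round(tax_and_friend, 2))
```

Since the specification is linear, the individual effects simply add up. One consequence worth keeping in mind is that a linear probability model can imply probabilities below 0 or above 1 for some combinations of covariates.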
Of course, coefficient estimates are only part of what must be estimated. It is also important to obtain the standard error of each coefficient estimate. Put simply, standard errors allow a researcher to conduct significance tests to conclude whether a coefficient estimate is so far from a hypothesized value (usually zero) that the difference is not likely due to chance alone. In other words, the test asks: if the true value of the coefficient were actually zero, how likely would it be to obtain a coefficient estimate as far or farther from zero if one were to repeat the multivariate regression analysis many times with different randomly generated samples?
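The thought experiment of repeating the analysis on many random samples can be imitated by simulation. The sketch below is a deliberate simplification and rests on stated assumptions: it uses a single regressor (taxes) rather than the full model, a hypothetical sample size of 200, a hypothetical 30% smoking rate, and a true slope of exactly zero. The names slope and share_as_extreme are invented for this illustration.

```python
import random


def slope(x, y):
    """OLS slope of y on a single regressor x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def share_as_extreme(observed, n=200, reps=1000, seed=1):
    """Share of simulated samples whose slope is at least as far from
    zero as `observed`, when the true slope is zero by construction."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        taxes = [rng.uniform(0.0, 4.0) for _ in range(n)]
        # Smoking is generated independently of taxes: true slope = 0.
        smokes = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n)]
        if abs(slope(taxes, smokes)) >= abs(observed):
            count += 1
    return count / reps


# A small estimate is often matched by chance; a large one almost never.
print("share at least as extreme as 0.03:", share_as_extreme(0.03))
print("share at least as extreme as 0.15:", share_as_extreme(0.15))
```

The share returned plays the role of a simulated two-tailed p value: the smaller it is, the harder it is to attribute the observed estimate to chance alone.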
Traditionally (although not exclusively), a p value is used to denote the statistical significance of a coefficient estimate, using two-tailed tests. A p value equal to or less than 0.01 denotes statistical significance at the 1% level: in no more than 1% of randomly generated samples in which the true value of the coefficient was zero would we obtain a coefficient estimate as far or farther from zero. In such cases we have evidence at the 1% level against the hypothesis that the true value is zero. Similarly, a p value greater than 0.01 but less than or equal to 0.05 implies statistical significance at the 5% level. Finally, the weakest threshold of statistical significance usually considered is the 10% level, which corresponds to a p value equal to or less than 0.10 (and greater than 0.05).
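The thresholds above can be turned into a short calculation. The sketch below assumes a large sample, so that the ratio of an estimate to its standard error can be compared with the standard normal distribution (an exact test would normally use the t distribution). The standard error of 0.05 is a hypothetical value chosen for illustration, since the text reports none, and the function names are likewise invented.

```python
import math


def two_tailed_p(estimate, std_error, null_value=0.0):
    """Two-tailed p value under a standard normal approximation."""
    z = (estimate - null_value) / std_error
    # P(|Z| >= |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def significance_level(p):
    """Map a p value to the conventional levels described in the text."""
    if p <= 0.01:
        return "1%"
    if p <= 0.05:
        return "5%"
    if p <= 0.10:
        return "10%"
    return "not significant"


# Tax coefficient from the text with a HYPOTHETICAL standard error.
p_tax = two_tailed_p(-0.15, 0.05)
print(f"p = {p_tax:.4f}, significant at: {significance_level(p_tax)}")
```

With these assumed numbers the estimate lies three standard errors from zero, which places it within the 1% level.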
