%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Technical Implementation Details}
\documentclass[a4paper,12pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{setspace}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{amsmath}
\usepackage{verbatim}
\usepackage{natbib}

\title{Interpreting Regression Results using Average Marginal Effects with R's \textbf{margins}}
\author{Thomas J. Leeper}

\begin{document}

\maketitle

{\abstract Applied data analysts regularly need to make use of regression analysis to understand descriptive, predictive, and causal patterns in data. While many applications of ordinary least squares yield estimated regression coefficients that are readily interpretable as the predicted change in $y$ due to a unit change in $x$, models that involve multiplicative interactions or other complex terms are subject to less clarity of interpretation. Generalized linear models that involve transformations of this linear predictor into binary, ordinal, count or other discrete outcomes lack such ready interpretation. As such, there has been much debate in the literature about how best to interpret these more complex models (e.g., what quantities of interest to extract? what types of graphical presentations to use?). This article proposes that marginal effects, specifically average marginal effects, provide a unified and intuitive way of describing relationships estimated with regression. To begin, I briefly discuss the challenges of interpreting complex models and review existing views on how to interpret such models, before describing average marginal effects and the somewhat challenging computational task of extracting this quantity of interest from regression results. I conclude with implications for statistical practice and for the design of statistical software.}


\clearpage
\onehalfspacing

<<opts, echo = FALSE, results = "hide">>=
library("knitr")
opts_knit$set(progress = TRUE, verbose = TRUE)
opts_chunk$set(echo = FALSE, results = "hide", fig.height=7, fig.width=11, out.width='\\textwidth')
has_pkgs <- requireNamespace("gapminder", quietly = TRUE) &&
    requireNamespace("stargazer", quietly = TRUE)
opts_chunk$set(eval  = has_pkgs)
@


Regression is a workhorse procedure in modern statistics. In disciplines like economics and political science, hardly any quantitative research manages to escape the use of regression modelling to describe patterns in multivariate data, to assess causal relationships, and to formulate predictions. Ordinary least squares (OLS) regression offers a particularly attractive procedure because of its limited and familiar assumptions and the ease with which it expresses a multivariate relationship as a linear additive relationship between many regressors (i.e., predictors, covariates, or righthand-side variables) and a single outcome variable. The coefficient estimates from an OLS procedure are typically easily interpretable as the expected increase in the outcome due to a unit change in the corresponding regressor.

This ease of interpretation of simple regression models, however, belies a potential for immense analytic and interpretative complexity. The generality of the regression framework means that it is easily generalized to examine more complex relationships, including the specification of non-linear relationships between regressor and outcome, multiplicative interactions between multiple regressors, and transformations via the generalized linear model (GLM) framework.\footnote{Further complexities arise from other expansions of the regression approach, such as interdependent or hierarchically organized observations, instrumental variables methods, and so on.} With this flexibility to specify potentially complex multivariate relationships comes the risk of misinterpretation \citep{BramborClarkGolder2005, BerryGolderMilton2012} and, indeed, frequent miscalculation of quantities of interest \citep{AiNorton2003, NortonWangAi2004}. Coefficient estimates in models that are non-linear or involve interactions lose their direct interpretation as unconditional marginal effects, meaning that interpretation of tabular or graphical presentations requires first understanding the details of the specified model to avoid interpretation errors. Coefficient estimates in GLMs are often not directly interpretable at all. 

For these reasons, and in the interest of making intuitive tabular and visual displays of regression results, there is a growing interest in the display of substantively meaningful quantities of interest that can be drawn from regression estimates \citep{KingTomzWittenberg2000}. This article reviews the literature on substantive interpretation of regression estimates and argues that researchers are often interested in knowing the \textit{marginal effect} of a regressor on an outcome. I propose \textit{average marginal effects} as a particularly useful quantity of interest, discuss a computational approach to calculate marginal effects, and offer the \textbf{margins} package for R \citep{Leeper2016} as a general implementation.

The outline of this text is as follows: section \ref{sec:stats} describes the statistical background of regression estimation and the distinctions between estimated coefficients and estimated marginal effects of righthand-side variables, Section \ref{sec:details} describes the computational implementation of \textbf{margins} used to obtain those quantities of interest, and Section \ref{sec:behavior} compares the results of the package to those produced by Stata's \texttt{margins} command \citep{Stata11, Williams2012}, and various R packages.

\section{Statistical Background}\label{sec:stats}

The quantity of interest typically reported by statistical software estimation commands for regression models is the regression coefficient (along with standard errors thereof, and various goodness-of-fit and summary statistics). Consider, for example, a trivial regression of country population size as a function of GDP per capita, life expectancy, and the interaction of the two. (As should be obvious, this model is not intended to carry any causal interpretation.)

<<model1, results = "asis">>=
library(gapminder)
suppressPackageStartupMessages(library("stargazer"))
gapminder[["pop"]] <- gapminder[["pop"]]/1e6
gapminder[["loggdp"]] <- log(gapminder[["gdpPercap"]])
m1 <- lm(pop ~ loggdp + lifeExp, data = gapminder)
m2 <- lm(pop ~ loggdp * lifeExp, data = gapminder)
stargazer(m1, m2, dep.var.labels = "Population Size",
          title = "Example OLS Regression Output", 
          label = "tab:model1")
@

\noindent This default output makes sense for additive linear models (i.e., ordinary least squares regression) because an estimated coefficient is readily and directly interpretable as the expected change in $y$ given a unit change in $x$, holding all other terms constant (see, for example, the coefficient estimates for model 1). A unit change in life expectancy is associated with a population that is 2,586 larger. This \textit{marginal effect} of life expectancy is constant across observations in the dataset, constant across observed values of other variables, and constant across levels of itself. The ``effect'' is thus \textit{unconditional}.

When model specifications include other kinds of terms (e.g., multiple higher powers of a given variable, log transformations, or interactions between variables), the coefficients on those variables do not and cannot clearly communicate the association between a given righthand-side variable and the outcome because that relationship is parameterized through multiple coefficients. In model 2, for example, the relationship between life expectancy and population is parameterized to account for heterogeneity due to GDP per capita (and vice versa, the relationship between GDP per capita and population is parameterized to be heterogeneous across levels of life expectancy). The estimated coefficient for life expectancy (4.412) only carries the meaning of an unconditional marginal effect when logged GDP per capita (and thus the value of \texttt{loggdp:lifeExp}) is zero, which is never. Looking at a single estimated coefficient and treating it as an unconditional marginal effect leads to numerous problematic interpretations \citep{BramborClarkGolder2005}.

<<persp, fig.cap="Expected Population (millions) by GDP/Capita and Life Expectancy", fig.lab="fig:persp">>=
library("margins")
layout(matrix(1:2, nrow = 1))
persp(m1, zlab = "", zlim = c(-180, 180), main = "Additive Only (Model 1)")
persp(m2, zlab = "", zlim = c(-180, 180), main = "Interaction Model (Model 2)")
@

Visualization of the fitted values from the model makes clear that the two specifications are actually producing fairly similar substantive estimates. Figure \ref{fig:persp} shows the fitted response surface from model 1 (left) and model 2 (right). With the exception of the fitted values at the intersection of very low levels of life expectancy and very high levels of GDP per capita (where data are sparse), the two specifications yield nearly identical substantive interpretations. Higher GDP per capita is associated with lower national population, and higher levels of life expectancy are associated with higher national population. The ambiguous substantive meaning of the results in Table \ref{tab:model1} is due to the fact that the coefficients are not intepretable as \textit{marginal effects}.

But what is a marginal effect exactly? And why is it a useful quantity of interest? Figure \ref{fig:cplot1} shows the calculation of the marginal effect of GDP per capita on population size, based upon model 1 in Table \ref{tab:model1}. Because the model is specified without interactions, the slope of the predicted relationship between $x$ and $y$ is straight linear, so the marginal effect of $x$ is estimated as the slope of this fitted value line, using the usual $\frac{\Delta Y}{\Delta X}$ calculation and can be calculated irrespective of the value of life expectancy. The visibly apparent marginal effect of -26.4 is clearly consistent with the coefficient estimate reported in Table \ref{tab:model1}, model 1.

<<cplot1, results = "show", fig.cap = "Marginal Effect of GDP per Capita on Population Size, as shown by Expected Value Plot">>=
local({
  c1 <- suppressWarnings(cplot(m1, "loggdp", n = 50, 
          xlab = "GDP Per Capita (Logged)",
          se.type = "shade",
          xlim = c(7,10), ylim = c(-30,80)))
  w <- c(which.min((c1[["xvals"]]-8)^2),
         which.min((c1[["xvals"]]-9)^2))
  # dashed
  segments(0, c1[w[2],"yvals"], 9, c1[w[2],"yvals"], lty = 2)
  segments(0, c1[w[1],"yvals"], 8, c1[w[1],"yvals"], lty = 2)
  segments(8, -30, 8, c1[w[1],"yvals"], lty = 2)
  segments(9, -30, 9, c1[w[2],"yvals"], lty = 2)
  
  # red
  segments(8, c1[w[1],"yvals"], 9, c1[w[2],"yvals"], 
           col = "red", lwd = 3)
  segments(8, c1[w[2],"yvals"], 9, c1[w[2],"yvals"], 
           col = "red", lwd = 3)
  segments(8, c1[w[2],"yvals"], 8, c1[w[1],"yvals"], 
           col = "red", lwd = 3)
  text(8, 20, expression(Delta(y) == -26.4), pos = 2, 
       cex = 2, col = "red")
  text(8.5, 0, expression(Delta(x) == 1), cex = 2, col = "red")
  text(9, 40, expression(ME(x) == -26.4), cex = 2, col = "red")
})
@


%<<margins1, echo = TRUE, results = "show">>=
%cbind(summary(margins(m1))[[1]][,2:4],
%      summary(margins(m2))[[1]][,2:4])
%@


\subsection{Generalized Linear Models}

Furthermore, when models involve a non-linear transformation (e.g., generalized linear models such as logit or probit), the coefficients are typically not directly interpretable at all (even when no power terms, interactions, or other complex terms are included). This is because the coefficients express the influence of each separate variable onto the latent, linear scale of the outcome, not the discrete (or probability) scale of the observed outcome \citep{Long1997}. For example, in a logistic regression, the coefficients express the marginal effect of each included variable in terms of the change in log-odds that the outcome equals 1 given a unit change in the independent variable. In order to express the more intuitive change in the predicted probability that the outcome equals 1 requires conditioning on all other included variables (i.e., selecting a set of values for all righthand-side variables) and running that set of values through the link function to convert log-odds to probabilities, thus making the marginal effect (in probability terms) of one variable a function of all other variables included in the model. As such, for both OLS and GLMs, the coefficients estimated for a regression model can often provide unintuitive insight into the statistical relationships between variables and, worse, can frequently fail to communicate those relationships at all (as in GLMs, where the size of coefficients can be completely uninformative about the size of the ``effect'' of a given righthand-side variable).


\subsection{Quantities of Interest}

Several quantities of interest may be derived from any regression model result. The first and most obvious, are the coefficient estimates themselves (as have already been discussed). Others include fitted or predicted values of the outcome generated from the model estimates, first-differences or discrete changes, marginal effects or partial effects.

Among these, fitted (predicted) values communicate the shape and position of the fitted regression surface (or line in a simple bivariate regression) across the possibly multidimensional covariate space. In essence, if we express the regression estimates as a function of $X$, $\hat{y} = \hat{f}(X)$, we can evaluate this function at any levels of the covariates. Predicted values communicate what outcome value would be expected given the patterns observed between covariates and the outcome. We might be interested in at least three different quantities that can be calculated from the regression fit:

\begin{enumerate}
\item Fitted values at representative, or particular, values of $X$
\item Fitted values at the mean of $X$
\item ``Average'' fitted values
\end{enumerate}

The first of these is simply the evaluation of the fitted value function $\hat{f}(X = x)$, for some particular value $x$ or combination of covariate values. This might be used, for example, to describe what population we would expect to see for a country with a particular combination of life expectancy and GDP values. It would simply report to us the value of the population along the surface shown in Figure \ref{fig:persp}.

The second, fitted values at the mean of $X$, is a summary measure that chooses the values of covariates based upon properties of the distribution of $X$ rather than based on some theoretical reason. Unfortunately, such a quantity is not particularly useful to know because the multidimensional mean of $X$ may not be an observed (or even observable) value.

The third measure, average fitted values, calculates the value $\hat{y}$ for every case in the data and averages the resulting fitted values. The interpretation of this quantity is as the average (or typical) outcome we would expect to observe were our model an accurate representation of the data-generating process for the outcome.

% use 3d plot in OLS to show relationship between fitted values and marginal effects w/ interaction

From these predicted values, one could generate a second class of quantities of interest: discrete changes or ``first differences.'' As their name implies, this quantity expresses the \textit{change} in the predicted outcome that occurs with a given change in the value of covariate(s). Common shifts might include the change in a value of a categorical variable (e.g., from ``male'' to ``female'') or a theoretical meaningful change in a numeric variable (e.g., from its minimum to maximum values).

Finally, we may be interested in the \textit{marginal effect} of a given variable: that is, the slope of the regression surface with respect to a given covariate. The marginal effect communicates the rate at which $y$ changes at a given point in covariate space, with respect to one covariate dimension and holding all covariate values constant. This quantity is particularly useful because it is intuitive --- it is simply a slope --- and because it can be calculated from essentially any set of regression estimates.

To calculate marginal effects requires the application of partial derivatives. A marginal effect is, in essence, the slope of multi-dimensional surface with respect to one dimension of that surface.\footnote{In practice, when categorical covariates are used in a model, the ``marginal effect'' label is applied to quantities calculated as discrete changes.} Marginal effects are a particularly useful quantity of interest because, in the case of OLS, they translate the coefficients estimated from any model parameterization back into the quantity that is expressed by coefficients in any unconditional model: namely the marginal contribution of $x$ to the outcome. As with fitted values, we may be interested in one of three different quantities of interest derived from marginal effects:

\begin{itemize}
\item Marginal effects at representative values (MERs)
\item Marginal effects at means (MEMs)
\item Average marginal effects (AMEs)
\end{itemize}

These quantities are analogous to the three fitted values quantities discussed earlier. MERs calculate the marginal effect of each variable at a particular combination of $X$ values that is theoretically interesting. MEMs calculate the marginal effects of each variable at the means of the covariates. AMEs calculate marginal effects at every observed value of $X$ and average across the resulting effect estimates.

AMEs are particularly useful because --- unlike MEMs --- produce a single quantity summary that reflects the full distribution of $X$ rather than an arbitrary prediction. Together with MERs, AMEs have the potential to convey a considerable amount of information about the influence of each covariate on the outcome. Because AMEs average across the variability in the fitted outcomes, they can also capture variability better than MEMs. While MERs provide a means to understand and communicate model estimates at theoretically important combinations of covariate values, AMEs provide a natural summary measure that respects both the distribution of the original data and does not rely on summarizing a substantively unobserved or unobservable $X$ value (as in MEMs).


\subsection{Response to Earlier Critiques}

It is worth noting that \citet{KingTomzWittenberg2000} make a strong case against marginal effects calculations premised largely on two points: (1) the computational challenge of calculating marginal effects, and (2) the common (at that time) omission of statements of uncertainty in the presentation of derivative-based quantities of interest (see, for example, displays in \citealt{Long1997}). While the former is challenging, the next section discusses various computational approaches to derivation including an approximation that is fully general. \textbf{margins} provides a first-of-its-kind implementation of this approach in the R language. This technical innovation makes this concern less problematic than it once was. With respect to the latter critique, the subsequent section then discusses the delta method approach to approximating variances of marginal effects, something \citet{KingTomzWittenberg2000} discuss but dismiss. While \textbf{margins} adopts a delta method approach, the simulation method described by these authors is also implemented as an alternative, alongside a third alternative (bootstrapping). 


\section{Computational Details}\label{sec:details}

This section describes the basic computational features of \textbf{margins}, which offers a fully general approach to estimating marginal effects from an arbitrary modelling result. Specifically, it describes the procedures for calculating marginal effects from the information stored in a model object (e.g., an R object of class \texttt{"lm"} or \texttt{"glm"}) and the procedures for estimating the variances of those marginal effects.

\subsection{Symbolic Derivatives}\label{sec:symbolic}

If we were to calculate marginal effects by hand, we might rightly choose to rely on manual or ``symbolical'' differentiate (i.e., using a set of known derivation rules, derive the partial derivative of a regression equation with respect to its constituent variables). The advantage of this approach is that, with the exception of any human error, produces numerically perfect calculations of the marginal effects. Yet this approach is not particularly easy to implement computationally. It may seem straightforward, but even in relatively simple regression specifications, the formulae for marginal effects quickly become complicated.

\begin{table}
\caption{Marginal Effect Formulae for a Few Simple Regression Equations}\label{tab:formulae}
\begin{center}
\begin{tabular}{llp{0.7in}l} \toprule
& Regression Equation & ME with respect to & ME Formula \\ \midrule
\rule{0pt}{4ex} (1) & \multirow{2}{*}{
$\begin{aligned}
\hat{Y} = &\hspace{1ex} \beta_0 + \beta_1 X_1 + \\ &\hspace{1ex} \beta_2 X_2 + \beta_3 X_1 X_2
\end{aligned}$
} & $X_1$ & $\begin{aligned}\frac{\partial Y}{\partial X_1} = \beta_1 + \beta_3 X_2\end{aligned}$ \\ \cline{3-4}
\rule{0pt}{4ex} & & $X_2$ & $\begin{aligned}\frac{\partial Y}{\partial X_2} = \beta_2 + \beta_3 X_1\end{aligned}$ \\ \midrule
\rule{0pt}{4ex} (2) & \multirow{2}{*}{
$\begin{aligned}
\hat{Y} = &\hspace{1ex} \beta_0 + \beta_1 X_1 + \\ &\hspace{1ex} \beta_2 X^2 + \beta_3 X_2
\end{aligned}$
} & $X_1$ & $\begin{aligned}\frac{\partial Y}{\partial X_1} = \beta_1 + 2\beta_2 X_1\end{aligned}$ \\ \cline{3-4}
\rule{0pt}{4ex} & & $X_2$ & $\begin{aligned}\frac{\partial Y}{\partial X_2} = \beta_3\end{aligned}$ \\ \midrule
\rule{0pt}{4ex} (3) & \multirow{3}{*}{
$\begin{aligned}
\hat{Y} = &\hspace{1ex} \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \\ &\hspace{1ex} \beta_3 X_3 + \beta_4 X_1 X_2 + \\
&\hspace{1ex} \beta_5 X_2 X_3 + \beta_6 X_1 X_3 + \\
    &\hspace{1ex} \beta_7 X_1 X_2 X_3 
\end{aligned}$
} & $X_1$ &
$\begin{aligned} \frac{\partial Y}{\partial X_1} = & \beta_1 + \beta_4 X_2 + \\ & \beta_6 X_3  + \beta_7 X_2 X_3 \end{aligned}$ \\ \cline{3-4}
\rule{0pt}{4ex} & & $X_2$ & 
$\begin{aligned} \frac{\partial Y}{\partial X_2} = & \beta_2 + \beta_4 X_1 + \\ & \beta_5 X_3 + \beta_7 X_1 X_3 \end{aligned}$ \\ \cline{3-4}
\rule{0pt}{4ex} & & $X_3$ & $\begin{aligned} \frac{\partial Y}{\partial X_3} = & \beta_3 + \beta_5 X_2 + \\ & \beta_6 X_1 + \beta_7 X_1 X_2 \end{aligned}$ \\
\bottomrule
\end{tabular}
\end{center}
\end{table}

Consider, for example, the three regression equations shown in the first column of Table \ref{tab:formulae}.  The first includes a simple two-way multiplicative interaction term, the second includes a power term, and the third includes a three-way interaction and its constituent two-way interactions. The formula for the marginal effect with respect to each variable in each equation is shown in the third column. While the equations vary in their complexity, the key intuition to draw is that calculating a marginal effect from an equation of any complexity requires three steps:

\begin{enumerate}
\item Identify which coefficients are attached to terms that include the variable of interest, $x$
\item Using a set of derivative rules, specify the partial derivative of the equation with respect to $x$
\item Implement the partial derivative equation computationally
\end{enumerate}

As should be clear from the formulae in the third column, the first step is fairly easy (though potentially error-prone), the second requires only knowledge of fairly basic derivation, and the third requires simply expressing an additive formula as code. For example, the three marginal effects for Equation (3) in Table \ref{tab:formulae} are simply:

\begin{verbatim}
library("stats")
m <- lm(y ~ x1 * x2 * x3)
cf <- coef(m)[-1] # drop beta_0
me_x1 <- cf[1] + cf[4]*x2 + cf[6]*x3 + cf[7]*x2*x3
me_x2 <- cf[2] + cf[4]*x1 + cf[5]*x3 + cf[7]*x1*x3
me_x3 <- cf[3] + cf[5]*x2 + cf[6]*x1 + cf[7]*x1*x2
\end{verbatim}

\noindent The code necessary to perform these calculations is simple, but it is sensitive to human error. And if the model is modified, the positions of coefficients may change, breaking code. Furthermore, the second analytic step (defining the partial derivatives according to a set of derivation rules) is actually extremely complicated, at least in order to handle any general class of model formulae. The challenges are quickly apparent in the complexity of the code for \texttt{deltaMethod()} from the \textbf{car} package \citep{FoxWeisberg2011}, which attempts to implement symbolic derivation for model formulae. In computational terms, this is especially true when considering the fairly limited information contained in the formula expression used to specify the model (\texttt{y \textasciitilde x1 * x2 * x3}).

If we were able to fully expand that notation and introduce placeholders for coefficients, then R's symbolic derivative function (\texttt{deriv()}) could return accurate specifications of the three marginal effects as \textit{expressions} shown in the \texttt{.grad[, "x1"]}, \texttt{.grad[, "x1"]}, etc.:

<<deriv, echo = TRUE, results = "show">>=
deriv(y ~ b1*x1 + b2*x2 + b3*x3 + b4*x1*x2 + 
          b5*x2*x3 + b6*x1*x3 + b7*x1*x2*x3, 
      c("x1", "x2", "x3"))
@

But, this is prone to the same limitations as the manual calculation of marginal effects, shown above, wherein human error and iterative changes to the model easily produce incorrectly results.

A further challenge arises due to the terse syntax needed to express a complex model in R's modelling language --- the ``formula'' class, derived from GENSTAT \citep{WilkinsonRogers1973}. It is quite easy to express a model for which the mapping of righthand-side variables to coefficients is completely intransparent, such as \texttt{y ~ 0 - (x1 + x2) * (x3 + x4)}. Additional sources of useful generality in the language, such as the expression of ``factors'' (categorical variables via \texttt{factor()}) and on-the-fly variable transformations through \texttt{I()} statements, impede symbolic derivation. These useful features mean it is possible to express a model in formula markup for which no symbolic derivative rule is available, such as Equation (2) in Table \ref{tab:formulae}:

\begin{knitrout}
\definecolor{shadecolor}{rgb}{0.969, 0.969, 0.969}\color{fgcolor}\begin{kframe}
\begin{alltt}
deriv(y ~ b1*x1 + b2*I(x1^2) + b3*x2, c("x1", "x2"))
\end{alltt}
\begin{verbatim}
Error in deriv.formula(y ~ b1 * x1 + b2 * I(x1^2) + b3 * x2, c("x1", "x2")) : 
  Function 'I' is not in the derivatives table
\end{verbatim}
\end{kframe}
\end{knitrout}

\noindent and for which any reasonable workaround would produce an inaccurate set of symbolically arrived at partial derivatives.\footnote{The result of \texttt{deriv(y \textasciitilde b1*x1 + b2*x1\^{}2 + b3*x2, c("x1", "x2"))} is correct, but would require recognizing when \texttt{I()} is and is not meaningful in a formula and modifying it accordingly.} For example, we could define a new variable \texttt{x1squared} equal to $x_1^2$, but the symbolic differentiation rules are blind to the underlying relationship between $x_1$ and the new variable:

<<deriv2, echo = TRUE, results = "show">>=
deriv(y ~ b1*x1 + b2*x1squared + b3*x2, c("x1", "x2"))
@

\noindent The marginal effect arrived at symbolically ignores that the effect of $x_1$ depends on the value of itself because of the additional squared term. By breaking the relationship between $x_1$ and its squared term, $x_1^2$, the derivative rules lead us to a plainly inaccurate result (that the marginal effect of $x_1$ is simply $\beta_1$).

\subsection{Numerical Derivatives}\label{sec:numerical}

What then can be done? A standard answer --- and the answer chosen by Stata's \texttt{margins} command \citep{Stata11} --- is to rely on numerical derivatives (i.e., numeric approximations of the partial derivatives). The R \textbf{margins} package follows this approach.

What is a numerical approximation? Rather than defining a partial derivative as an exact formula using symbolic derivation rules, a numerical derivative approximates the slope of the response function from a model by taking small steps, $h$ in $x$, calculating the $\hat{y}$ at each point, and then applying a simple difference method to define the slope at point $x$:

\begin{equation}
f'(x) = \lim\limits_{h \rightarrow 0} \dfrac{f(x + h) - f(x)}{h}
\end{equation}

While seemingly crude, as $h \lim 0$, the result converges on the true partial derivative and every value of $x$ and requires no knowledge of formula for the partial derivative(s). To provide some intuition, Figure \ref{fig:numderiv1} displays a few numerical derivatives of $f(x) = x^2$ at the point $x = 1$ for various large values of $h$ (2.0, 1.0, 0.5, 0.25). As should be clear, as $h$ decreases, the approximations approach the true slope of the derivative, $f'(x) = 2*x$, which is 2. Inferring the partial derivative across the full domain of $x$ requires repeating this approximation process at every substantively meaningful value of $x$.

<<numderiv1, fig.cap = "Approximation of Derivative via One-Sided Numerical Approach">>=
curve(x^2, from = -5, to = 4, lwd = 2, las = 1, 
     xlim = c(-1, 4), ylim = c(0,10), 
     xaxt = "n", yaxs = "i", yaxs = "i",
     ylab = "", bty = "l")
axis(1, c(0, 1, 1.25, 1.5, 2, 3))
text(-0.25, 1, expression(f(x) == x^2))
mtext(expression(f(x)), 2, 2, las = 1)

# vertical lines
segments(1,0, 1,1, lty = 2)
segments(1.25,0, 1.25,1.5625, lty = 2)
segments(1.5,0, 1.5,2.25, lty = 2)
segments(2,0, 2,4, lty = 2)
segments(3,0, 3,9, lty = 2)

# tangents
segments(0,-1, 5,9, col = "blue", lwd = 2)
segments(1,1, 1.25, 1.5625, col = "red", lwd = 2)
segments(1,1, 1.5, 2.25, col = "red", lwd = 2)
segments(1,1, 2, 4, col = "red", lwd = 2)
segments(1,1, 3, 9, col = "red", lwd = 2)

# text
text(3.1, 9, expression(paste(frac(paste(Delta, y), paste(Delta, x)) == frac(f(3) - f(1), 2), " = 4")), pos = 4 )
text(3.1, 4, expression(paste(frac(paste(Delta, y), paste(Delta, x)) == frac(f(2) - f(1), 1), " = 3")), pos = 4 )
text(3.1, 2.5, expression(paste(frac(paste(Delta, y), paste(Delta, x)) == frac(f(1.5) - f(1), 0.5), " = 2.5")), pos = 4 )
text(3.1, 1.5, expression(paste(frac(paste(Delta, y), paste(Delta, x)) == frac(f(1.25) - f(1), 0.25), " = 2.25")), pos = 4 )
text(3.1, 0.5, "...", pos = 4)
@


At large values of $h$ and even fairly small ones, this ``one-sided'' derivative can be quite inaccurate. A ``two-sided'' or ``symmetric difference'' approach uses points above and below $x$:

\begin{equation}
f'(x) = \lim\limits_{h \rightarrow 0} \dfrac{f(x+h) - f(x-h)}{2h}
\end{equation}

This approach, which is shown in Figure \ref{fig:numderiv2}, will tend to be more accurate. As it so happens, for the function $f(x) = x^2$, this approach calculates $f'(x)$ accurately even for very large values of $h$.

<<numderiv2, fig.cap = "Approximation of Derivative via Two-Sided Numerical Approach">>=
curve(x^2, from = -5, to = 4, lwd = 2, las = 1, 
     xlim = c(-1, 4), ylim = c(0,10), 
     xaxt = "n", yaxs = "i", yaxs = "i",
     ylab = "", bty = "l")
abline(h = 0, col = "gray")
axis(1, c(0, 0.5, 1, 1.5, 2, 3))
text(-0.25, 1, expression(f(x) == x^2))
mtext(expression(f(x)), 2, 2, las = 1)

# vertical lines
segments(1,0, 1,1, lty = 2)
segments(1.5,0, 1.5,2.25, lty = 2)
segments(2,0, 2,4, lty = 2)
segments(3,0, 3,9, lty = 2)

# tangents
segments(0,-1, 5,9, col = "blue", lwd = 2)
segments(0.5,0.25, 1.5,2.25, col = "red", lwd = 2)
segments(0,0, 2,4, col = "red", lwd = 2)
segments(-1,1, 3,9, col = "red", lwd = 2)

# text
text(3.1, 9, expression(paste(frac(paste(Delta, y), paste(Delta, x)) == frac(f(3) - f(-1), 4), " = 2")), pos = 4 )
text(3.1, 4, expression(paste(frac(paste(Delta, y), paste(Delta, x)) == frac(f(2) - f(0), 2), " = 2")), pos = 4 )
text(3.1, 2.5, expression(paste(frac(paste(Delta, y), paste(Delta, x)) == frac(f(1.5) - f(0.5), 1), " = 2")), pos = 4 )
text(3.1, 0.5, "...", pos = 4)
@


Computationally, this two-sided numerical approximation is achieved via R's \texttt{predict()} method, which provides (for a given model result) the fitted value $\hat{Y}$ for every observation in a data frame.\footnote{\textbf{margins} provides a type-consistent wrapper for this, called \texttt{prediction()} that always returns a data frame (rather than a vector or list of fitted values).} In essence, \texttt{predict()} represents the equation $\hat{Y} = f(X)$ as the function of a model object (containing coefficient estimates) and a data frame (defaulting the original data used in estimation, such that \texttt{fitted(model)} and \texttt{predict(model)} are equivalent). This means that \textbf{margins} can produce marginal effects estimates for any model object class that offers a \texttt{predict()} method.

At a low level, \textbf{margins} provides a function, \texttt{dydx()}, which implements the numerical derivative procedure. Taking a model object, \texttt{model}, and data frame, \texttt{data}, as input, the function calculates a value of $h$ that accounts for floating point errors (the internal function \texttt{setstep()},\footnote{A relatively large value of $h$ is used by default: $sqrt(1x10^-7)$. Alternative values can be specified manually.} shown below), generates two data frames \texttt{d0} (representing $f(x-h)$) and \texttt{d1} (representing $f(x+h)$), calls \texttt{predict()} on \texttt{d0} and \texttt{d1}, and calculates the numerical derivative from the resulting fitted $\hat{Y}$ values according to a two-sided numerical approximation:

\begin{verbatim}
d0 <- d1 <- data
...
d0[[variable]] <- d0[[variable]] - setstep(d0[[variable]])
d1[[variable]] <- d1[[variable]] + setstep(d1[[variable]])
...
P0 <- prediction(model = model, data = d0, type = type)[["fitted"]]
P1 <- prediction(model = model, data = d1, type = type)[["fitted"]]
...
out <- (P1 - P0) / (d1[[variable]] - d0[[variable]])
\end{verbatim}

\noindent The result is a vector of marginal effects estimates --- the marginal effect at every observed combination of $X$ in the data. This low-level function is wrapped within a function \texttt{margins()}, which is an S3 generic with methods for various common modelling procedures. \texttt{margins()} calculations marginal effects of \textit{all} variables used in a model, along with their variances, and accommodates the estimation of those effects across arbitrary values of $X$ variables (emulating Stata's \texttt{margins, at()} notation). Calling \texttt{dydx()} or \texttt{marginal\_effects()} provides a means to calculate marginal effects without specifying new data or calculating variances, which can be time-consuming. The output of \texttt{margins()} is a ``data.frame'' with an additional ``margins'' class, that contains the original input data, the fitted values from the model, and the marginal effects estimates for each observation:

<<print, echo=TRUE, results = "show">>=
str(mar2 <- margins(m2))
@

\noindent Attributes contain some additional information about the procedure. By default the AMEs are printed and a \texttt{summary()} method provides a more complete output that is useful for interactive sessions:

<<summary, echo=TRUE, results = "show">>=
mar2
summary(mar2)
@

\noindent For users of Stata's \texttt{margins} command, this output should look very familiar.

A few final points about the computational details of \textbf{margins} are worth noting. First, much numerical differentiation in R is conducted using the \textbf{numDeriv} package. This approach is not used here because \textbf{numDeriv} is not vectorized and is thus quite slow. The issue arises because \textbf{numDeriv} calculates $f(x-h)$ and $f(x+h)$ via for-loop, iterating over observations in a data set. \textbf{margins} provides a significant performance enhancement by using the vectorized procedures shown above.

Second, \textbf{margins} detects the \textit{class} of variables entered into a regression, distinguishing numeric variables from factor, ordered, and logical. For the non-numeric classes, discrete differences rather than partial derivatives are reported as the partial derivative of a discrete variable is undefined. For factors (and ``ordered'') variables, changes are expressed moving from the base category to a particular category (e.g., from male to female, high school to university education, etc.). For logical variables, discrete changes are expressed moving from \texttt{FALSE} to \texttt{TRUE}. The treatment of ordered variables (in essence treating them as factors) differs from R's default behavior.

Third, the \texttt{type} argument accommodates different quantities of interest in non-linear models, such as generalized linear models. For a logistic regression model, for example, we may want to interpret marginal effects (sometimes ``partial effects'') on the scale of the observed outcome, so that we can understand the marginal effects as changes in the predicted probability of the outcome. By default, \textbf{margins} sets \texttt{type = "response"}. This can, however, be modified. For GLMs, this could be set to \texttt{type = "link"} in order to calculate true marginal effects on the scale of the linear predictor.

And, finally, instead of the default \textit{instantaneous} marginal effects, discrete changes can be requested for numeric $X$ variables, by specifying the \texttt{change} argument to the workhorse \texttt{dydx()} function, which allows for expression of changes from observed minimum to maximum, the interquartile range, mean +/- a standard deviation, or any arbitrary step.

\subsection{Variance Approximation with the Delta Method}

Calculating the variances of marginal effects is --- like the calculation of marginal effects themselves --- possible if one can easily express and compute a marginal effect \textit{symbolically}. But just as a general solution to the problem of marginal effect calculation quickly necessitated a numerical approximation, so too does the calculation of variances in that framework. 

The first step is to acknowledge that the marginal effects are nested functions of $X$. Consider, for example, Equation 1 in Table \ref{tab:formulae}, which provides two marginal effects:\footnote{It would, of course, be possible to specify marginal effects with respect to other $X$ variables but because they are not included in the regression equation, the marginal effects of all other variables are, by definition, zero.}

\begin{align}
ME(X_1) = \dfrac{\partial Y}{\partial X_1} = f_1'(X) = g_1(f(X)) \\
ME(X_2) = \dfrac{\partial Y}{\partial X_2} = f_2'(X) = g_2(f(X))
\end{align}

\noindent To calculate the variances of these marginal effects, \textbf{margins} relies on the delta method to provide an approximation (following the lead of Stata). The delta method provides that the variance-covariance matrix of the marginal effect of each variable on $Y$ is given by:

\begin{equation}
Var(ME) = J \times Var(\beta) \times J'
\end{equation}

\noindent where $Var(\beta)$ is the variance-covariance matrix of the regression coefficients, estimated by $Var(\hat{\beta})$ and the Jacobian matrix, $J$, is an $M$-x-$K$ matrix in which each row corresponds to a marginal effect and each column corresponds to a coefficient:

\begin{center}\doublespacing
J = $\begin{bmatrix}
	\frac{\partial g_1}{\partial \beta_0} & 
	\frac{\partial g_1}{\partial \beta_1} & 
	\frac{\partial g_1}{\partial \beta_2} & 
	\dots &
	\frac{\partial g_1}{\partial \beta_K} \\ 

	\frac{\partial g_2}{\partial \beta_0} & 
	\frac{\partial g_2}{\partial \beta_1} & 
	\frac{\partial g_2}{\partial \beta_2} & 
	\dots &
	\frac{\partial g_2}{\partial \beta_K} \\ 

	\dots \\ 

	\frac{\partial g_M}{\partial \beta_0} & 
	\frac{\partial g_M}{\partial \beta_1} & 
	\frac{\partial g_M}{\partial \beta_2} & 
	\dots &
	\frac{\partial g_M}{\partial \beta_K} \\ 
\end{bmatrix}$
\end{center}

Intuition surrounding the Jacobian can be challenging because the entries are partial derivatives of the marginal effects with respect to the $\beta$'s, not the $X$'s. Thus it involves the somewhat unintuitive exercise of treating the coefficients ($\beta$'s) as variables and the original data variables ($X$'s) as constants. Continuing the running example, the Jacobian for the two marginal effects of Equation (1) in Table \ref{tab:formulae} would be a 2-x-4 matrix, where the first column (expressing the partial derivative of each marginal effect with respect to the intercept) is always zero:

\begin{center}\doublespacing
J = $\begin{bmatrix}
	0 & 1 & 0 & X_2 \\
	0 & 0 & 1 & X_1 \\
\end{bmatrix}$
\end{center}

\noindent such that $Var(ME)$ is:

\begin{equation*}
 \begin{bmatrix}
	0 & 1 & 0 & X_2 \\
	0 & 0 & 1 & X_1 \\
 \end{bmatrix}
 \times
 \begin{bmatrix}
 	Var(\beta_0) & Cov(\beta_0, \beta_1) & Cov(\beta_0, \beta_2) & Cov(\beta_0, \beta_3) \\
 	Cov(\beta_0, \beta_1) & Var(\beta_1) & Cov(\beta_1, \beta_2) & Cov(\beta_1, \beta_3) \\
 	Cov(\beta_0, \beta_2) & Cov(\beta_1, \beta_2) & Var(\beta_2) & Cov(\beta_2, \beta_3) \\
 	Cov(\beta_0, \beta_3) & Cov(\beta_1, \beta_3) & Cov(\beta_2, \beta_3) & Var(\beta_1) \\
 \end{bmatrix}
 \times 
 \begin{bmatrix}
 	0 & 0 \\
 	1 & 0 \\
 	0 & 1 \\
 	X_2 & X_1 \\
 \end{bmatrix}
\end{equation*}

\noindent Multiplying this through, we arrive at a 2-x-2 variance-covariance matrix for the marginal effects:

%\begin{center}\doublespacing
%\begin{bmatrix}
%	Var(\beta_1) + X_2 Cov(\beta_1, \beta_3) & 
%	Cov(\beta_1, \beta_2) + X_2 Cov(\beta_2, \beta_3) & 
%	Cov(\beta_1, \beta_3) + X_2 Var(\beta_1) \\
%	
%	Cov(\beta_1, \beta_2) + X_2 Cov(\beta_1, \beta_3) & 
%	Var(\beta_2) + X_1 Cov(\beta_2, \beta_3) & 
%	Cov(\beta_2, \beta_3) + X_1 Var(\beta_1) \\
%\end{bmatrix}
%\end{center}

\begin{center}\doublespacing
$\begin{bmatrix}
	Var(ME(X_1)) & Cov(ME(X_1), ME(X_2)) \\
	Cov(ME(X_1), ME(X_2)) & Var(ME(X_2)) \\
\end{bmatrix}$
\end{center}

\noindent where

\begin{align*}
Var(ME(X_1)) = &\hspace{1ex} Var(\beta_1) + 2 X_2 Cov(\beta_1, \beta_3) + X_2^2 Var(\beta_1))\\
Var(ME(X_2)) = &\hspace{1ex} Var(\beta_2) + 2 X_1 Cov(\beta_2, \beta_3) + X_1^2 Var(\beta_1)\\
Cov(ME(X_1), ME(X_2)) = &\hspace{1ex} Cov(\beta_1, \beta_2) + X_2 Cov(\beta_2, \beta_3) + \\ 
	&\hspace{1ex} X_1 Cov(\beta_1, \beta_3) + X_1 X_2 Var(\beta_1) \\
\end{align*}


To achieve this computationally, \textbf{margins} uses a numerical approximation of the Jacobian. The computational details necessary to express this for any regression model are similar to those for approximating the marginal effects themselves. This is achieved by creating a ``function factory'' that accepts data and a model object as input and returns a function that holds data constant at observed values, but modifies the estimated coefficients according some new input, applying \texttt{predict()} to the original data and modified coefficients.\footnote{This re-expresses $g(x, \beta)$ as a function only of coefficients: $g(\beta)$, holding $x$ constant.} The same numerical differentiation methods as above are then applied to this function, to approximate the Jacobian.\footnote{As a computational note, \textbf{margins} uses the standard variance-covariance matrix returned by any modelling function as the value of $Var(\hat{\beta})$ but also alternative values to be specified via \texttt{vcov} argument to \texttt{margins()}.}


\section{Package Functionality}\label{sec:behavior}

At its core, \textbf{margins} offers one function: an S3 generic \texttt{margins()} that takes a model object as input and returns a list of data frames of class \texttt{"margins"}, which contain the original data, fitted values, standard errors of fitted values, marginal effects for all variables included in the model formula, and variances of those marginal effects. The internals of this function are mostly exported from the package to allow users to calculate just the marginal effects without the other data (using \texttt{marginal\_effects()}, to calculate the marginal effect of just one variable (using \texttt{dydx()}), and to plot and summarize the model and marginal effects in various ways (using \texttt{cplot()}, and \texttt{plot()}, \texttt{persp()}, and \texttt{image()} methods). Table \ref{tab:functions} provides a full list of exported functions and a brief summary of their behavior.

\begin{table}
\begin{tabular}{llp{3in}} %\toprule
\textbf{Type} & \textbf{Function} & \textbf{Behavior} \\ \midrule
\multirow{4}{*}{Core} & \texttt{dydx()} & Calculate marginal effect of one, named variable \\ \cline{2-3}
 & \texttt{marginal\_effects()} & Calculate marginal effects of all variables in a model \\ \cline{2-3}
 & \texttt{margins()} & An S3 generic to calculate marginal effects for all variables and their variances \\
\midrule

\multirow{4}{*}{Visualization} & \texttt{plot.margins()} & An analogue to Stata's \texttt{marginsplot} command that plots calculated marginal effects. \\ \cline{2-3}
 & \texttt{cplot()} & An S3 generic that plots \textit{conditional} fitted values or marginal effects across a named covariate \\ \cline{2-3}
 & \texttt{persp.lm()}, etc. & S3 methods for the \texttt{persp()} generic that provide three-dimensional representations akin to \texttt{cplot()} but for two covariates \\ \cline{2-3}
 & \texttt{image.lm()}, etc. & S3 methods for the \texttt{image()} generic that produce flat representations of the \texttt{persp()} plots \\ 
\midrule

\multirow{1}{*}{Utilities} & \texttt{build\_margins()} & The workhorse function underlying \texttt{margins()} that assembles the response \texttt{"margins"} object for one data frame input. \\
\bottomrule
\end{tabular}
\caption{Functions in the \textbf{margins} Package}\label{tab:functions}
\end{table}

At present, \texttt{margins()} methods exist for objects of class \texttt{"lm"}, \texttt{"glm"}, and \texttt{"loess"}. The \texttt{margins.default()} method may work for other object classes, but is untested. The use of the package is meant to be extremely straight forward and to be consistent across model classes. To use it, one needs only specify estimate a model using, for example, \texttt{glm()}, and then pass the resulting object to \texttt{margins()} to obtain the marginal effects estimates as \texttt{"margins"} object. For interactive use, the \texttt{summary.margins()} method will be useful:

<<margins1, echo = TRUE, results = "show">>=
library("datasets")
m <- lm(mpg ~ wt + am + factor(cyl), data = mtcars)
margins(m)
summary(margins(m))
@

\noindent By default, \texttt{print.margins()} shows the AMEs for each variable. Factor variables are handled automatically by expressing their influence as discrete changes in the outcome moving from the base category to each other category. The \texttt{summary.margins()} output shows the AMEs, along with corresponding standard errors, $z$-statistics, $p$-values, and a 95\% confidence interval.\footnote{The confidence level can be changed by setting \texttt{level = beta}, such as \texttt{summary(margins(m), level = 0.67)} to obtain a 67\% CI.}

To obtain average MERs (an AME with the value of one or more covariates replaced by a theoretically interesting value), users can specify the \texttt{at} option to \texttt{margins()}. In effect, this uses \texttt{build\_datalist()} to create a list of data frames to be used as input to \texttt{marginal\_effects()} and calculates the marginal effects of every variable separately for each dataset. This can be useful to understand, for example, what the marginal effect of a vehicle's weight is on its fuel efficiency, separately for manual and automatic vehicles, when the relationship between weight and fuel economy is possibly non-linear:

<<margins2, echo = TRUE, results = "show">>=
m <- lm(mpg ~ wt * am + I(wt^2) * am, data = mtcars)
summary(margins(m, at = list(am = 0:1)))
@

The plotting functionality provided by \textbf{margins} can be particularly useful for understanding models and the contribution of each covariate to the outcome. Consider for example, a model involving a simple interaction between two covariates. To understand the effect of each, we can calculate AMEs:

<<margins3a, echo = TRUE, results = "show">>=
m <- lm(mpg ~ wt * I(wt^2) * hp * I(hp^2), data = mtcars)
margins(m)
@

\noindent But we can also display the results visually to better understand the results. For example, \texttt{cplot()} will show the predicted value of the outcome across levels of covariates:

<<margins3b, echo = TRUE, fig.show = "hold">>=
cplot(m, "wt")
cplot(m, "hp")
@

\noindent The \texttt{persp()} method will show the same but as a three-dimensional surface and the \texttt{image()} method will present that information as a two-dimensional ``heatmap''-style format:

<<margins3c, echo = TRUE, fig.show = "hold">>=
persp(m, "wt", "hp")
image(m, "wt", "hp")
@

\noindent These plots all show fitted values, thereby indirectly communicating marginal effects. To show marginal effects themselves, one can use the \texttt{plot.margins()} method to simply show the average marginal effects visually:

<<margins3d, echo = TRUE, results = "show">>=
m <- lm(mpg ~ wt + am + factor(cyl), data = mtcars)
plot(margins(m))
@

And the \texttt{cplot()}, \texttt{persp()}, and \texttt{image()} methods can be modified to request marginal effects rather than fitted values by setting the \texttt{what = "effect"} argument:

<<margins3e, echo = TRUE>>=
m <- lm(mpg ~ wt * I(wt^2) * hp * I(hp^2), data = mtcars)
cplot(m, "wt", what = "effect")
persp(m, "wt", "hp", what = "effect", phi = 30)
image(m, "wt", "hp", what = "effect")
@

All of the plotting functions in the package return data structures, which can also be passed to other plotting tools (such as \textbf{ggplot2}). The \textbf{LinRegInteractive} package provide an alternative set of visualization methods but calculates marginal effects in a quite different way.

\subsection{Comparison with Other R Packages}

Several existing R packages attempt to estimate quantities of interest, such as marginal effects, but all have limitations. For example, the \textbf{car} package noted above is constrained by its use of symbolic derivatives to the calculation of only particular combinations of effects that can be clearly expressed symbolically. This approach appears to also be used by the \textbf{effects} package, which provides various functions for describing fitted values and producing plots from those quantities. (Much of that plotting functionality can be achieved through \textbf{margin}'s \texttt{cplot()} function.) The package is somewhat intransparent in its approach and has an extensive list of explicit limitations about functionality. \textbf{lsmeans} can be used to calculate predictive means (or ``predictive margins'') from linear, generalized linear, and mixed models, as well as calculate ``first-difference'' style contrasts between tehse predictions (something margins achieves using \texttt{dydx()} using alternative \texttt{change} arguments).

Other packages show some promise but seem to fail in common situations. For example, the \textbf{mfx} package \citep{Fernihough2014} provides several functions for calculating marginal effects from common GLMs (logit, probit, poisson, negative-binomial, beta) but does not properly account for the interdependence between terms included in multiplicative interactions. For example:

\begin{verbatim}
> library("mfx")
> library("datasets")
> mfx1 <- glm(vs ~ disp * drat, data = mtcars, family = binomial)
> logitmfx(mfx1, mtcars)
Call:
logitmfx(formula = x, data = mtcars)

Marginal Effects:
                dF/dx   Std. Err.       z  P>|z|
disp      -0.00061817  0.00721983 -0.0856 0.9318
drat      -0.24921775  0.37514481 -0.6643 0.5065
disp:drat -0.00158312  0.00233612 -0.6777 0.4980
\end{verbatim}

\noindent The \textbf{erer} package \citep{Sun2016} suffers the same error:

\begin{verbatim}
> library("erer")
> maBina(x)
> x <- glm(vs ~ disp * drat, data = mtcars, family = binomial, x = TRUE)
> maBina(x)
            effect error t.value p.value
(Intercept)  2.060 1.571   1.311   0.200
disp        -0.001 0.007  -0.086   0.932
drat        -0.249 0.375  -0.664   0.512
disp:drat   -0.002 0.002  -0.678   0.504
\end{verbatim}

\noindent The \textbf{DAMisc} package \citep{Armstrong2016} specifically offers a function, \texttt{intEff()} for handling interactions. It, unfortunately, returns an ambiguous value of the ``interaction effect'' for each observation in a dataset that does not appear to correspond to the marginal effect for either term in the model:

\begin{verbatim}
> library("DAMisc")
> head(intEff(x, c("disp", "drat"), data = mtcars))
                        int_eff        linear        phat  se_int_eff      zstat
Mazda RX4         -0.0034080006 -3.151623e-03 0.502369027 0.018382454 -0.1853942
Mazda RX4 Wag     -0.0034080006 -3.151623e-03 0.502369027 0.018382454 -0.1853942
Datsun 710        -0.0079522078 -5.750136e-04 0.952093495 0.004907381 -1.6204588
Hornet 4 Drive     0.0183101233 -2.482300e-03 0.269570094 0.021788290  0.8403653
Hornet Sportabout  0.0007426565 -3.380259e-05 0.002688532 0.001359246  0.5463739
Valiant           -0.0167213771 -1.333991e-03 0.879716441 0.008581226 -1.9486000
\end{verbatim}

\noindent Additionally, the \textbf{interflex}, \textbf{interplot} \citep{SoltHu2015}, and \textbf{plotMElm} \citep{Gandrud2016} packages offer visualization functionality specifically for displaying results of models that include multiplicative interaction terms but do not provide general interfaces for calculating marginal effects from arbitrary models. The latter package only handles linear models. \textbf{modmarg} is a newer package that appears to provide comparable functionality to margins \citep{Wang2017}. \textbf{interactionTest} \citep{EsareySumner2015} offers a useful \texttt{fdrInteraction()} for calculating a $t$-statistic that limits the false discovery rate for a marginal effect based on an interaction, but only works in the case of two-way interaction terms. Relatedly, \textbf{visreg} \citep{BrehenyBurchett2016} provides a \textbf{lattice}-based visualization function, \texttt{visreg()} to visualize predicted values from models in sometimes complex ways. \textbf{margins} attempts to emulate some of this in the \texttt{cplot()} function. \textbf{condvis} and \textbf{LinRegInteractive} offer interactive visualization functionality for examining ``conditional expectation'' plots and surfaces, akin to both \texttt{cplot()} and the \texttt{persp()} methods offered in \textbf{margins}.

\subsection{Comparison with Stata}

As an aside, Stata relies on symbolic derivatives for calculating marginal effects for OLS models (and numerical derivatives in all other cases). This is made possibly by the fact that Stata offers a much less expressive modelling language that broadly allows only two variable times: continuous (denoted by a \texttt{c.} prefix) and factor (denoted by a \texttt{i.} prefix), enables simple interactions either explicitly (as \texttt{x1\#x2}, akin to R's \texttt{~ x1:x2}) or implicitly (as \texttt{x1\#\#x2}, akin to R's \texttt{~ x1*x2}), and disallows formula-based transformations (R's \texttt{I()} notation). Thus the calculation of marginal effects is simplified by significantly constraining the number of possible models that can be specified and thus the necessary sophistication of a symbolic derivation procedure.

The result is a syntax that can quickly and easily be used to calculate marginal effects for essentially any regression model:

\begin{verbatim}
. quietly reg pop gdpPercap
. margins, dydx(*)

Average marginal effects                        Number of obs     =      1,704
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/dx w.r.t. : gdpPercap

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   gdpPercap |  -275.6895   260.9549    -1.06   0.291    -787.5156    236.1366
------------------------------------------------------------------------------
\end{verbatim}

A downside of Stata's limited expressiveness becomes obvious when one considers a variable that is transformed for some specific modelling purpose but for which substantive interpretations are desired on the original scale. For example, one may log transform a covariate but desire to know the marginal effect of that variable (rather than the marginal effect of its logged form). Stata provides no easy means to achieve this; it only allows the pre-calculation of a new variable, breaking the program's ability to recognize a relationship between a variable and its transformed form. \textbf{margins}, by contrast, makes this simple. Consider, for example, the two regression models show in Table \ref{tab:model3}, which are identical except that model (1) is calculated from a pre-transformed value of GDP per capita and model (2) is calculated from a transformation expressed via \texttt{I()} notation in a regression formula.\footnote{The models are: (1) \texttt{(pop \textasciitilde loggdp} and (2) \texttt{pop \textasciitilde I(log(gdpPercap))} .}

<<log1, results = "asis">>=
m3a <- lm(pop ~ loggdp, data = gapminder)
m3b <- lm(pop ~ I(log(gdpPercap)), data = gapminder)
stargazer(m3a, m3b, dep.var.labels = "Population Size",
          title = "Example of Log Transformation", 
          label = "tab:model3")
@

What effect does GDP per capita have on the outcome? Calculating marginal effects reveals the answer. Because we have specified the transformation using \texttt{I()} notation in model (2), \textbf{margins} can quickly identify the contribution of the original variable:

<<log2, echo = TRUE, results = "show">>=
summary(margins(m3b))
@

\noindent And this result will differ from that for model (2) except in the resulting z-statistic and p-value which --- as should be the case --- are identical in the two marginal effect calculations:

<<log3, echo = TRUE, results = "show">>=
summary(margins(m3a))
@

\noindent Despite some of the underlying limitations, Stata's \texttt{margins} command is incredibly user friendly and easy-to-use. Its output is also clean and intuitive. As a result, the behavior of \textbf{margins} try (as closely as possible) to mimic the behavior. It does not attempt, however, to provide: (1) an easy way of calculating MEMs (as Stata does with the \texttt{, atmeans} option), (2) calculating of predicted values (since R already provides this via \texttt{predict()}), or (3) cover the full class of model types that Stata currently supports. One other key advantage of the R implementation is that because it relies on a fully functional programming paradigm, marginal effects can easily be calculated for multiple objects, whereas Stata's approach can only calculate effects for the \textit{previous} modelling command using stored results.

%\section{Applications}


\section{Conclusion}

Average marginal effects offer an intuitive technique for interpreting regression estimates from a wide class of linear and generalized linear models. While Stata has offered a simple and general computational interface for extracting these quantities of interest from regression models for many years, the approach has not been widely available in other statistical software. The \textbf{margins} port to R makes this functionality much more widely available. By describing the computational approach used by both packages, this article offers users of other languages guidelines for how to apply the approach elsewhere while offering applied data analysts a straightforward explanation of the marginal effect quantity and its derivation.

At present, \textbf{margins} estimates quantities of interest for a wide array of model formulae used in least squares regression and many common generalized linear models. Stata's \texttt{margins} and Zelig/Clarify produce quantities of interest for a wider array of model types. Extension of \textbf{margins} to other model types is planned for the future. The creation of the core \texttt{margins} function as an S3 generic means that the package is easily extensible to other model types (e.g., those introduced in other user-created packages). Development of \textbf{margins} is handled on GitHub, allowing for easy contribution of bug fixes, enhancements, and additional model-specific methods. By publishing \textbf{margins} as free and open-source software (FOSS), it should be straightforward for users of other languages (Python, julia, etc.) to implement similar functionality. Indeed, the port of closed source statistical software to open source represents an underappreciated by critical step in making FOSS data analysis more accessible to those used to working with closed source products.

For applied data analysis, the most important feature of \textbf{margins} is its intuitive use and the near-direct translation of Stata code into R. For those used to Stata's \texttt{margins} command, R's \textbf{margins} package should be a painless transition. For R users not accustomed to calculating marginal effects, \textbf{margins} should also offer a straightforward and tidy way of calculating predicted values and marginal effects, and displaying the results thereof.


\singlespacing
\bibliographystyle{plain}
\bibliography{references}
\clearpage


\end{document}