I believe there are a few key things to consider when we have missing data while 1) estimating a model to be used for prediction and 2) using that model to predict new cases, which may also have missing values for key predictor variables. For purposes of this discussion, I'm thinking specifically about situations where one intends to use maximum likelihood (ML) or full information maximum likelihood (FIML) to estimate the parameters that define the model. However, if you are using some other method, you still must consider how to handle missing values in both the model training and scoring exercises. There may also be distinctions to consider between a purely predictive/machine learning application and causal inference.

But for this conversation, I will start with discussion around maximum likelihood estimation.

**Standard Maximum Likelihood:**

Maximize $L = \prod f(y, x_1, \ldots, x_k; \beta)$

With standard ML, the likelihood function is optimized, providing the values for β which define our regression model (like $Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k + e$).

As is the case in many modeling scenarios, with standard MLE only complete cases are used to estimate the model. That is, for each 'row' or individual case, all values of 'x' and 'y' must be defined. If a single explanatory variable 'x' or the dependent variable 'y' has a missing value, then that individual/case/row is excluded from the data. This is referred to as listwise deletion. In many scenarios this can be undesirable because, for one thing, you are reducing the amount of information used to estimate your model. Paul Allison has a very informative discussion of this in a recent post at Statistical Horizons.
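To make listwise deletion concrete, here is a minimal sketch in Python. It assumes simulated data, normally distributed errors (so that ordinary least squares coincides with the ML estimate of β), and a hypothetical helper named `fit_complete_cases`; none of these come from a particular software package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data (an assumption for illustration): y = 1 + 2*x1 - 1*x2 + e
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Set roughly 30% of the x1 values to missing, completely at random
X_obs = X.copy()
missing = rng.random(n) < 0.3
X_obs[missing, 0] = np.nan


def fit_complete_cases(X, y):
    """Drop every row with any missing value, then fit by least squares."""
    data = np.column_stack([y, X])
    keep = ~np.isnan(data).any(axis=1)  # True only for complete rows
    design = np.column_stack([np.ones(keep.sum()), X[keep]])
    beta, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return beta, int(keep.sum())


beta_hat, n_used = fit_complete_cases(X_obs, y)
print(f"{n_used} of {n} rows used; beta estimates: {np.round(beta_hat, 2)}")
```

Every row with a missing x1 is discarded, including its perfectly good values for y and x2. That discarded information is what FIML tries to recover.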

**Full Information Maximum Likelihood (FIML):**

Maximize $L = \prod f(y, x_1, \ldots, x_k; \beta) \prod f(y, x_3, \ldots, x_k; \beta)$

Full information maximum likelihood is an estimation strategy that allows us to get parameter estimates even in the presence of missing data. The overall likelihood is the product of the likelihoods specified for all observations. If there are m observations with no missing values but n observations missing $x_1$ and $x_2$, we account for that by specifying the overall likelihood function as the product of two terms, i.e. the likelihood function is specified as a product of likelihoods for both complete and incomplete cases. In the example above, the second term in the product depicts the cases where individual 'i' has missing values for the first two variables, $x_1$ and $x_2$. The first term represents the likelihood for all complete cases. The overall likelihood is then optimized, providing the values for β which define our regression model.
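The product-of-likelihoods idea can be sketched in Python. This is only an illustration under assumptions that are mine and not part of the discussion above: (y, x1, x2) are treated as jointly multivariate normal, the data are simulated, x1 is missing completely at random, and the function names (`neg_loglik`, `fit_fiml`, `regression_from_joint`) are hypothetical. Each missingness pattern contributes the density of only the variables observed for those rows.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal


def unpack(theta, p):
    """Split the parameter vector into a mean vector and a covariance matrix."""
    mu = theta[:p]
    chol = np.zeros((p, p))
    chol[np.tril_indices(p)] = theta[p:]
    sigma = chol @ chol.T + 1e-8 * np.eye(p)  # small jitter keeps sigma invertible
    return mu, sigma


def neg_loglik(theta, data):
    """Negative observed-data log-likelihood: one term per missingness pattern."""
    p = data.shape[1]
    mu, sigma = unpack(theta, p)
    observed = ~np.isnan(data)
    total = 0.0
    for pattern in np.unique(observed.astype(int), axis=0):
        idx = np.flatnonzero(pattern)  # columns observed under this pattern
        if idx.size == 0:
            continue
        rows = (observed == pattern.astype(bool)).all(axis=1)
        block = data[rows][:, idx]
        total += np.sum(multivariate_normal.logpdf(
            block, mean=mu[idx], cov=sigma[np.ix_(idx, idx)], allow_singular=True))
    return -total


def fit_fiml(data):
    """Maximize the observed-data likelihood over the mean and covariance."""
    p = data.shape[1]
    chol0 = np.diag(np.nanstd(data, axis=0))
    theta0 = np.concatenate([np.nanmean(data, axis=0), chol0[np.tril_indices(p)]])
    result = minimize(neg_loglik, theta0, args=(data,), method="BFGS")
    return unpack(result.x, p)


def regression_from_joint(mu, sigma):
    """Regression of column 0 (y) on the remaining columns, implied by mu and sigma."""
    slopes = np.linalg.solve(sigma[1:, 1:], sigma[1:, 0])
    intercept = mu[0] - slopes @ mu[1:]
    return intercept, slopes


rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.5, size=n)
data = np.column_stack([y, x])
data[rng.random(n) < 0.3, 1] = np.nan  # x1 missing for roughly 30% of rows

mu_hat, sigma_hat = fit_fiml(data)
intercept, slopes = regression_from_joint(mu_hat, sigma_hat)
print("intercept:", round(float(intercept), 2), "slopes:", np.round(slopes, 2))
```

Rows with x1 missing still contribute through the joint density of y and x2, which mirrors the second product term in the likelihood. No value of x1 is ever filled in.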
Both ML and FIML are methods for estimating parameters; they are not imputation procedures per se. As Karen Grace-Martin (The Analysis Factor) aptly puts it:

*“This method does not impute any data, but rather uses each case's available data to compute maximum likelihood estimates.”*

**Predictive Modeling Applications**

So if we have missing data, we could use FIML to obtain parameter estimates for a model, but what if we actually want to predict outcomes ‘y’ for some new data set? By assumption, if we are trying to predict ‘y’, we don’t have values for y in that data set. We will attempt to take the model or parameter estimates we got from FIML and predict Y based on the estimated values of our β’s and the observed x’s. But what if the new data have missing x’s? Can’t we just use FIML to get our model and predictions? No. First, we have already derived our model via FIML using our original or *training data*. Again, FIML is a model or parameter estimation procedure. To apply FIML in our new data set would imply 2 things:

1) We want to estimate a new model.

2) We have observed values for what we are trying to predict, ‘y’, which by assumption we don’t have! So there is no way to properly specify the likelihood to even implement FIML.

But we don't want to estimate a new model in the first place. If we want to make new predictions based on our original model estimated using maximum likelihood, we have to use some type of actual imputation procedure to derive values for the missing x’s. This may often be the case when scoring or targeting customers for special promotions, where the original scoring model used income data but income has to be imputed for some customers based on relationships with other available data sources, such as income for the zip code tabulation area.
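The scoring step might look something like the following Python sketch. Everything here is an assumption made for illustration: the data are simulated, the variable names (`income`, `zip_income`) and the `score` function are hypothetical, least squares stands in for whatever ML or FIML fit produced the original β's, and single regression imputation is just one of several imputation procedures one could use.

```python
import numpy as np


def fit_ols(design, target):
    """Least-squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


rng = np.random.default_rng(1)
n = 500

# Simulated training data: customer income tracks zip-area income
zip_income = rng.normal(50.0, 10.0, n)
income = 5.0 + 0.9 * zip_income + rng.normal(0.0, 3.0, n)
y = 1.0 + 0.2 * income + 0.05 * zip_income + rng.normal(0.0, 1.0, n)

# Scoring model estimated once, on the training data only
beta = fit_ols(np.column_stack([np.ones(n), income, zip_income]), y)

# Imputation model, also estimated on the training data: income ~ zip_income
gamma = fit_ols(np.column_stack([np.ones(n), zip_income]), income)


def score(new_income, new_zip_income, beta, gamma):
    """Impute missing income from zip-area income, then apply the fixed betas."""
    new_income = np.asarray(new_income, dtype=float).copy()
    new_zip_income = np.asarray(new_zip_income, dtype=float)
    was_missing = np.isnan(new_income)
    new_income[was_missing] = gamma[0] + gamma[1] * new_zip_income[was_missing]
    design = np.column_stack([np.ones(len(new_income)), new_income, new_zip_income])
    return design @ beta, was_missing


# New customers to score; the second one has no income on file
preds, was_imputed = score([60.0, np.nan, 45.0], [55.0, 52.0, 40.0], beta, gamma)
print(np.round(preds, 2), was_imputed)
```

The β's are never re-estimated at scoring time. Only the missing predictor is filled in, using a relationship learned from the training data.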

**References:**

Handling Missing Data by Maximum Likelihood. Paul D. Allison, Statistical Horizons, Haverford, PA, USA. SAS Global Forum Paper 312-2012.

Two Recommended Solutions for Missing Data: Multiple Imputation and Maximum Likelihood. Karen Grace-Martin. The Analysis Factor: http://www.theanalysisfactor.com/missing-data-two-recommended-solutions/ Accessed 8/14/14.

Listwise Deletion: It's Not Evil. Paul Allison, Statistical Horizons. June 13, 2014. http://www.statisticalhorizons.com/listwise-deletion-its-not-evil
