Tuesday, December 29, 2015

Nonparametric Approaches to Multiple Comparisons

I have recently started reading "Applied Nonparametric Econometrics", and was thinking, when was the last time I even worked with basic non-parametric statistics?

For instance, in the courses I teach, I don't cover this, but some of the texts I reference cover basics like the Mann-Whitney-Wilcoxon (MWW) test (which can be thought of as a non-parametric equivalent to a two-sample independent t-test) or the Kruskal-Wallis test (a non-parametric analogue to analysis of variance). These tests are often useful with highly skewed or non-normal data, ordinal or ranked data, or data from problematic or unknown distributions. I briefly reviewed some implementations in SAS, and focused in particular on the Kruskal-Wallis test, which has the following general null and alternative hypotheses:

Ho: All Populations Are Equal
Ha: Not All Populations Are Equal

If we reject Ho, we might conclude that there is a difference among populations, with one population or another providing a larger proportion of larger or smaller values for the variable of interest. If we could assume that the populations were similar in shape and symmetry, this *might* be interpreted as a test of differences in medians, but in general it is a test of differences in distributions, and specifically of ranks, similar to the MWW test. But if we do reject Ho, what next? In an analysis of variance context, if we reject the overall F-test on multiple means, we can follow up with pairwise comparisons to determine which means differ. But at least in older versions of SAS, there is no straightforward way to do this kind of analysis in the non-parametric context. However, in SAS Note 22620, one recommendation is to rank-transform the data and use the normal-theory methods in PROC GLM (Iman, 1982). See also Conover, W. J. & Iman, R. L. (1981), referenced below.

A good example of the application of GLM on ranked data can be found here: http://people.stat.sc.edu/Hitchcock/soil_KW_sasexample705.txt 

and a general overview of some non-parametric applications in SAS along these lines here.
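As a rough sketch of this rank-transform idea in R (my own illustration using made-up skewed data, not the SAS examples above), you can rank the response and then apply the usual normal-theory ANOVA and pairwise comparisons to the ranks:

# sketch of the rank-transform approach (Conover & Iman, 1981; Iman, 1982)
set.seed(1)
grp <- factor(rep(c("A", "B", "C"), each = 10))
y <- c(rexp(10, rate = 1), rexp(10, rate = 0.5), rexp(10, rate = 0.25)) # skewed data

kruskal.test(y ~ grp) # overall Kruskal-Wallis test

r <- rank(y) # rank-transform the response
summary(aov(r ~ grp)) # normal-theory ANOVA on the ranks
pairwise.t.test(r, grp, p.adjust.method = "bonferroni") # ad hoc pairwise follow-up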

You can also find a SAS macro with code and examples for post hoc tests here: http://www.alanelliott.com/kw/

I at first thought this was the macro by Juneau (in the references below and mentioned in the SAS note above), but it is something different; see the Elliott and Hynan reference below. From the abstract:

"The Kruskal-Wallis (KW) nonparametric analysis of variance is often used instead of a standard one-way ANOVA when data are from a suspected non-normal population. The KW omnibus procedure tests for some differences between groups, but provides no specific post hoc pair wise comparisons. This paper provides a SAS(®) macro implementation of a multiple comparison test based on significant Kruskal-Wallis results from the SAS NPAR1WAY procedure. The implementation is designed for up to 20 groups at a user-specified alpha significance level. A Monte-Carlo simulation compared this nonparametric procedure to commonly used parametric multiple comparison tests."

I found an application referencing this implementation here, if interested.

According to the SAS note referenced above, SAS/STAT 12.1 will include versions of some non-parametric post hoc tests. I'm also aware of several R packages that can do this, such as the dunn.test package.

I compared results from Elliott and Hynan's example code (example 1) and data to those from the ad hoc GLM on ranks following Hitchcock, and got similar results. I also got similar results using dunn.test in R:

# use same data as in www.alanelliott.com/kw
race <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
bmi <- c(32,30.1,27.6,26.2,28.2,26.4,23.1,23.5,24.6,24.3,24.9,25.3,23.8,22.1,23.4)

library(dunn.test) # load package

# implement test with adjustments for multiple comparisons
dunn.test(bmi, race, kw = TRUE, method = "bonferroni")
 
References:

Palomares-Rius JE, Castillo P, Montes-Borrego M, Navas-Cortés JA, Landa BB (2015) Soil Properties and Olive Cultivar Determine the Structure and Diversity of Plant-Parasitic Nematode Communities Infesting Olive Orchards Soils in Southern Spain. PLoS ONE 10(1): e0116890. doi:10.1371/journal.pone.0116890

Dunn, O.J. (1964). "Multiple comparisons using rank sums." Technometrics 6: 241-252.

Conover, W. J. & Iman, R. L. (1981). "Rank transformations as a bridge between parametric and nonparametric statistics." American Statistician 35 (3): 124-129. doi:10.2307/2683975

Elliott AC, Hynan LS. (2011). "A SAS macro implementation of a multiple comparison post hoc test for a Kruskal-Wallis analysis." Computer Methods and Programs in Biomedicine, 102: 75-80.

Iman, R.L. (1982), "Some Aspects of the Rank Transform in Analysis of Variance Problems," Proceedings of the Seventh Annual SAS Users Group International Conference, 7, 676-680.

Juneau, P. (2004), "Simultaneous Nonparametric Inference in a One-Way Layout Using the SAS System," Proceedings of the PharmaSUG 2004 Annual Conference, Paper SP04.

Sunday, December 6, 2015

Do We Really Need Zero-Inflated Models?-Paul Allison

Paul Allison discusses zero-inflated vs. negative binomial models in a post I stumbled across recently. William Greene and Paul also go back and forth on some technical distinctions and nuances (which may be quite important) in the comments.

http://statisticalhorizons.com/zero-inflated-models

"In all data sets that I've examined, the negative binomial model fits much better than a ZIP model, as evaluated by AIC or BIC statistics. And it's a much simpler model to estimate and interpret. So if the choice is between ZIP and negative binomial, I'd almost always choose the latter."

"But what about the zero-inflated negative binomial (ZINB) model? It's certainly possible that a ZINB model could fit better than a conventional negative binomial model regression model. But the latter is a special case of the former, so it's easy to do a likelihood ratio test to compare them (by taking twice the positive difference in the log-likelihoods). In my experience, the difference in fit is usually trivial..."

"So next time you're thinking about fitting a zero-inflated regression model, first consider whether a conventional negative binomial model might be good enough. Having a lot of zeros doesn't necessarily mean that you need a zero-inflated model."

Saturday, December 5, 2015

Do Friends Let Friends Do IV...or is all of that unobserved heterogeneity and endogeneity all in your head?

A few weeks ago, there was a post that caught my attention at the 'Kids Prefer Cheese' blog titled "Friends don't let Friends do IV" which was very critical of instrumental variable techniques. Around that same time, Marc Bellemare posted a contrasting piece, titled "Friends do let Friends do IV".

For some reason, I've written a number of posts recently related to instrumental variables, discussing different intuitive approaches to understanding them, or connections with directed acyclic graphs (DAGs). In the past, I have discussed them in the context of omitted variable bias and unobserved heterogeneity and endogeneity.

Now some colleagues have introduced me to a few papers authored by Qin that really question the validity of using instruments in this context. In the first paper, Resurgence of the Endogeneity-Backed Instrumental Variable Methods, Qin states:

“Essentially, the paranoia grows out of the fallacy that independent error terms exist prior to model specification and carry certain ‘structural’ interpretation similar to other economic variables…..In fact, it is practically impossible to validate the argument of endogeneity bias on the ground of correlation between a regressor and the error term in a multiple regression setting, especially when the model fit remains relatively low. Notice how much the basis of the IV treatment for ‘selection on the unobservables’ is weakened once 'e' is viewed as a model-derived compound of unspecified miscellaneous effects. In general, error terms of statistical models are derived from model specification. As such, they are unsuitable for any ‘structural’ interpretation, e.g. see Qin and Gilbert (2001)”

Qin goes deeper into this in a later working paper, Time to Demystify Endogeneity Bias.

From the abstract-

"This study exposes the flaw in defining endogeneity bias by correlation between an explanatory variable and the error term of a regression model. Through dissecting the links which have led to entanglement of measurement errors, simultaneity bias, omitted variable bias and self
 -selection bias, the flaw is revealed to stem from a Utopia mismatch of reality directly with single explanatory variable models."


The paper gets pretty heavy on details, despite promises to keep the math at a minimum. One of the central arguments made about the "endogeneity bias syndrome" is that an apparent misunderstanding or misinterpretation of error terms in multivariable vs. single variable regression is often used in applied work to set the stage for doing IV:

"Error terms or model residuals have been long perceived as sundry composites of what modellers are unable and/or uninterested to explain since Frisch’s time....Since cov(z,e)≠ 0 is single variable based, the contents of the error term have to be adequately ‘pure’, definitely not a mixture of sundry composites, to sustain its significant presence.  Indeed,  textbook  discussions  of  endogeneity  bias,  be  it  associated  with  SB (simultaneity bias), measurement errors, OVB(omitted variable bias) or SSB (self-selection bias), are all built on simple regression models. As soon as these models are extended to multiple ones, the correlation becomes mathematically intractable. In a multiple regression, all the explanatory variables are mathematically equal. Designation of one  as  the  causing  variable  of  interest  and  the  rest  as  control  variables  is  purely  from  the substantive  standpoint.  The  premise, cov(x,e)≠ 0, implies  not  only cov(z,e)≠ 0 for  the entire  set  of control  variables,  but  also  the  set  being  exhaustive.  Both  conditions  are  almost impossible to meet in practice."

Qin also has an applied paper related to wage elasticities where some of these ideas are put into context. See the references below.

References:

Duo Qin (2015). Resurgence of the Endogeneity-Backed Instrumental Variable Methods. Economics: The Open-Access, Open-Assessment E-Journal, 9 (2015-7): 1—35. http://dx.doi.org/10.5018/economics-ejournal.ja.2015-7 

Qin, D. (2015). "Time to Demystify Endogeneity Bias." SOAS Department of Economics Working Paper Series, No. 192, The School of Oriental and African Studies.

Qin, D., S. van Huellen and Q.C. Wang. (2014). "What Happens to Wage Elasticities When We Strip Playometrics? Revisiting Married Women Labour Supply Model." SOAS Department of Economics Working Paper Series, No. 190, The School of Oriental and African Studies. https://www.soas.ac.uk/economics/research/workingpapers/file97784.pdf




Saturday, November 28, 2015

Econometrics, Multiple Testing, and Researcher Degrees of Freedom

Some have argued that econometrics courses give too much emphasis to things like heteroskedasticity, multicollinearity, or clinical concerns about linearity, perhaps at the expense of more important concerns related to causality and prediction.

On the other hand, the experimental design courses I took in graduate school provided a treatment of multiple testing: things like Bonferroni adjustments in an analysis of variance setting. And in a non-inferential, predictive modeling context, Bonferroni and Kass adjustments are key to some of the decision tree models I have implemented. But multiple testing gets much less attention in a lot of the econometrics work I have seen.
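(For a concrete sense of what such an adjustment does, here is a quick R illustration with made-up p-values:)

# hypothetical unadjusted p-values from a set of pairwise comparisons
p <- c(0.012, 0.049, 0.210, 0.003)
p.adjust(p, method = "bonferroni") # multiply each by the number of tests (capped at 1)
p.adjust(p, method = "holm") # a less conservative step-down alternative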

Why the gap in emphasis on multiple testing? Probably because a lot of what I have read (or work that I have done) involves regressions with binary treatment indicators. The emphasis is almost entirely on a single test of significance related to the estimated regression coefficient...or so it would seem. More on this later. 

But I have spent more and more time in the last couple of years in the literature related to epidemiology, health, and wellness research. In one particular article, the authors noted, "Because of the exploratory character of the study, no adjustments for multiple hypotheses testing were performed" (Bender et al., 2002). They cited an earlier article (Bender & Lange, 2001) in which a distinction was made between multiple testing adjustments for inferential confirmatory studies and what might be characterized as more exploratory work.

"Exploratory studies frequently require a flexible approach for design and analysis. The choice and the number of tested hypotheses may be data dependent, which means that multiple significance tests can be used only for descriptive purposes but not for decision making, regardless of whether multiplicity corrections are performed or not. As the number of tests in such studies is frequently large and usually a clear structure in the multiple tests is missing, an appropriate multiple test adjustment is difficult or even impossible. Hence, we prefer that data of exploratory studies be analyzed without multiplicity adjustment. “Significant” results based upon exploratory analyses should clearly be labeled as exploratory results. To confirm these results the corresponding hypotheses have to be tested in further confirmatory studies."

They certainly follow their own advice in the 2001 paper. De Groot provides some great context around the distinction between confirmatory and exploratory analysis, describing exploratory analysis as follows:

"the material has not been obtained specifically and has not been processed specifically as concerns the testing of one or more hypotheses that have been precisely postulated in advance. Instead, the attitude of the researcher is: “This is interesting material; let us see what we can find.” With this attitude one tries to trace associations (e.g., validities); possible differences between subgroups, and the like. The general intention, i.e. the research topic, was probably determined beforehand, but applicable processing steps are in many respects subject to ad- hoc decisions. Perhaps qualitative data are judged, categorized, coded, and perhaps scaled; differences between classes are decided upon “as suitable as possible”; perhaps different scoring methods are tried along-side each other; and also the selection of the associations that are researched and tested for significance happens partly ad-hoc, depending on whether “something appears to be there”, connected to the interpretation or extension of data that have already been processed."

"...it does not so much serve the testing of hypotheses as it serves hypothesis-generation, perhaps theory-generation — or perhaps only the interpretation of the available material itself."


Gelman gets at this in his discussion of multiple testing and researcher degrees of freedom (see the Garden of Forking Paths). But the progress of science might not be possible without some flavor of multiple testing, and tying your hands with strict and clinical adjustment processes might hinder important work.

"At the same time, we do not want demands of statistical purity to strait-jacket our science. The most valuable statistical analyses often arise only after an iterative process involving the data" (see, e.g., Tukey, 1980, and Box, 1997).

What Gelman addresses in this paper goes beyond a basic discussion of failing to account for multiple comparisons or even multiple hypotheses:

"What we are suggesting is that, given a particular data set, it is not so difficult to look at the data and construct completely reasonable rules for data exclusion, coding, and data analysis that can lead to statistical significance—thus, the researcher needs only perform one test, but that test is conditional on the data…Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses. They would feel no sense of choice or “fishing” or “p-hacking”—even though different data could have led to different choices, each step of the way....to put it another way, we view these papers—despite their statistically significant p-values—as exploratory, and when we look at exploratory results we must be aware of their uncertainty and fragility."

This is starting to sound familiar. Looping back to applied econometrics, this reminds me a lot of the EconTalk podcast discussion between Russ Roberts and Ed Leamer. They discuss something very similar to what I think Gelman is getting at: a lot of empirical work has an exploratory flavor that needs admitting. Leamer recognized this a long time ago in his essay about taking the con out of econometrics.

"What is hidden from us as the readers and is the unspoken secret Leamer is referring to in his 1983 article, is that we don't get to go in the kitchen with the researcher. We don't see all the different regressions that were done before the chart was finished. The chart was presented as objective science. But those of us who have been in the kitchen--you don't just sit down and say you think these are the variables that count and this is the statistical relationship between them, do the analysis and then publish it. You convince yourself rather easily that you must have had the wrong specification--you left out a variable or included one you shouldn't have included. Or you should have added a squared term to allow for a nonlinear relationship. Until eventually, you craft, sculpt a piece of work that is a conclusion; and you publish that. You show that there is a relationship between A and B, x and y. Leamer's point is that if you haven't shown me all the steps in the kitchen, I don't really know whether what you found is robust. "

Going back to Gelman's garden of forking paths, he also seems to suggest that the solution is in fact to show all of the steps in the kitchen, or to make sure the dish can be replicated:

"external validation which is popular in statistics and computer science. The idea is to perform two experiments, the first being exploratory but still theory-based, and the second being purely confirmatory with its own preregistered protocol."

So, in econometrics, even if all I am after is a single estimate of a given regression coefficient, multiple testing and researcher degrees of freedom may be quite relevant concerns, despite the minimal treatment in many econometrics courses, textbooks, and papers. Since Leamer's article and the credibility revolution, sensitivity analysis and careful identification have certainly become more prevalent in empirical work. Showing all the steps in the kitchen, providing external validation, and/or explicitly recognizing the exploratory nature of your work (as in Bender et al., 2002) appear to be the best ways of dealing with this. But that is not yet true in every case, and this reveals a fragility in a lot of empirical work that prudence would require us to view with a critical eye when it comes to important policy papers.

See also:

In God We Trust, All Others Show Me Your Code

Pinning p-values to the wall

References

Bender R, Jöckel KH, Richter B, Spraul M, Berger M. (2002). "Body weight, blood pressure, and mortality in a cohort of obese patients." Am J Epidemiol 156(3): 239-245.

Bender R, Lange S. (2001). "Adjusting for multiple testing--when and how?" J Clin Epidemiol 54(4): 343-349.


The Meaning of "Significance" for Different Types of Research. A.D. de Groot. 1956.

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Andrew Gelman and Eric Loken.

"Let's Take the 'Con' Out of Econometrics," by Ed Leamer. The American Economic Review, Vol. 73, Issue 1, (Mar. 1983), pp. 31-43. 

"The Credibility Revolution in Empirical Economics: How Better Research Design is Taking the Con out of Econometrics," by Joshua Angrist and Jörn-Steffen Pischke. NBER Working Paper No. 15794, Mar. 2010.

Wednesday, November 11, 2015

Directed Acyclic Graphs (DAGs) and Instrumental Variables

Previously I discussed several of the most useful descriptions of instrumental variables that I have encountered through various sources. I was recently reviewing some of Lawlor's work related to Mendelian instruments and realized it was the first place I had seen the explicit use of directed acyclic graphs to describe how instrumental variables work.

[DAG: Z → X → Y, with U affecting both X and Y]

In describing the application of Mendelian instruments, Lawlor et al present instrumental variables with the aid of directed acyclic graphs. They describe an instrumental variable (Z) as depicted above in the following way, based on three major assumptions:

(1) Z is associated with the treatment or exposure of interest (X)
(2) Z is independent of the unobserved confounding factors (U) that impact both X and the outcome of interest (Y)
(3) Z is independent of the outcome of interest (Y) conditional on X and U (i.e., the 'exclusion restriction': Z impacts Y only through X)

Our instrumental variable estimate, βIV, is the ratio of the effect of Z on Y to the effect of Z on X, which can be estimated by two-stage least squares:

First stage: X = α0 + α1 Z + v   (save the fitted values, X*)
Second stage: Y = β0 + βIV X* + u

The first regression captures only the variation in our treatment or exposure of interest that is related to Z, leaving the variation related to U in the residual term. The second regression then estimates βIV using only this 'quasi-experimental' variation in X related to the instrument Z.
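Here is a minimal simulated sketch of this in R (my own illustration; it assumes the AER package for ivreg):

library(AER) # for ivreg

set.seed(123)
n <- 1000
u <- rnorm(n) # unobserved confounder U
z <- rnorm(n) # instrument Z, independent of U
x <- 0.5 * z + u + rnorm(n) # exposure X driven by Z and U
y <- 2 * x + u + rnorm(n) # outcome Y; the true effect of X on Y is 2

coef(lm(y ~ x)) # naive OLS, biased by U
coef(ivreg(y ~ x | z)) # IV/2SLS, close to 2

# the two stages "by hand" (point estimate matches ivreg; standard errors would not)
xstar <- fitted(lm(x ~ z)) # first-stage fitted values X*
coef(lm(y ~ xstar)) # second-stage slope is the IV estimate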

References:
Stat Med. 2008 Apr 15;27(8):1133-63. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Lawlor DA, Harbord RM, Sterne JA, Timpson N, Davey Smith G. Link: http://www.ncbi.nlm.nih.gov/pubmed/17886233

Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4): 669-710.

Wednesday, November 4, 2015

Instrumental Explanations of Instrumental Variables

I have recently discussed Marc Bellemare's 'Metrics Monday posts, but he has written many more applied econometrics posts that are really good. Another example is his post, Identifying Causal Relationships vs. Ruling Out All Other Possible Causes. The post is not about instrumental variables per se, but in it he describes IVs this way:

"As many readers of this blog know, disentangling causal relationships from mere correlations is the goal of modern science, social or otherwise, and though it is easy to test whether two variables x and y are correlated, it is much more difficult to determine whether x causes y. So while it is easy to test whether increases in the level of food prices are correlated with episodes of social unrest, it is much more difficult to determine whether food prices cause social unrest."
 

"In my work, I try to do so by conditioning food prices on natural disasters. To make a long story short, if you believe that natural disasters only affect social unrest through food prices, this ensures that if there is a relationship between food prices and social unrest, that relationship is cleaned out of whatever variation which is not purely due to the relationship flowing from food prices to social unrest. In other words, this ensures that the estimated relationship between the two variables is causal. This technique is known as instrumental variables estimation."

The idea of 'cleaning' out the bias or endogeneity is consistent with how I tried to build intuition for IVs before, depicting an instrumental variable as a 'filter' that picks up only variation in the treatment (CAMP) unrelated to an omitted variable (INDEX) or selection bias.

"A very non-technical way to think about this is that we are taking Z and going through CAMP to get to Y, and bringing with us only those aspects of CAMP that are unrelated to INDEX.  Z is like a filter that picks up only the variation in CAMP (what we may refer to as ‘quasi-experimental variation) that we are interested in and filters out the noise from INDEX.  Z is technically related to Y only through CAMP."

Z → CAMP → Y


 (you can read the full post for more context)

See also: http://econometricsense.blogspot.com/2013/06/unobserved-heterogeneity-and-endogeneity.html


Below are some more examples of discussions and descriptions of instrumental variables that have been the most beneficial to my understanding:

Kennedy: “The general idea behind this estimation procedure is that it takes the variation in the explanatory variable that matches up with variation in the instrument (and so is uncorrelated with the error), and uses only this variation to compute the slope estimate. This in effect circumvents the correlation between the error and the troublesome variable, and so avoids the asymptotic bias”

Mastering Metrics:“The instrumental variables (IV) method harnesses partial or incomplete random assignment, whether naturally occurring or generated by researchers….."

“The IV method uses these three assumptions to characterize a chain reaction leading from the instrument to student achievement. The first link in this causal chain-the first stage-connects randomly assigned offers with KIPP attendance, while the second link-the one we’re after-connects KIPP attendance with achievement.”


Dr. Andrew Gelman with comments from Hal Varian: How to think about instrumental variables when you get confused

“Suppose z is your instrument, T is your treatment, and y is your outcome. So the causal model is z -> T -> y……. when I get stuck, I find it extremely helpful to go back and see what I've learned from separately thinking about the correlation of z with T, and the correlation of z with y. Since that's ultimately what instrumental variables analysis is doing.”


"You have to assume that the only way that z affects Y is through the treatment, T. So the IV model is
T = az + e
y = bT + d

It follows that
E(y|z) = b E(T|z) + E(d|z)
Now if we
1) assume E(d|z) = 0
2) verify that E(T|z) != 0
we can solve for b by division. Of course, assumption 1 is untestable.
An extreme case is a purely randomized experiment, where e=0 and z is a coin flip."
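In the "extreme case" Varian mentions (a coin-flip instrument), solving for b by division is just the Wald estimator. A quick simulated sketch of my own:

set.seed(42)
n <- 5000
z <- rbinom(n, 1, 0.5) # coin-flip instrument
u <- rnorm(n) # confounder, independent of z (so E(d|z) = 0 holds)
tr <- 0.7 * z + u + rnorm(n) # treatment T
y <- 1.5 * tr + u + rnorm(n) # outcome; true b = 1.5

b_wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
  (mean(tr[z == 1]) - mean(tr[z == 0]))
b_wald # close to 1.5, while a naive lm(y ~ tr) would be biased by u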

References:
A Guide to Econometrics. Peter Kennedy.
Mastering 'Metrics. Joshua Angrist and Jörn-Steffen Pischke 

'Big' Data vs. 'Clean' Data

I've previously written about the importance of data cleaning, and recently I was reading a post, Data Science Can Transform Agriculture, If We Get It Right, on FarmLink's blog. I was impressed by the following:

"We believe it's a transformational time for this industry – call it Ag 3.0 – when the combination of human know-how and insight, coupled with robust data science and analytics will change the productivity, profitability and sustainability of agriculture."

This reminds me, as I have discussed before in relation to big data in agriculture, of economist Tyler Cowen's comments on an EconTalk podcast: "the ability to interface well with technology and use it to augment human expertise and judgement is the key to success in the new digital age of big data and automation."

 But in relation to data cleaning, I thought this was really impressive:

"...we disqualified more than two-thirds of the data collected during our first year. Now, we inspect each combine before use following a 50 point check list to identify any problem that could affect accuracy of collection, have developed a world class Quality Assurance process to test the data, and created IP addressable access to our combines to be able to identify and compensate for operator error. As a result, last year over 95% of collected data met our standard for being actionable. Admittedly, our first year data was “big.” But we chose to view it as largely worthless to our customers, just as much of the data being collected through farmer exchanges, open APIs, or memory sticks for example will be. It simply lacks the rigor to justify use in such important undertakings."

It sometimes takes patience and discipline to make the necessary sacrifices and put the necessary resources into data quality, and it looks like this company gets it. Data cleaning isn't just academic. It's serious. Maybe it's time to replace #BigData with #CleanData.
 
Related:
Big Data
Data Cleaning
Got Data? Probably not like your econometrics textbook!
Big Ag Meets Big Data (Part 1 & Part 2)
Big Data- Causality and Local Expertise are Key in Agronomic Applications
Big Ag and Big Data-Marc Bellemare
Big Data, IoT, Ag Finance, and Causal Inference
In God we trust, all others show me your code.
Data Science, 10% inspiration, 90% perspiration