Wednesday, November 11, 2015

Directed Acyclic Graphs (DAGs) and Instrumental Variables

Previously I discussed several of the most useful descriptions of instrumental variables that I have encountered through various sources. I was recently reviewing some of Lawlor's work related to Mendelian instruments and realized this was the first place I have seen the explicit use of directed acyclic graphs to describe how instrumental variables work.

In describing the application of Mendelian instruments, Lawlor et al present instrumental variables with the aid of directed acyclic graphs. They describe an instrumental variable (Z) as depicted above in the following way, based on three major assumptions:

(1) Z is associated with the treatment or exposure of interest (X).
(2) Z is independent of the unobserved confounding factors (U) that affect both X and the outcome of interest (Y).
(3) Z is independent of the outcome of interest Y given X and the unobserved factors U (i.e., this is the ‘exclusion restriction’: Z affects Y only through X).

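Our instrumental variable estimate, βIV, is the ratio of the association between Z and Y to the association between Z and X, Cov(Y,Z)/Cov(X,Z) (for a binary instrument, (E[Y|Z=1] - E[Y|Z=0])/(E[X|Z=1] - E[X|Z=0])), which can be estimated by two-stage least squares:

X = α0 + α1 Z + e, with fitted values X* = α0 + α1 Z
Y = β0 + βIV X* + v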

The first regression isolates the variation in our treatment or exposure of interest that is related to Z, and leaves all the variation related to U in the residual term. The second regression estimates βIV using only this ‘quasi-experimental’ variation in X related to the instrument Z.
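To make the two stages concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the data generating process, the coefficient values, and the helper name `ols`); none of it comes from Lawlor et al. It fits the two regressions by hand with NumPy and compares the result with the covariance ratio and with a naive regression of Y on X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta_true = 2.0  # assumed causal effect of X on Y (illustrative value)

u = rng.normal(size=n)                       # unobserved confounder U
z = rng.normal(size=n)                       # instrument Z, independent of U
x = 0.8 * z + u + rng.normal(size=n)         # exposure X depends on Z and U
y = 1.0 + beta_true * x + 2.0 * u + rng.normal(size=n)  # Z affects Y only through X


def ols(regressor, outcome):
    """Return (intercept, slope) from a one-regressor least-squares fit."""
    design = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


# First stage: regress X on Z and keep the fitted values X*
a0, a1 = ols(z, x)
x_hat = a0 + a1 * z

# Second stage: regress Y on the fitted values X*
_, beta_2sls = ols(x_hat, y)

# The same estimate written as a ratio of covariances
beta_ratio = np.cov(y, z)[0, 1] / np.cov(x, z)[0, 1]

# Naive regression of Y on X, which also picks up the variation due to U
_, beta_ols = ols(x, y)

print(f"2SLS: {beta_2sls:.3f}  ratio: {beta_ratio:.3f}  naive OLS: {beta_ols:.3f}")
```

In this setup the two-stage estimate and the covariance ratio coincide, while the naive regression is pulled away from the assumed effect by U. Note that running the second stage by hand like this gives the right point estimate but not the right standard errors, so in practice a dedicated IV routine is preferable.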

Lawlor DA, Harbord RM, Sterne JA, Timpson N, Davey Smith G. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Stat Med. 2008 Apr 15;27(8):1133-63.

Pearl J. Causal diagrams for empirical research. Biometrika (1995), 82, 4, pp. 669-710.

Wednesday, November 4, 2015

Instrumental Explanations of Instrumental Variables

I have recently discussed Marc Bellemare's 'Metrics Monday posts, but he has written many more applied econometrics posts that are really good. Another example is his post, Identifying Causal Relationships vs. Ruling Out All Other Possible Causes. The post is not about instrumental variables per se, but in it he describes IVs in this way:

"As many readers of this blog know, disentangling causal relationships from mere correlations is the goal of modern science, social or otherwise, and though it is easy to test whether two variables x and y are correlated, it is much more difficult to determine whether x causes y. So while it is easy to test whether increases in the level of food prices are correlated with episodes of social unrest, it is much more difficult to determine whether food prices cause social unrest."

"In my work, I try to do so by conditioning food prices on natural disasters. To make a long story short, if you believe that natural disasters only affect social unrest through food prices, this ensures that if there is a relationship between food prices and social unrest, that relationship is cleaned out of whatever variation which is not purely due to the relationship flowing from food prices to social unrest. In other words, this ensures that the estimated relationship between the two variables is causal. This technique is known as instrumental variables estimation."

The idea of 'cleaning' out the bias or endogeneity is consistent with how I tried to build intuition for IVs before, depicting an instrumental variable as being like a 'filter' that picks up only variation in the treatment (CAMP) unrelated to an omitted variable (INDEX) or selection bias.

"A very non-technical way to think about this is that we are taking Z and going through CAMP to get to Y, and bringing with us only those aspects of CAMP that are unrelated to INDEX. Z is like a filter that picks up only the variation in CAMP (what we may refer to as ‘quasi-experimental’ variation) that we are interested in and filters out the noise from INDEX. Z is technically related to Y only through CAMP."

Z → CAMP → Y

 (you can read the full post for more context)

See also:

Below are some more examples of discussions and descriptions of instrumental variables that have been the most beneficial to my understanding:

Kennedy: “The general idea behind this estimation procedure is that it takes the variation in the explanatory variable that matches up with variation in the instrument (and so is uncorrelated with the error), and uses only this variation to compute the slope estimate. This in effect circumvents the correlation between the error and the troublesome variable, and so avoids the asymptotic bias”

Mastering 'Metrics: “The instrumental variables (IV) method harnesses partial or incomplete random assignment, whether naturally occurring or generated by researchers….."

“The IV method uses these three assumptions to characterize a chain reaction leading from the instrument to student achievement. The first link in this causal chain-the first stage-connects randomly assigned offers with KIPP attendance, while the second link-the one we’re after-connects KIPP attendance with achievement.”

Dr. Andrew Gelman with comments from Hal Varian: How to think about instrumental variables when you get confused

“Suppose z is your instrument, T is your treatment, and y is your outcome. So the causal model is z -> T -> y……. when I get stuck, I find it extremely helpful to go back and see what I've learned from separately thinking about the correlation of z with T, and the correlation of z with y. Since that's ultimately what instrumental variables analysis is doing.”

"You have to assume that the only way that z affects Y is through the treatment, T. So the IV model is
T = az + e
y = bT + d

It follows that
E(y|z) = b E(T|z) + E(d|z)
Now if we
1) assume E(d|z) = 0
2) verify that E(T|z) != 0
we can solve for b by division. Of course, assumption 1 is untestable.
An extreme case is a purely randomized experiment, where e=0 and z is a coin flip."
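The "solve for b by division" step can be sketched numerically. The simulation below is an illustration under assumed values (the coefficients `a` and `b`, and the way `e` is made to depend on `d`, are hypothetical choices, not part of the quoted comment). With a coin-flip instrument, the division becomes the difference in mean outcomes across the two values of z divided by the difference in mean treatment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a, b = 0.5, 1.5  # hypothetical first-stage and causal coefficients

z = rng.integers(0, 2, size=n)       # instrument: a coin flip
d = rng.normal(size=n)               # outcome error, independent of z
e = 0.7 * d + rng.normal(size=n)     # treatment error correlated with d (endogeneity)
T = a * z + e                        # treatment
y = b * T + d                        # outcome

# Association of z with y (reduced form) and of z with T (first stage)
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = T[z == 1].mean() - T[z == 0].mean()

# Solve for b by division
b_wald = reduced_form / first_stage

# Naive slope of y on T for comparison; biased because T and d are correlated
b_naive = np.polyfit(T, y, 1)[0]

print(f"first stage: {first_stage:.3f}  reduced form: {reduced_form:.3f}")
print(f"b by division: {b_wald:.3f}  naive slope: {b_naive:.3f}")
```

Looking at the first stage and the reduced form separately before dividing mirrors Gelman's advice: if the first stage is close to zero, the ratio becomes unstable, which is easy to see when the two pieces are inspected on their own.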

A Guide to Econometrics. Peter Kennedy.
Mastering 'Metrics. Joshua Angrist and Jörn-Steffen Pischke 

'Big' Data vs. 'Clean' Data

I've previously written about the importance of data cleaning, and recently I was reading a post, Data Science Can Transform Agriculture, If We Get It Right, on FarmLink's blog, and I was impressed by the following:

"We believe it's a transformational time for this industry – call it Ag 3.0 – when the combination of human know-how and insight, coupled with robust data science and analytics will change the productivity, profitability and sustainability of agriculture."

This reminds me, as I have discussed before in relation to big data in agriculture, of economist Tyler Cowen's comments on an EconTalk podcast: "the ability to interface well with technology and use it to augment human expertise and judgement is the key to success in the new digital age of big data and automation."

But in relation to data cleaning, I thought this was really impressive:

"...we disqualified more than two-thirds of the data collected during our first year. Now, we inspect each combine before use following a 50 point check list to identify any problem that could affect accuracy of collection, have developed a world class Quality Assurance process to test the data, and created IP addressable access to our combines to be able to identify and compensate for operator error. As a result, last year over 95% of collected data met our standard for being actionable. Admittedly, our first year data was “big.” But we chose to view it as largely worthless to our customers, just as much of the data being collected through farmer exchanges, open APIs, or memory sticks for example will be. It simply lacks the rigor to justify use in such important undertakings."

It sometimes takes patience and discipline to make the necessary sacrifices and put the necessary resources into data quality, and it looks like this company gets it. Data cleaning isn't just academic. It's serious. Maybe it's time to replace #BigData with #CleanData.
Big Data
Data Cleaning
Got Data? Probably not like your econometrics textbook!
Big Ag Meets Big Data (Part 1 & Part 2)
Big Data- Causality and Local Expertise are Key in Agronomic Applications
Big Ag and Big Data-Marc Bellemare
Big Data, IoT, Ag Finance, and Causal Inference
In God we trust, all others show me your code.
Data Science, 10% inspiration, 90% perspiration

Wednesday, October 7, 2015

Metrics Monday with Marc Bellemare

I have been following Marc Bellemare for a while now on Twitter (@mfbellemare) and really became interested in his blog because of the large number of very good posts related to applied econometrics. He also writes about a number of interesting topics related to applied economics and in areas related to his own research. In the last few weeks (months?), he has been running a series of posts titled 'Metrics Monday where he addresses lots of issues related to applied econometrics that aren't always addressed in typical theory-based courses. I think every advanced undergraduate, graduate student, or any 'metrics or analytics practitioner should read all of his econometrics related posts. Below are some links to selected posts. I'll probably add to this list as he posts more, and as I discover older related posts I have not yet read.

Some of my favorite 'Metrics Monday posts: 

Friends *do* let friends do IV

When is heteroskedasticity (not) a problem

Hypothesis Testing in Theory and Practice

Data Cleaning


Rookie mistakes in empirical analysis

What to do with missing data

Other Applied Econometrics Posts by Marc Bellemare:

Love it or logit, Or: people really care about binary dependent variables

A rant on estimation with binary dependent variables

In defense of the cookbook approach to econometrics

Econometrics teaching needs an overhaul

Do Both

Wednesday, September 30, 2015

Big Data, IoT, Ag Finance, and Causal Inference

Over at my applied economics blog, I recently discussed an article from AgWeb: How the Fed's interest rate decision affects farmers. This got me questioning some of the ramifications of leveraging data analysis in the context of ag lending (from both a farmer and lender perspective), which ultimately led me to think about some interesting questions that would be exciting to investigate:
  1. Is there a causal relationship between producers' use of IoT and Big Data analytics applications and farm output/performance/productivity?
  2. How do we quantify the outcome: is it some measure of efficiency or some financial ratio?
  3. If we find improvements in this measure, is it simply a matter of selection? Are great producers likely to be productive anyway, with or without the technology?
  4. Among the best producers, is there still a marginal impact (i.e. treatment effect) for those that adopt a technology/analytics based strategy?
  5. Can we segment producers based on the kinds of data collected by IoT devices on equipment, apps, financial records, GPS etc. (maybe this is not that much different than the TrueHarvest benchmarking done at FarmLink), and are there differentials in outcomes, farming practices, product use patterns etc. by segment?
See also:
Big Ag Meets Big Data (Part 1 & Part 2)
Big Data- Causality and Local Expertise are Key in Agronomic Applications
Big Ag and Big Data-Marc Bellemare
Other Big Data and Agricultural related Application Posts at EconometricSense
Causal Inference and Experimental Design Roundup

Friday, September 25, 2015

Propensity Score Matching Meets Difference-in-Differences

I have recently come across a number of studies incorporating both difference-in-differences (DD) and propensity score methods. As discussed before, DD is a special case of fixed effects panel methods.

In the World Bank's publication "Impact Evaluation in Practice" they give a nice summary of the power of DD in identification of causal effects:

"...we can conclude that many unobserved characteristics of individuals are also more or less constant over time. Consider, for example, a person's intelligence or such personality traits as motivation, optimism, self-discipline, or family health history...Interestingly, we are canceling out (or controlling for) not only the effect of observed time invariant characteristics but also the effect of unobserved time invariant characteristics such as those mentioned above"

So with DD we can actually control for unobserved characteristics that we may not have data on or maybe couldn't measure appropriately or even quantify! That's powerful. In this framework we are controlling for time-invariant unobservable characteristics that may be contributing to selection bias, so we are achieving identification of treatment effects in a selection on unobservables context.
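A small simulation can show the 'canceling out' at work. This is a sketch under assumed values: the variable names, the size of the effect, and the common trend are all hypothetical. The unobserved trait `ability` is constant over time and higher among the treated, so a simple post-period comparison is biased, while differencing each unit over time removes it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
effect_true = 3.0  # assumed treatment effect (illustrative value)
trend = 1.0        # assumed time trend common to both groups

treated = rng.integers(0, 2, size=n)
# Unobserved, time-invariant trait that is higher among the treated (selection)
ability = rng.normal(size=n) + 2.0 * treated

y_pre = 10 + ability + rng.normal(size=n)
y_post = 10 + ability + trend + effect_true * treated + rng.normal(size=n)

# Naive post-period comparison: mixes the effect with the ability gap
naive = y_post[treated == 1].mean() - y_post[treated == 0].mean()

# Difference-in-differences: ability cancels within each unit's change,
# and the common trend cancels across groups
change = y_post - y_pre
dd = change[treated == 1].mean() - change[treated == 0].mean()

print(f"naive post comparison: {naive:.3f}  DD: {dd:.3f}")
```

The result depends on the common trend assumption built into the simulation. If the treated group's outcomes were trending differently for reasons unrelated to treatment, DD would not recover the effect.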

On the other hand, with propensity score matching, we are appealing to the conditional independence assumption, the idea that matched comparisons imply balance on observed covariates, which ‘recreates’ a situation similar to a randomized experiment where all subjects are essentially the same except for the treatment (Thoemmes and Kim, 2011). Propensity score matching can identify treatment effects in a selection on observables context.

But what if we combine both approaches? The Impact Evaluation book has a section on mixed methods that gives a really good treatment of the power of using both PSM and DD:

"Matched difference-in-differences is one example of combining methods. As discussed previously, simple propensity score matching cannot account for unobserved characteristics that might explain why a group chooses to enroll in a program and that might also affect outcomes. By contrast, matching combined with difference-in-differences at least takes care of any unobserved characteristics that are constant across time between the two groups"

Below are several papers that utilize the combination of DD and PSM:

Does Matching Overcome Lalonde’s Critique of Nonexperimental Estimators? Jeffrey Smith and Petra Todd. University of Maryland. 2003

Do Agricultural Land Preservation Programs Reduce Farmland Loss? Evidence from a Propensity Score Matching Estimator
Xiangping Liu and Lori Lynch January 2010

Measuring the Impact of Meat Packing and Processing Facilities in the Nonmetropolitan Midwest: A Difference-in-Differences Approach
Georgeanne M Artz, Peter Orazem, Daniel Otto
November 2005 Working Paper # 03003
Iowa State University

How Effective is Health Coaching in Reducing Health Services Expenditures?
Yvonne Jonk, PhD, Karen Lawson, MD, Heidi O’Connor, MS, Kirsten S. Riise, PhD, David Eisenberg, MD, Bryan Dowd, PhD, and Mary J. Kreitzer, PhD, RN, FAAN
Medical Care, Volume 53, Number 2, February 2015


Impact Evaluation in Practice
Paul J. Gertler, Sebastian Martinez, Patrick Premand,
Laura B. Rawlings and Christel M. J. Vermeersch
Default Book Series. December 2010

Friday, September 11, 2015

Mastering Metrics....and the Grain Markets

I recently finished two great books, Mastering 'Metrics and Mastering the Grain Markets.

Mastering the Grain Markets

While I have a background in agricultural and applied economics, my interest was always related to public choice and the environmental implications of biotechnology, as well as econometrics (hence this blog). So, I didn't really have much formal background related to commodity markets, other than a little exposure to options through a couple of finance classes. I have certainly read some really good extension publications related to futures, options, and hedging, but Mastering the Grain Markets by Elaine Kub really brings these issues to life. She brought me back to my crop scouting days in her many discussions of corn production and the agronomics of our major commodities. She also tackles some major issues and controversies associated with modern agriculture, everything from speculation to biotech to sustainability issues, gluten fad diets and more. Prepare for a trip from gate to plate in this book that teaches like a textbook but reads like a novel!

Even if you think all you are interested in are the specifics around how futures and options work, you'll end up being convinced that the holistic approach is essential. To borrow one quote:

"..any participation in the grain markets is a form of participation in agriculture, and it should be regarded as one piece of a beautiful, challenging, miraculous whole."
A couple areas that struck me as particularly interesting were her discussions of counterparty risk and over the counter contracts. I'll probably have a separate post on this blog or my ag econ blog regarding counterparty risk.

So, why share a review about a grain markets book on an applied econometrics blog? Well, all the discussion about OTCs and risk management rekindled my interest in copulas, which I have blogged about before, and also made me a little more curious about index based crop insurance. Risk modeling in commodities goes hand in hand with econometrics. Oh, and she even hits on precision agriculture and alludes to big data in agriculture:

"At the end of the growing season, he has every data point he could possibly need (seed population, seed depth, input rates, final yield, soil moisture, etc.) to fine tune his production practices on each GPS mapped square foot of his farm."

Mastering Metrics

Before reading MM, I had read Angrist and Pischke's Mostly Harmless Econometrics. It was my first rigorous introduction to the potential outcomes framework and causal inference. It took me a while to work through and I still reference it often. Even though Mastering 'Metrics was supposed to be a 'lite' version or maybe an undergraduate version of MHE, reading in 'reverse' order worked out well. What I really liked was their intro to regression; the presentation of regression as a matching estimator became even clearer to me than it did in MHE. To borrow a quote:

"Specifically, regression estimates are weighted averages of multiple matched comparisons"
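One way to check this claim numerically is with a single discrete covariate and a regression that includes a dummy for every cell. The sketch below is an illustration under assumed values (the three cells, the treatment shares, and the cell-specific effects are hypothetical). It compares the regression coefficient on treatment with a weighted average of the within-cell treated-minus-control differences, using weights proportional to cell size times p(1 - p), where p is the treated share in the cell.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

group = rng.integers(0, 3, size=n)                 # discrete covariate with 3 cells
p_treat = np.array([0.2, 0.5, 0.8])[group]         # treatment share differs by cell
d = (rng.random(n) < p_treat).astype(float)        # treatment indicator
cell_effect = np.array([1.0, 2.0, 4.0])[group]     # effect differs by cell
y = 5.0 * group + cell_effect * d + rng.normal(size=n)

# Regression of y on treatment plus a full set of cell dummies
dummies = (group[:, None] == np.arange(3)).astype(float)
design = np.column_stack([d, dummies])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_reg = coef[0]

# Matched comparisons: treated minus control mean within each cell,
# weighted by cell size times the variance of treatment in the cell
deltas, weights = [], []
for g in range(3):
    in_cell = group == g
    p = d[in_cell].mean()
    delta = y[in_cell & (d == 1)].mean() - y[in_cell & (d == 0)].mean()
    deltas.append(delta)
    weights.append(in_cell.sum() * p * (1 - p))
beta_matched = np.average(deltas, weights=weights)

print(f"regression coefficient: {beta_reg:.4f}")
print(f"weighted matched comparisons: {beta_matched:.4f}")
```

The two numbers agree, which is the sense in which regression is a matching estimator here: it compares treated and control units within cells and then averages, giving the most weight to cells where treatment varies the most.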

I really think a lot of people I encounter have a hard time thinking about that. I also got better insight and clarification on a number of issues related to instrumental variables, regression discontinuity, and difference-in-differences.  Within the IV discussion, I really like the causal-chain of effects presentation and discussion of 'intent to treat', and better understand all the things related to compliers and noncompliers etc. They also really got me up to speed with regard to the differences in parametric vs. non-parametric RD and an important distinction between fuzzy and sharp RD:

"....with fuzzy, applicants who cross a threshold are exposed to a more intense treatment, while with a sharp design, treatment switches cleanly on or off at the cutoff."

Another thing that stood out to me: in their DD chapter they made some clarifications about weighted regression and clustered standard errors that seemed very helpful. More generally, I really liked their treatment of the regression anatomy formula and understood it much better on this reading. Their basic review and treatment of inference, standard errors, and t-statistics is really great and a good way to segue an undergraduate student from an introductory statistics class into the more advanced topics they present later in the text. I could also see certain graduate programs, even outside of economics, making use of this text.

Both Mastering the Grain Markets and Mastering Metrics end with a final chapter tying everything together.

I highly recommend both books.

More thoughts....

So above I mentioned that risk modeling and econometrics go hand in hand, but I have been wondering: were any of the techniques covered in MM useful for work related to the commodities markets? In terms of informing marketing and risk management strategies, I'm not sure. Maybe some readers have some idea. But in terms of policy analysis as it relates to commodity markets, perhaps. There are some that advocate that we should restrict speculation in commodity markets. Scott Irwin looked at the impact of index funds on commodity markets using Granger causality (although Granger causality was not discussed in MHE or MM). Other work has relied on panel methods. A quick Google search reveals some related work using instrumental variables, which are discussed in MM. For now I'll just say to be continued.....


Irwin, S. H. and D. R. Sanders (2010), “The Impact of
Index and Swap Funds on Commodity Futures Markets:
Preliminary Results”, OECD Food, Agriculture and
Fisheries Working Papers, No. 27, OECD Publishing.
doi: 10.1787/5kmd40wl1t5f-en