Saturday, January 24, 2015

The Internet of Things, Big Data, and John Deere

What is the internet of things? It's essentially the proliferation of smart products with connectivity and data sharing capabilities that are changing the way we interact with them and the rest of the world. We've all experienced the IOT via smartphones, but the next generation of smart products will include our cars, homes, and appliances. A recent Harvard Business Review article discusses what this means for the future of industry and the economy: 

"How Smart, Connected Products Are Transforming Competition" by Porter and Heppelmann (the same Porter of Porter's five competitive forces): 

"IT is becoming an integral part of the product itself. Embedded sensors, processors, software, and connectivity in products (in effect, computers are being put inside products), coupled with a product cloud in which product data is stored and analyzed and some applications are run, are driving dramatic improvements in product functionality and performance....As products continue to communicate and collaborate in networks, which are expanding both in number and diversity, many companies will have to reexamine their core mission and value proposition"

In the article, they examine how the IOT is reshaping the competitive landscape through the lens of Porter's five competitive forces. 

HBR repeatedly uses John Deere as a case study of a firm leading its industry by leveraging big data and analytics as a model for a successful IOT strategy, particularly through the integration of IOT applications in connected tractors and implements. The competitive space thus expands from a singular focus on a line of equipment to optimizing performance and interoperability within a connected system of systems: 

"The function of one product is optimized with other related products...The manufacturer can now offer a package of connected equipment and related services that optimize overall results. Thus in the farm example, the industry expands from tractor manufacturing to farm equipment optimization."

A 'system of systems' approach means not only smart connected tractors and implements, but layers of connectivity and data related to weather, crop prices, and agronomics. That changes the value proposition not just for the farm implement business, but for the seed, input, sales, and crop consulting businesses as well. When you think about the IOT, Monsanto's purchase of Climate Corporation suddenly makes sense, and suddenly the competitive landscape in agriculture has changed. Both Monsanto and John Deere are offering advanced analytics and agronomic consulting services based on the big data generated from the IOT. John Deere has found value creation in these expanded economies of scope given the data generated from its equipment. Seed companies see added value in optimizing and developing customized genetics as farmers begin to farm (and collect data) literally inch by inch rather than field by field. New alliances and rivalries are being drawn between what were once rather distinct lines of business. 

And, as the HBR article notes, there is a lot of opportunity in the new era of Big Data and the IOT.  The convergence of biotechnology, genomics, and big data has major implications for economic development as well as environmental sustainability. 



Thursday, January 22, 2015

Fat Tails, Kurtosis, and Risk

When I think of 'fat tails' or 'heavy tails' I typically think of situations that can be described by probability distributions with heavy mass in the tails. This implies the tails are 'fatter' or 'thicker' than in situations with less mass in the tails. (For instance, a normal distribution might be said to have thin tails, while a distribution with more mass in the tails than the normal might be considered a 'fat tailed' distribution.)

Basically, when I think of tails I think of 'extreme events': occurrences with an excessive departure from what's expected on average. So if there is more mass in the tail of a probability distribution (it is a fat or thick tailed distribution), that implies that extreme events will occur with greater probability. In application, if I am trying to model or assess the probability of some extreme event (like a huge loss on an investment), then I had better get the distribution correct in terms of tail thickness.
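To make that concrete, here is a minimal sketch in R comparing the probability of a 4-standard-deviation 'extreme event' under a thin-tailed normal and a fat-tailed t distribution with 3 degrees of freedom (the threshold and the degrees of freedom are arbitrary choices on my part):

```r
threshold <- 4
p_normal <- 2 * pnorm(-threshold)        # P(|X| > 4) under the standard normal
p_t3     <- 2 * pt(-threshold, df = 3)   # P(|X| > 4) under a t with 3 df
c(normal = p_normal, t3 = p_t3, ratio = p_t3 / p_normal)
# The t(3) probability is hundreds of times larger: assuming normal tails when
# the data are really fat tailed badly understates the chance of a big loss.
```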

See here for a very nice explanation of fat tailed events from the Models and Agents blog: http://modelsagents.blogspot.com/2008/02/hit-by-fat-tail.html

Also here is a great podcast from the CME group related to tail hedging strategies: (May 29 2014) http://www.cmegroup.com/podcasts/#investorsAndConsultants

And here's a twist: what if I am trying to model the occurrence of multiple extreme events simultaneously (a huge loss in commodities and equities at the same time, or losses on real estate in Colorado and Florida at the same time)? I would want a multivariate model that captures the correlation or dependence between multiple extreme events, in other words between the tails of their distributions. Copulas offer an approach to modeling tail dependence, and again, getting the distributions correct matters.
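Here is a rough simulation of that idea (my own illustration, not a full copula model): two pairs of variables with the same correlation, one jointly normal (no tail dependence) and one bivariate t with 3 degrees of freedom (which does have tail dependence). The correlation, degrees of freedom, and quantile threshold are arbitrary choices:

```r
set.seed(42)
n   <- 1e6
rho <- 0.6
z1  <- rnorm(n)
z2  <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)   # correlated normals (Gaussian dependence)
w   <- rchisq(n, df = 3) / 3
t1  <- z1 / sqrt(w)                            # dividing by a common chi-square factor
t2  <- z2 / sqrt(w)                            # turns the pair into a bivariate t(3)
# P(second variable beyond its q-quantile | first variable beyond its q-quantile)
cond_exceed <- function(a, b, q) {
  qa <- quantile(a, q); qb <- quantile(b, q)
  mean(b[a > qa] > qb)
}
cond_exceed(z1, z2, 0.999)   # small for the normal pair, and it shrinks as q -> 1
cond_exceed(t1, t2, 0.999)   # substantially larger for the t pair: extremes cluster together
```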

When I think about how we assess or measure tail thickness in a data set, I think of kurtosis. Rick Wicklin recently had a nice piece discussing the interpretation of kurtosis and relating kurtosis to tail thickness.

"A data distribution with negative kurtosis is often broader, flatter, and has thinner tails than the normal distribution."

" A data distribution with positive kurtosis is often narrower at its peak and has fatter tails than the normal distribution."

The connection between kurtosis and tail thickness can be tricky, and kurtosis cannot be interpreted this way universally in all situations. Rick gives some good examples if you want to read more. But this definition of kurtosis from his article seems to keep us honest:

"kurtosis can be defined as "the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails. " with the caveat - "the peaks and tails of a distribution contribute to the value of the kurtosis, but so do other features."

But Rick also had an earlier post on fat and long tailed distributions where he puts all of this into perspective in terms of modeling extreme events and gives a more rigorous discussion and definition of tails and of what 'fat' or 'heavy' tailed means:

"Probability distribution functions that decay faster than an exponential are called thin-tailed distributions. The canonical example of a thin-tailed distribution is the normal distribution, whose PDF decreases like exp(-x2/2) for large values of |x|. A thin-tailed distribution does not have much mass in the tail, so it serves as a model for situations in which extreme events are unlikely to occur.

Probability distribution functions that decay slower than an exponential are called heavy-tailed distributions. The canonical example of a heavy-tailed distribution is the t distribution. The tails of many heavy-tailed distributions follow a power law (like |x|^(-α)) for large values of |x|. A heavy-tailed distribution has substantial mass in the tail, so it serves as a model for situations in which extreme events occur somewhat frequently."
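A quick numerical look at those decay rates using base R densities (the x values are arbitrary):

```r
x <- c(2, 4, 6, 8)
signif(dt(x, df = 3) / dnorm(x), 2)   # ratio of the t(3) density to the normal density
# The ratio grows from roughly 1 at x = 2 to astronomically large at x = 8,
# which is exactly the thin-tailed vs heavy-tailed contrast described above.
```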

Also, somewhat related, Nassim Taleb, in his paper on the precautionary principle and GMOs (genetically modified organisms), discusses concepts such as ruin, harm, fat tails and fragility, tail sensitivity to uncertainty, etc. He uses very rigorous definitions of these terms and concludes that certain things, like GMOs, would require a non-naive application of the precautionary principle while other things, like nuclear energy, would not. (Also tune into his discussion of this with Russ Roberts on EconTalk; there is more discussion of this particular application at my applied economics blog, Economic Sense.)

Wednesday, January 14, 2015

Are Quasi-Experimental Methods Off the Table for Wellness Program Analysis?

Last November there was a post on the Health Affairs blog related to the evaluation of wellness programs. Here are some tidbits:

"This blog post will consider the results of two compelling study designs — population-based wellness-sensitive medical event analysis, and randomized controlled trials (RCTs). Then it will look at the popular, although weaker, participant vs. non-participant study design."

"More often than not wellness studies simply compare participants to “matched” non-participants or compare a subset of participants (typically high-risk individuals) to themselves over time."

[difference in difference/panel/matched pairs & propensity score methods?]

“Looking at how participants improve versus non-participants…ignores self-selection bias. Self-improvers are likely to be drawn to self-improvement programs, and self-improvers are more likely to improve.” Further, passive non-participants can be tracked all the way through the study since they cannot “drop out” from not participating, but dropouts from the participant group—whose results would presumably be unfavorable—are not counted and are considered lost to follow-up. So the study design is undermined by two major limitations, both of which would tend to overstate savings."

I understand the issue of survivorship bias, and I will admit I don't know of an approach offhand to deal with it, but my thought is that panel methods, difference-in-differences, and propensity score matching fit firmly within the Rubin causal model, or potential outcomes framework, for addressing issues related to selection bias.
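Just to make the idea concrete, here is a rough difference-in-differences sketch on simulated data (this is my own illustration, not anything from the Health Affairs post; the variable names and effect sizes are made up):

```r
set.seed(7)
n <- 2000
participant <- rbinom(n, 1, 0.5)               # wellness program participants
post        <- rbinom(n, 1, 0.5)               # post-program period indicator
# Participants have lower baseline costs (selection); the true program effect is -200.
cost <- 5000 - 400 * participant - 100 * post -
        200 * participant * post + rnorm(n, sd = 500)
did <- lm(cost ~ participant * post)
summary(did)$coefficients["participant:post", ]  # the DiD estimate recovers roughly -200,
                                                 # differencing away the baseline selection
```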

As I read this, it seems to me that the authors certainly are not fond of some of the most popular quasi-experimental approaches compared to the gold standard of a randomized trial. No arguments that an RCT is by far the most reliable way to identify treatment effects, but I know that when it comes to applied work, an RCT just isn't happening for a lot of obvious reasons. I am just not familiar with what they describe as wellness-sensitive medical event analysis, nor do I understand how it does a better job addressing the shortfalls of the quasi-experimental approaches they alluded to (of course, it is a blog post, and I'm sure there is more that could be said given the space). I also wonder: what about examples of more robust quasi-experimental approaches, like instrumental variables?
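And here is a similarly rough sketch of the instrumental variables idea, done as manual two-stage least squares on simulated data (again my own illustration; the instrument, variable names, and numbers are hypothetical, and in practice something like ivreg() in the AER package would also give correct standard errors):

```r
set.seed(11)
n <- 20000
z <- rbinom(n, 1, 0.5)                       # instrument: e.g., a randomized invitation
u <- rnorm(n)                                # unobserved 'self-improver' tendency
participate <- rbinom(n, 1, plogis(1.5 * z + u))
cost <- 5000 - 300 * participate - 500 * u + rnorm(n, sd = 500)  # true effect: -300
coef(lm(cost ~ participate))["participate"]  # naive comparison overstates savings
stage1 <- lm(participate ~ z)                # first stage: instrument predicts participation
stage2 <- lm(cost ~ fitted(stage1))          # second stage: 2SLS roughly recovers -300
coef(stage2)
```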

Tuesday, January 13, 2015

Overconfident Confidence Intervals

In an interesting post, "Playing Dumb on Statistical Significance," there is a discussion of Naomi Oreskes' January 4 NYT piece, "Playing Dumb on Climate Change." The dialogue centers on her reference to confidence intervals and a possibly overbearing burden of proof that researchers apply related to statistical significance.

From the article:

"Although the confidence interval is related to the pre-specified Type I error rate, alpha, and so a conventional alpha of 5% does lead to a coefficient of confidence of 95%, Oreskes has misstated the confidence interval to be a burden of proof consisting of a 95% posterior probability. The “relationship” is either true or not; the p-value or confidence interval provides a probability for the sample statistic, or one more extreme, on the assumption that the null hypothesis is correct. The 95% probability of confidence intervals derives from the long-term frequency that 95% of all confidence intervals, based upon samples of the same size, will contain the true parameter of interest."

As mentioned in the piece, Oreskes' writing does have a Bayesian ring to it, and this whole story and critique makes me think of Kennedy's chapter on "The Bayesian Approach" in his book A Guide to Econometrics. I believe people often interpret frequentist confidence intervals from a Bayesian perspective. If I understand any of this at all, and I admit my knowledge of Bayesian econometrics is limited, then I think I have been a guilty offender at times as well. The chapter even states that Bayesians tout their methods on the grounds that Bayesian thinking is how people actually think, which is why people so often misinterpret frequentist confidence intervals.

In Bayesian analysis, a posterior probability distribution is produced, and a posterior probability interval can be formed by 'chopping off' 2.5% from each tail, leaving an area, or probability, of 95%. From a Bayesian perspective, it is correct for the researcher to claim or believe that there is a 95% probability that the true value of the parameter being estimated falls within that interval. This is how many people interpret confidence intervals, which are quite different from Bayesian posterior probability intervals. An illustration is given by Kennedy:

"How do you think about an unknown parameter? When you are told that the interval between 2.6 and 2.7 is a 95% confidence interval, how do you think about this? Do you think, I am willing to bet $95 to your $5 that the true value of 'beta' lies in this interval [note this sounds a lot like Oreskes as if you read the article above]? Or do you think, if I were to estimate this interval over and over again using data with different error terms, then 95% of the time this interval will cover the true value of 'beta'"?


"Are you a Bayesian or a frequentist?"

Monday, December 22, 2014

Econ(ometrics) Talk

I just recently purchased Angrist and Pischke's "Mastering 'Metrics" (HT Marc Bellemare). And, timely enough, today's EconTalk podcast featured Josh Angrist:

"Joshua Angrist of the Massachusetts Institute of Technology talks to EconTalk host Russ Roberts about the craft of econometrics--how to use economic thinking and statistical methods to make sense of data and uncover causation. Angrist argues that improvements in research design along with various econometric techniques have improved the credibility of measurement in a complex world. Roberts pushes back and the conversation concludes with a discussion of how to assess the reliability of findings in controversial public policy areas."

I've mentioned their previous book, Mostly Harmless Econometrics, on this blog many times before, and I have had some actual interaction with the authors via their related blog, where I asked a regression/matching question. I have described their book as an off-road backwoods survival manual for practitioners.
 
To say the least, I am looking forward to the podcast and to reading their latest book.

Some related Posts:

Angrist and Pischke on Linear Probability Models
Applied Econometrics
Quasi-Experimental Design Roundup
Analytics vs Causal Inference
The Oregon Experiment, Applied Econometrics, and Causal Inference
The Oregon Experiment and Linear Probability Models

Some related EconTalk podcasts that I highly recommend:

Leamer on the State of Econometrics
Manzi on the Oregon Medicaid Study
Manzi on Knowledge, Policy, and Uncontrolled


Tuesday, December 2, 2014

A Cookbook Econometrics Analogy

Previously I wrote a post on applied econometrics, which really was not all that original. It was motivated by previous posts by Marc Bellemare and Dave Giles, and I just added some commentary on my personal experience along with some quotes from Peter Kennedy's A Guide to Econometrics.

Since then, I've been reading Kennedy's chapter on applied econometrics in greater detail (I have a 6th edition copy), and I found the following interesting analogy. Typically, cookbook analogies are used negatively, describing practitioners mind-numbingly running regressions and applying tests without a strong appreciation for the underlying theory, but this one is of a different flavor and, to me, gives a good impression of what 'doing econometrics' actually feels like:

From Valavanis (1959, p. 83), quoted in Kennedy's 6th edition:

"Econometric theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he used shredded wheat; and he substitutes green garment die for curry, ping-pong balls for turtles eggs, and for Chalifougnac vintage 1883, a can of turpentine."

It really gets uncomfortable when you are presenting at a seminar, conference, or to some other audience, and someone who isn't elbow deep in the data points out that your estimator isn't theoretically valid because you used 'turpentine' when the recipe (or econometric theory) calls for Chalifougnac vintage 1883, or when someone well versed in theory but unaware of the social norms of applied econometrics tries to make you look incompetent by pointing out this 'mistake.'

Also gives me That Modeling Feeling.

Friday, November 28, 2014

Applied Econometrics

I really enjoy Marc Bellemare's applied econ posts, but I especially enjoy his econometrics-related posts (for instance, a while back he wrote some really nice posts related to linear probability models here and here).

Recently he wrote a piece entitled "In defense of the cookbook approach to econometrics." At one point he states:

"The problem is that there is often a wide gap between the theory and practice of econometrics. This is especially so given that the practice of econometrics is often the result of social norms..."

He goes on to make the point that 'cookbook' classes should be viewed as a complement to, not a substitute for, theoretical econometrics classes.

It is the gap between theory and practice that has given me a ton of grief in the last few years. After spending hours and hours in graduate school working through tons of theorems and proofs, basically restating everything I learned as an undergraduate in a more rigorous tone, I found that when it came to actually doing econometrics I wasn't much better off, and sometimes I wondered if maybe I had even regressed. At every corner was a different challenge that seemed at odds with everything I learned in grad school. Doing econometrics felt like running in water. As Peter Kennedy states in the applied econometrics chapter of his popular A Guide to Econometrics, "econometrics is much easier without data".

As Angrist and Pischke state: "if applied econometrics were easy theorists would do it."

I was very lucky that my first job out of graduate school was at the same university I attended as an undergraduate, and I had the benefit of my former professors to show me the ropes, or the 'social norms' as Bellemare puts it. The thing is, all along I just thought that since I was an MS rather than a PhD graduate, maybe I didn't know these things because I just hadn't had that last theory course, or maybe the 'applied' econometrics course I took was too weak on theory. But as Kennedy points out:

"In virtually every econometric analysis there is a gap, usually a vast gulf, between the problem at hand and the closest scenario to which standard econometric theory is applicable....the issue here is that in their econometric theory courses students are taught standard solutions to standard problems, but in practice there are no standard problems...Applied econometricians are continually faced with awkward compromises..."

The hard part for the recent graduate who has not had a good applied econometrics course is figuring out how to compromise, or which sins are more forgivable or harmless than others.

Another issue with applied vs. theoretical econometrics is software implementation. Most economists I know seem to use Stata, but I have primarily worked in a SAS shop and have taught myself R. Most of my econometrics coursework, however, was done on PAPER, with some limited work in SPSS. GRETL is also popular as a teaching tool. Statistical programming is a whole new world, and propensity score matching in SAS is not straightforward (although paper 314-2012, linked here, is really nice if you are interested). Speaking of which, if you don't have the luxury of someone showing you the ropes, maybe the best thing you can do is attend some conferences. While not strictly an academic conference, SAS Global Forum has been a great one, with proceedings replete with applied papers that include software implementation. R bloggers also offer good examples of applied work with software implementation.
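For what it's worth, here is a bare-bones nearest-neighbor propensity score match in base R, just to show the mechanics (purely illustrative; all variables are simulated and made up, and for real work a dedicated matching package plus balance diagnostics is the way to go):

```r
set.seed(99)
n <- 1000
age    <- rnorm(n, 45, 10)
income <- rnorm(n, 50, 15)
treat  <- rbinom(n, 1, plogis(-3 + 0.04 * age + 0.02 * income))
y      <- 10 + 2 * treat + 0.1 * age + 0.05 * income + rnorm(n)   # true effect = 2
pscore <- fitted(glm(treat ~ age + income, family = binomial))    # estimated propensity scores
treated  <- which(treat == 1)
controls <- which(treat == 0)
# For each treated unit, grab the control with the closest propensity score (with replacement).
matched <- sapply(treated, function(i) controls[which.min(abs(pscore[controls] - pscore[i]))])
mean(y[treated]) - mean(y[matched])   # crude estimate of the effect on the treated, near 2
```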



See also:
 
Culture War: Classical Statistics vs. Machine Learning

Ambitious vs. Ambiguous Modeling

Mostly Harmless Econometrics as an off-road backwoods survival manual for practitioners.