---
title: Sampling Importance Resampling (SIR) and social revolution
author: Jonathan Olmsted
license: GPL (>= 2)
mathjax: true
tags: armadillo modeling featured
summary: We use SIR to characterize the posterior distribution of parameters associated with the probability of social revolution.
layout: post
src: 2014-10-23-SIR-and-revolution.Rmd
---
The purpose of this gallery post is several-fold:

- to demonstrate the use of the new and improved C++-level implementation of R's `sample()` function (see here)
- to demonstrate the Gallery's new support for images in contributed posts
- to demonstrate the usefulness of SIR for updating posterior beliefs given a sample from an arbitrary prior distribution
The application in this post uses an example from Jackman's *Bayesian Analysis for the Social Sciences* (page 72), which now has a 30-year history in the Political Science literature (see Jackman for more references). The focus is on the extent to which the probability of revolution varies with facing a foreign threat or not. Facing a foreign threat is measured by "defeated ..." or "not defeated ..." over a span of 20 years. The countries come from Latin America. During this period of time, there are only three revolutions: Bolivia (1952), Mexico (1910), and Nicaragua (1979).
<style>
table {margin-left: auto; margin-right: auto;}
table, th, td {padding: 5px; background-color: lightgrey; border: 1px solid white;}
</style>

| | Revolution | No Revolution |
|---|---|---|
| Defeated and invaded or lost territory | 1 | 7 |
| Not defeated for 20 years | 2 | 74 |
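For reference, the cell counts above can be collected into an R matrix. This is just a convenience object for readers following along; the name `revolts` is illustrative and is not used elsewhere in the post.

{% highlight r %}
## Observed counts from the table above (rows: threat status, columns: outcome)
revolts <- matrix(c(1, 7, 2, 74), nrow = 2, byrow = TRUE,
                  dimnames = list(c("defeated", "not defeated"),
                                  c("revolution", "no revolution")))
revolts
{% endhighlight %}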
The goal is to learn about the true, unobservable probabilities of revolution given a recent defeat or the absence of one. That is, we care about

$$ \theta_1 = \Pr(\text{revolution} \mid \text{defeated}) $$

and

$$ \theta_2 = \Pr(\text{revolution} \mid \text{not defeated}). $$

And, beyond that, we care about whether

$$ \theta_1 > \theta_2. $$
These data are assumed to arise from a Binomial process, where the
likelihood of the probability parameter value, $\theta$, is

$$ L(\theta \mid y, n) \propto \theta^{y} (1 - \theta)^{n - y}, $$

where $y$ is the number of revolutions and $n - y$ is the number of non-revolutions in a given row of the table above.
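As a quick illustration (not part of the original analysis), we can evaluate this likelihood for the "defeated" row, one revolution out of eight country-periods, at two candidate values of $\theta$ that reappear later in the post:

{% highlight r %}
theta <- c(0.125, 0.8)
theta^1 * (1 - theta)^7             # proportional to the likelihood of 1 success, 7 failures
dbinom(1, size = 8, prob = theta)   # same shape, up to the binomial coefficient
{% endhighlight %}

The value 0.8 is orders of magnitude less likely than 0.125 given these data, which is exactly what drives the resampling weights used below.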
A Bayesian statistician could approach the question a bit more
directly and compute the probability that $\theta_1 > \theta_2$ given the observed data.
Sampling Importance Resampling allows us to sample from the posterior
distribution, $p(\theta \mid y)$, by resampling from a series of draws from the prior,
$p(\theta)$, with weights proportional to the likelihood of each draw.
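To fix ideas, here is a minimal SIR sketch in plain R for a single row of the table under a flat prior. The object names are illustrative only; the C++ version below is what the post actually uses.

{% highlight r %}
set.seed(42)
prior_draws <- runif(20000)                        # draws from a flat prior on theta
y <- 1; n <- 8                                     # the "defeated" row: 1 revolution in 8 cases
wts <- prior_draws^y * (1 - prior_draws)^(n - y)   # likelihood of each prior draw
wts <- wts / sum(wts)                              # normalize into resampling probabilities
posterior_draws <- sample(prior_draws, 20000, replace = TRUE, prob = wts)
mean(posterior_draws)                              # approximately the Beta(2, 8) posterior mean, 0.2
{% endhighlight %}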
We begin by drawing many samples from a series of prior
distributions. Although a conjugate Beta prior distribution on the revolution
probabilities would make the posterior available in closed form, SIR only requires
that we can sample from the prior and evaluate the likelihood.
In particular, we will consider our posterior beliefs about the difference in probabilities under five different prior distributions.
{% highlight r %}
dfPriorInfo <- data.frame(id = 1:5,
                          dist = c("beta", "beta", "gamma", "beta", "beta"),
                          par1 = c(1, 1, 3, 10, .5),
                          par2 = c(1, 5, 20, 10, .5),
                          stringsAsFactors = FALSE)
dfPriorInfo
{% endhighlight %}
<pre class="output">
  id  dist par1 par2
1  1  beta  1.0  1.0
2  2  beta  1.0  5.0
3  3 gamma  3.0 20.0
4  4  beta 10.0 10.0
5  5  beta  0.5  0.5
</pre>
Using the data frame `dfPriorInfo` and the **plyr** package, we will
draw a total of 20,000 values from each of the prior
distributions. This can be done in any number of ways and is
completely independent of using Rcpp for the SIR magic.
{% highlight r %} library("plyr") MC1 <- 20000 dfPriors <- ddply(dfPriorInfo, "id", .fun = (function(X) data.frame(draws = (do.call(paste("r", X$dist, sep = ""), list(MC1, X$par1, X$par2)))))) {% endhighlight %}
In any case, we can confirm that our draws are as we expect and that we have the right number of them (5 &times; 20,000 = 100,000).
{% highlight r %}
head(dfPriors)
{% endhighlight %}
<pre class="output">
  id     draws
1  1 0.7124225
2  1 0.5910231
3  1 0.0595327
4  1 0.4718945
5  1 0.4485650
6  1 0.0431667
</pre>
{% highlight r %}
dim(dfPriors)
{% endhighlight %}

<pre class="output">
[1] 100000      2
</pre>
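As one further, optional sanity check (not in the original post), the empirical moments of the draws can be compared against the five prior specifications:

{% highlight r %}
## empirical mean and standard deviation of the draws, by prior id
ddply(dfPriors, "id", summarize,
      mean_draw = mean(draws),
      sd_draw   = sd(draws))
{% endhighlight %}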
Now, we write a C++ snippet that will create our R-level function to
generate a sample of `D` values from the prior draws (`prdraws`) given
their likelihood after the data (i.e., the number of successes, `nsucc`,
and the number of failures, `nfail`).
The most important feature to mention here is the use of some new and
improved extensions which effectively provide an equivalent,
performant mirror of R's `sample()` function at the
C++ level. Note that you need RcppArmadillo 0.4.500.0 or newer for this
version of `sample()`.
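If in doubt, a quick check from R (a sketch, not part of the original post):

{% highlight r %}
## TRUE if the installed RcppArmadillo provides the newer sample() extension
packageVersion("RcppArmadillo") >= "0.4.500.0"
{% endhighlight %}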
The return value of this function is a length-`D` vector of draws from
the posterior distribution given the draws from the prior distribution,
where the likelihood is used as a filtering weight.
{% highlight cpp %}
#include <RcppArmadilloExtensions/sample.h>
#include <RcppArmadilloExtensions/fixprob.h>

// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp ;

// [[Rcpp::export()]]
NumericVector samplePost (const NumericVector prdraws,
                          const int D,
                          const int nsucc,
                          const int nfail) {
    int N = prdraws.size();
    NumericVector wts(N);
    // Binomial likelihood (up to a constant) of each prior draw given the data
    for (int n = 0 ; n < N ; n++) {
        wts(n) = pow(prdraws(n), nsucc) * pow(1 - prdraws(n), nfail);
    }
    // normalize the weights into proper resampling probabilities
    RcppArmadillo::FixProb(wts, N, true);

    // resample D draws, with replacement, proportional to the weights
    NumericVector podraws = RcppArmadillo::sample(prdraws, D, true, wts);
    return(podraws);
}
{% endhighlight %}
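If you are following along outside of the original R Markdown document, one way to compile the snippet is with `Rcpp::sourceCpp()`; the file name below is purely hypothetical.

{% highlight r %}
library(Rcpp)
sourceCpp("samplePost.cpp")   # hypothetical file containing the C++ snippet above
{% endhighlight %}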
To use the `samplePost()` function, we create the R representation
of the data as follows.
{% highlight r %}
nS <- c(1, 2)  # successes
nF <- c(7, 74) # failures
{% endhighlight %}
As a simple example, consider drawing a posterior sample of size 30
for the "defeated" case from a discrete prior distribution with equal
weight on the values 0.125, 0.127, and 0.8.
{% highlight r %}
table(samplePost(c(.125, .127, .8), 30, nS[1], nF[1]))
{% endhighlight %}
<pre class="output">
0.125 0.127 
    9    21 
</pre>

Note that the value 0.8 is so unlikely given the data that it never appears among the resampled posterior draws.
Again making use of the **plyr** package, we construct posterior samples of size
20,000 for both $\theta_1$ and $\theta_2$ under each of the five priors and store
the draws in `dfPost`.
{% highlight r %}
MC2 <- 20000

f1 <- function(X) {
    draws <- X$draws
    t1 <- samplePost(draws, MC2, nS[1], nF[1])
    t2 <- samplePost(draws, MC2, nS[2], nF[2])
    return(data.frame(theta1 = t1,
                      theta2 = t2))
}

dfPost <- ddply(dfPriors, "id", f1)
{% endhighlight %}
{% highlight r %}
head(dfPost)
{% endhighlight %}

<pre class="output">
  id    theta1    theta2
1  1 0.3067334 0.0130865
2  1 0.1421879 0.0420830
3  1 0.3218130 0.0634511
4  1 0.0739756 0.0363466
5  1 0.1065267 0.0460336
6  1 0.0961749 0.0440790
</pre>
{% highlight r %}
dim(dfPost)
{% endhighlight %}

<pre class="output">
[1] 100000      3
</pre>
Here, we are visualizing the posterior draws for the quantity of
interest, $\theta_1 - \theta_2$: the difference in the probability of
revolution with and without a recent defeat.
The full posterior distribution of $\theta_1 - \theta_2$ can be examined
separately under each of the five prior specifications.
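The original post presents these posteriors graphically; the figures are not reproduced here. A sketch of one way to produce such a figure with **ggplot2** (not the author's plotting code) is:

{% highlight r %}
library(ggplot2)

dfPost$diff <- dfPost$theta1 - dfPost$theta2       # quantity of interest, theta1 - theta2
ggplot(dfPost, aes(x = diff)) +
    geom_density(fill = "grey", alpha = .5) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    facet_wrap(~ id) +                             # one panel per prior specification
    xlab(expression(theta[1] - theta[2]))
{% endhighlight %}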
Recall that the priors are, themselves, over the individual
revolution probabilities, $\theta_1$ and $\theta_2$.
At least across these specifications of the prior distributions on
$\theta_1$ and $\theta_2$, the conclusion that the probability of revolution
is higher following a recent defeat appears quite robust.
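One simple numerical summary of that robustness (again a sketch, not taken from the original post) is the posterior probability that $\theta_1 > \theta_2$ under each prior:

{% highlight r %}
## share of posterior draws with theta1 > theta2, by prior id
ddply(dfPost, "id", summarize,
      prob_diff_positive = mean(theta1 > theta2))
{% endhighlight %}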