Great, I love the earthquake example because there are plenty of free datasets available I can use to demonstrate the different analysis techniques.

raven15 wrote: Wed Dec 12, 2018 11:54 pm

For earthquakes, and to a much lesser extreme markets, usually nothing happens, and when things do happen, several of them are likely to happen in a row. Suppose you measured the total slip per day at a given point on a fault line every day for 100 years. Then you assigned a standard distribution to the recorded data and noted the mean, median, and 1, 2, and 3-sigma ground motions. You would have a terrible model of earthquake behaviour because on most days nothing happened, on a few days something minor happened, and on rare events or perhaps not at all during your study the ground shifted cataclysmically.

Agreed, this would be a terrible model. The magnitudes of earthquakes definitely do not follow a normal distribution. However,

*suitably computed averages of multiple earthquakes do*.

Let's see some data. I found a USGS dataset of 5.5+ magnitude earthquakes from 1965-2016 here: https://www.kaggle.com/usgs/earthquake-database. It's a 52-year period including 23,412 recorded earthquakes, more than enough to have fun with. Here's some summary information about the earthquake magnitudes.


```
count min median max mean sd
23412 5.5 5.7 9.1 5.88 0.42
```
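For reference, here is roughly how a summary line like that can be computed. In the real analysis the magnitudes would come out of the Kaggle CSV (the commented-out lines show one plausible way to read it; the filename and column name are assumptions); the tiny inline list is just a stand-in so the sketch runs on its own:

```python
import statistics

# In the real analysis the magnitudes come from the Kaggle CSV, e.g.:
#   import csv
#   with open("database.csv") as f:                                # assumed filename
#       mags = [float(r["Magnitude"]) for r in csv.DictReader(f)]  # assumed column name
mags = [5.5, 5.7, 5.7, 6.1, 9.1]  # tiny stand-in so the sketch is self-contained

print({
    "count": len(mags),
    "min": min(mags),
    "median": statistics.median(mags),
    "max": max(mags),
    "mean": round(statistics.fmean(mags), 2),
    "sd": round(statistics.stdev(mags), 2),  # sample (n-1) standard deviation
})
```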

Naturally, just because a mean and standard deviation can be computed for a dataset doesn't mean those statistics say anything sensible about the underlying data-generating process. Say we made the mistake of fitting a normal (Gaussian) distribution to the data and computing the implied counts of earthquakes above a certain magnitude. To illustrate, here's a quick plot of the 23,412 earthquake magnitudes with the distribution N(5.88, 0.42²) superimposed.

Under this bad normal model, how many 7.0+ earthquakes would be expected over the time period? 96.7, far fewer than the 738 actually observed. As expected, a normal distribution is an exceptionally poor approximation to earthquake magnitudes. Besides the truncation issue (no earthquakes below magnitude 5.5 were included in the data), there is also an obvious long right tail.
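As a sanity check, that implied count follows directly from the Gaussian upper tail. A minimal sketch using the rounded summary stats from the table above (the post's 96.7 presumably came from unrounded values, so this lands in the same ballpark rather than matching exactly):

```python
from math import erfc, sqrt

def expected_count(n, mean, sd, threshold):
    """Expected number of observations above `threshold` among `n` draws
    from N(mean, sd^2), via the Gaussian upper-tail probability."""
    z = (threshold - mean) / sd
    tail = 0.5 * erfc(z / sqrt(2))  # P(X > threshold)
    return n * tail

# Rounded summary stats from the table above (assumed inputs).
print(expected_count(23412, 5.88, 0.42, 7.0))  # roughly 90 -- far fewer than the 738 observed
```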

Even though earthquake magnitudes clearly do not follow a Gaussian distribution, suitably normalized averages of them eventually will. Let's look at the average magnitude of all 5.5+ earthquakes by month.


```
months min median max mean sd
624 5.62 5.87 6.31 5.88 0.098
```

This is already closer to a normal distribution. Under this model, how many months would we expect to see an average earthquake magnitude of 6.1+? 8.62, compared to 22 months in actuality. While the model is still not a great fit, we see that simply aggregating the data led to a distribution that is better approximated by the normal.
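The same upper-tail arithmetic reproduces this expectation from the rounded monthly summary stats (again a sketch; the post's 8.62 presumably used unrounded values, so expect a figure close to but not exactly matching it):

```python
from math import erfc, sqrt

# Rounded stats from the monthly table above (assumed inputs).
months, mean, sd = 624, 5.88, 0.098
z = (6.1 - mean) / sd
expected = months * 0.5 * erfc(z / sqrt(2))  # E[# months with avg magnitude >= 6.1]
print(round(expected, 1))
```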

OK, now let's talk bootstrapping. I resampled the historical dataset once, recomputed the average magnitudes by month, and ended up with the following:


```
months min median max mean sd
624 5.62 5.87 6.47 5.89 0.12
```

This is actually a little more variable than the historical data, but looks pretty similar. Now the normal approximation predicts 23.6 months with 6.1+ average magnitudes, and 27 such months were actually observed in this resampled dataset. So the normal approximation already fits the bootstrapped dataset better than it fits the actual historical data!

Is this a feature or coincidence? Let’s create 1,000 bootstrapped datasets to check. Here is a plot of the results.

We’ve finally managed to create a normal distribution out of earthquake data. Note that the observed mean across all simulations, 23.61, falls very close to the true count observed in the historical data set, 22.

How can this be? This last plot is a plot of aggregates: each observation is now a count of months across an entire 52-year period, not an individual month. We know that within each 52-year history earthquakes are not independent; however, the unit of analysis is now the set of "alternate 52-year histories," which are independent of one another. With 1,000 of these alternate histories, the normal approximation is very good.
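For concreteness, here is the shape of the bootstrap loop, sketched with stand-ins so it runs on its own: synthetic magnitudes drawn as 5.5 + an exponential with mean 0.38 (matching the observed min and mean; the real analysis used the Kaggle catalog), earthquakes dealt into months at random rather than by their actual dates, and 200 replicates instead of 1,000 to keep it quick. The printed numbers won't match the post's (an i.i.d. synthetic catalog is less clustered than real seismicity), but the resampling mechanics are the same:

```python
import random
import statistics

random.seed(0)
N_QUAKES, N_MONTHS, THRESH = 23412, 624, 6.1

# Synthetic stand-in for the catalog (assumption: shifted exponential magnitudes).
quakes = [5.5 + random.expovariate(1 / 0.38) for _ in range(N_QUAKES)]

def months_over(sample):
    """Deal a sample of quakes into N_MONTHS months at random and count
    months whose average magnitude reaches THRESH."""
    buckets = [[] for _ in range(N_MONTHS)]
    for m in sample:
        buckets[random.randrange(N_MONTHS)].append(m)
    return sum(statistics.fmean(b) >= THRESH for b in buckets if b)

# Bootstrap replicates: resample the quakes with replacement each time
# (200 here for speed; the post used 1,000).
counts = [months_over(random.choices(quakes, k=N_QUAKES)) for _ in range(200)]
print(statistics.fmean(counts), statistics.stdev(counts))
```

A histogram of `counts` is the analogue of the final plot: one count per alternate history, and it is this across-history distribution that the normal approximates so well.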

After all that, here are my takeaways.

1) Bootstrapping does not require the usual modeling assumptions on a dataset. If all you're trying to do is estimate a quantity like a mean or standard deviation, it doesn't matter whether the data follow a familiar distribution, whether individual observations are i.i.d., whether there is time dependence, and so on.

2) Plotting bootstrapped estimates tends to result in a normal distribution even when the underlying phenomenon is not normal, displays time dependence, spatial dependence, etc.

3) The fact that the author's bootstrapped Sharpe ratios follow a normal distribution demonstrates that he executed a bootstrap correctly, not that his underlying model was wrong. We actually can't say much about the underlying model just based on the bootstrapped results.

I'll even go one step further. If Professor Mandelbrot had created a stock market model that accounted for all the behaviors he observed, and I used that model to generate alternate histories of the stock market and computed their Sharpe ratios via bootstrapping, I'd wager those bootstrapped ratios would still follow a normal distribution!