# A statistical analysis of a malfunctioning hot water heater

Our shower experiment is straight forward: Every time someone takes a shower there is a certain probability that the water will turn cold. It either will turn cold or not, there are no other possibilities. We take a certain number of showers and count how often the water turned cold. The probability of getting a certain number of cold showers follows the binomial distribution, which is defined by the numbers of showers we take N and the probability for a cold shower in a single try p.

We do know the number of showers we took, but how do we get the single cold shower probability p? When we first looked at the data, we simply divided the number of bad showers, by the total number of showers. And this is actually the way to go here. Without noticing it, we applied a very common method of fitting the parameters of a theory to a given set of data: the maximum-likelihood estimation.

This works as follows: Using your theory, you calculate the probability for every data point you have. Then you multiply all those probabilities, which gives you the so called Likelihood L of that outcome. Finally you tweak the parameters of your theory as to maximize that likelihood. Or to be short: You search those parameters that are most likely to produce the result that you have measured.

In our case this is especially easy since we have only one data point (we treat open and closed window measurements separately) and just one parameter. So all we have to do, is to find the single bad shower probability p that yields the largest probability of our result. Mathematically speaking, we have to maximise this function:

Everyone who remembers their high school math classes should be able to do this, so I will not go into details about that. Let us just plot the likelihood L vs. p to have a visual “proof” that we really chose the most likely parameters:

# encoding=utf-8 from __future__ import division import matplotlib.pyplot as plt from scipy.stats import binom from scipy.interpolate import interp1d from numpy import linspace, vstack, reshape # the data Ncb = 7 Ncg = 16 Nob = 0 Nog = 21 # some variables for comfort Nc = Ncg + Ncb No = Nog + Nob Ng = Ncg + Nog Nb = Ncb + Nob N = Nc + No # initially "guessed" probabilities po = Nob / No pc = Ncb / Nc # Likelihood plot p_range = linspace(0.001, 0.599, 100) Lo = [] Lc = [] for p in p_range: Lo.append(binom(No, p).pmf(Nob)) Lc.append(binom(Nc, p).pmf(Ncb)) fig, ax = plt.subplots(1,1) ax.grid() ax.set_xlabel('probability $p$') ax.set_ylabel('likelihood $L$') ax.set_title('Likelihood plot') ax.plot(p_range, Lc, '-', label='closed window: $N=%d$, $N_b=%d$'%(Nc,Ncb)) ax.plot([pc, pc], [0, 0.5], 'k-', lw=2) ax.plot(p_range, Lo, '-', label='open window: $N=%d$, $N_b=%d$'%(No,Nob)) ax.plot([po, po], [0, 0.5], 'k-', lw=2) ax.legend(loc='best') fig.savefig('Plots/Likelihood.png')

Here I used the Scipy function binom to calculate the probability mass functions of the binomial distributions. As you can see, our chosen values — marked here with a vertical black line — reside at the maximum of the plotted Likelihoods.

Now we know the most likely probability, but we can be pretty sure that it’s not THE probability. Case in point: if I take another shower the numbers will change. So the best we can do is to find a range of values of which we are more or less sure that the actual probability lies in it. That kind of range is called a confidence interval.