Measuring the speed of sound

You probably noticed that the data points for the different distances are not spread out in the same way: at some distances the points cluster closely together, at others they scatter more widely. There are a couple of possible reasons for this. Depending on the position of the second microphone, the sound reflections in my little cramped room influence the measurement differently, making it more or less robust. Also, my clapping is probably not exactly consistent, and my form on a given day affects the quality of the measurements at the different distances.

So far, we have treated every point the same way: when we fit the linear function to the data, every point carries the same weight. But if we look at the data, we see that some points are obviously not as reliable as others. The spread of the points at 40 cm is much larger than at 50 cm, so why should we trust them equally? Well, we don’t have to.

All we have to do is take all the data points at each distance and average them into a single data point with an error that corresponds to the spread of the original points. Then we can tell the fit algorithm to pay less attention to the points with large errors than to those with small errors.

This is the point where the “serious business” starts. We actually want to work with the data we collected instead of just plotting it, and while you could in principle manipulate the data with any programming language and then plot it with gnuplot, I prefer to keep everything together. (Also, matplotlib plots just look nicer than gnuplot plots.)

My data manipulation and plotting toolset of choice: Python with matplotlib, numpy and scipy.

They are all free and available for all the major platforms.

This is not the place for an in-depth tutorial on Python or the libraries, but I’ll quickly go over the main points of my program (if you want a more detailed description of what the libraries can do, google them and take a look at their documentation):

#!/usr/bin/python

from matplotlib import pyplot as plt
from numpy import loadtxt, mean, std, array, linspace, sqrt
from scipy.optimize import curve_fit

# Load the raw data
data = loadtxt('sound.txt')

# Combine multiple data points at same distance into one with error
points = {}
for P in data:
    if P[0] not in points:
        points[P[0]] = []
    points[P[0]] += [P[1]]

x = array(sorted(points.keys()))
t = array(list(mean(list(points[xx])) for xx in x))
e = array(list(std(points[xx])/sqrt(len(points[xx])) for xx in x))

# Fit a linear function to the data
def linfunc(x, m, n):
    return m*x + n
#def errfunc(p, x, y, e):
#    return (y - linfunc(p, x)) / e
p0 = [0.0, 0.0]
p1, pcov = curve_fit(linfunc, x, t, p0, sigma=e)        # weight the points by their errors
print("m = {0} +- {1}".format(p1[0], sqrt(pcov[0,0])))  # The diagonal elements of the covariance matrix are
print("n = {0} +- {1}".format(p1[1], sqrt(pcov[1,1])))  # estimators for the variances of the fit parameters

# Make a nice plot
plt.figure(figsize=(8,4.8), dpi=100)
plt.subplot(111)
plt.xlabel(r"distance $x$ / [cm]")
plt.ylabel(r"time difference $\Delta t$ / [$\mu$s]")
plt.errorbar(x, t, e, fmt="r.", label="Data")
xx = linspace(5, 65, 2)
plt.plot(xx, linfunc(xx, *p1), 'g-', label=r"$\Delta t = m x + n$ with $m = {0:1.1f}\ \mu$s/cm, $n = {1:1.1f}\ \mu$s".format(*p1))
plt.legend(loc="upper left")
plt.xlim(0,70)
plt.savefig("sound_pylab.png")
plt.show()

After the mandatory shebang (working under Linux here) and the inclusion of all the necessary libraries, the first interesting thing happens at line 8: loadtxt is a very convenient function from numpy that lets you load a file of data in “standard format”, i.e. a couple of rows of numbers with whitespace as separator, in a single line of code. The return value is a 2-dimensional array, which we then further dissect in lines 11 to 15. There we put all the points at the same distance together into a dictionary, so we can conveniently access them later.
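If you want to reproduce this, the only assumption about sound.txt is that it contains two whitespace-separated columns: the distance in cm and the measured time difference in µs (the example values below are made up). As a sketch, the grouping step can also be written with a defaultdict, which does the same thing as lines 11 to 15:

from collections import defaultdict
from numpy import loadtxt

# Hypothetical contents of 'sound.txt' (distance / cm, time difference / µs):
#   40   1150.3
#   40   1132.8
#   50   1440.1
#   ...
data = loadtxt('sound.txt')        # 2D array: one row per measurement

# Group the time differences by distance (equivalent to lines 11 to 15 above)
points = defaultdict(list)
for distance, dt in data:
    points[distance].append(dt)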

In lines 17 to 19 we create the new (averaged) points as numpy arrays. Line 17 is straightforward: we take all the different (unique) distances and turn them into an array of, well, the distances. Line 18 is a bit more convoluted, as we have to convert between lists and arrays a few times to make use of Python’s handy “list(f(A) for A in B)” syntax. All we do there is create an array of the averages of the points at the given distances; these are our y-coordinates. Line 19 is equally convoluted and even contains a bit of statistics: instead of the averages, we create an array from the standard deviations of the points. As explained below, this gives us the standard error of our averaged points.
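If the list/array juggling looks opaque, here is an equivalent, slightly more direct way to build the same three arrays (just a readability sketch; it relies on the same numpy imports as the script above):

distances = sorted(points.keys())                                       # unique distances
x = array(distances)                                                    # x-coordinates
t = array([mean(points[d]) for d in distances])                         # averaged time differences
e = array([std(points[d]) / sqrt(len(points[d])) for d in distances])   # standard errors of the averages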

The numpy function std calculates the standard deviation of a given set of numbers. That standard deviation is a measure of how spread out the numbers in the set are. This is almost the number that we want, but only almost. We are not interested in the spread of the numbers in the set, but in the error of the average we calculate from those numbers! This is important.

If you draw a set of random numbers from a given distribution, then the standard deviation as calculated by numpy does not systematically depend on the number of points in that set, as it estimates a property of the distribution. It describes the expected error (i.e. the deviation from the actual mean of the distribution) if you randomly pick a single number from the set.

The standard error of the mean of the set, on the other hand, does depend on the number of points in the set. If you have a larger set, then the average of that set will (in general) be closer to the actual mean of the distribution than if you have only a small set of numbers. (More numbers mean that the statistical errors are more likely to cancel each other.) So this number describes the expected error (i.e. the deviation from the actual mean of the distribution) if you average over all the numbers in the set.

And to calculate the standard error of the mean of the set, you have to take the standard deviation of the set itself and divide it by the square root of the number of points in the set. So that’s exactly what we do in line 19. Again, at this point you will have to trust me and wait for a later article, or read a book on basic statistics. Or the Wikipedia article.
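In code the difference is just one division by the square root of the number of points. A quick sketch with made-up numbers (note that scipy.stats.sem computes almost the same quantity, but uses the N−1 normalization by default, while numpy’s std divides by N unless you pass ddof=1):

from numpy import array, mean, std, sqrt

a = array([1150.3, 1132.8, 1160.1, 1127.5])   # made-up repeat measurements at one distance

spread = std(a)                 # standard deviation: spread of the individual values
sem    = std(a) / sqrt(len(a))  # standard error of the mean: expected error of mean(a)

print(mean(a), spread, sem)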

In lines 21 to 29 we fit a linear function to the (now averaged) data, taking their errors into account. This means that points with large errors (i.e. low confidence in their accuracy) will have less influence on the result than points with small errors (i.e. high confidence in their accuracy). This is done by the method of weighted least squares (google it).
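For a linear model, the weighted least-squares solution can even be written down in closed form. The following sketch (the function name is just for illustration) gives the same best-fit parameters as curve_fit does in our case and makes the weighting explicit; the covariance it returns should correspond to curve_fit with absolute_sigma=True:

import numpy as np

def weighted_linear_fit(x, y, e):
    """Weighted least squares for y = m*x + n with weights 1/e**2."""
    w = 1.0 / e**2
    A = np.vstack([x, np.ones_like(x)]).T      # design matrix: columns [x, 1]
    AtW = A.T * w                              # A^T W (broadcasts the weights)
    cov = np.linalg.inv(AtW @ A)               # (A^T W A)^{-1}: parameter covariance
    m, n = cov @ (AtW @ y)                     # solve the weighted normal equations
    return m, n, cov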

The advantage of scipy’s curve_fit is that you can fit any function you can program in Python to the data at hand, not just simple analytic expressions, though we don’t really need that flexibility here.
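As a purely illustrative aside (none of this is needed for the sound measurement, and the decay function and its parameters are made up): fitting an arbitrary non-linear function works exactly the same way.

from numpy import exp, linspace, random
from scipy.optimize import curve_fit

def decay(t, A, tau, offset):
    return A * exp(-t / tau) + offset

t = linspace(0, 50, 60)
y = decay(t, 5.0, 12.0, 1.0) + random.normal(0.0, 0.1, t.size)  # synthetic "data"

popt, pcov = curve_fit(decay, t, y, p0=[1.0, 10.0, 0.0])
print(popt)   # should come out close to [5.0, 12.0, 1.0]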

Lines 31 to 42 are just there to produce a pretty plot that really shows off our science skills:

[Plot: the averaged data points with error bars and the weighted linear fit, as saved to sound_pylab.png]

This looks really good. And it yields the following results:

\begin{array}{rl} m & = (30.3 \pm 1.3) \ \mu \mathrm{s/cm} \\ n & = (-71.4 \pm 42.6) \ \mu \mathrm{s} \\ c_{S} & = (331 \pm 14) \ \mathrm{m/s} \ (4.3 \%) \end{array}
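For reference, the speed and its error follow from the slope by inverting it and propagating the uncertainty to first order; the relative error of the slope carries over directly:

c_{S} = \frac{1}{m} \ , \qquad \frac{\Delta c_{S}}{c_{S}} = \frac{\Delta m}{m} = \frac{1.3}{30.3} \approx 4.3\,\% \ \Rightarrow \ \Delta c_{S} \approx 14 \ \mathrm{m/s}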

We almost doubled our accuracy there! But we are still not done.

5 comments
  1. Brett_cgb said:

    Interesting analysis, but I’ve got a question. (I’m using MS Excel.)

    Q) My initial data plot and linear trendline exactly match yours. My trendline equation provides m=28.984 and n=-78.300 (same as you within the precision you provided). It doesn’t provide the +/-error for m or n. How was the +/- error calculated?

  2. Brett_cgb said:

    Ummm… It seems I’ve opened a can of worms, based on what I’ve been able to find.
    The answer to that question appears appropriate for a week’s worth of university statistics classes.

    • Ast said:

      Yeah… I don’t know where you can get this number in Excel, but gnuplot just gives us those numbers together with the fit parameters.

      Behind the curtain the calculation is indeed a bit complicated and involves the (estimated) Jacobian, which then gets turned into a variance-covariance matrix of the parameters. The square roots of the diagonal elements of that matrix are then used as estimates for the uncertainties of the parameters.

      If Excel does not offer a way to get these numbers in an easy way, I suggest you switch to another program for your data analysis. ;)

      And another tip: if you want to know how this stuff works, you should take a look at the documentation of programs that have implemented this functionality, e.g. gnuplot, scipy or any other open source scientific library. Usually there is at least a bit of theoretical background explained.

  3. Brett_cgb said:

    > you should take a look at the documentation of programs that have implemented
    > this functionality, e.g. gnuplot, scipy or ….

    That’s what inspired the “can of worms” comment…. B)
