**More on Fitting and Statistics:**

Quality Factor, Chi Squared, Estimating Parameters such as the Slope

A **statistic** is
any function of your data. For example,
suppose you measure the temperature outside ten times. You decide that you want
to report what the temperature was at that moment. You could just use your first measurement, T_{1},
or perhaps you decide to use the last measurement, T_{10}. You have many options:

$$T_1, \qquad T_{10}, \qquad \bar{T} = \frac{1}{10}\sum_{i=1}^{10} T_i, \qquad \frac{T_{\min} + T_{\max}}{2}, \qquad \ldots$$

Each one of the above choices is a function of the data and
therefore represents a statistic that one could use to estimate the **true value [1]**
of the temperature. Hopefully, most
physics students would have some basic intuition and their common sense might
suggest that the average value should be a good choice and that it should be a better
choice than most of the other suggestions above. For a moment, ask yourself: if you actually
performed the ten measurements and wanted to report the correct temperature,
what value would you report? Also ask whether
there are any arguments that you could use to justify your choice.
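As a concrete sketch of these options, the following snippet computes a few candidate statistics from ten hypothetical readings (the numbers are invented for illustration; the text above gives none):

```python
# Several candidate statistics for "the temperature", computed from
# ten hypothetical readings (values invented for illustration).
temps = [21.3, 20.9, 21.7, 21.1, 21.4, 20.8, 21.2, 21.6, 21.0, 21.5]

first = temps[0]                          # just use T_1
last = temps[-1]                          # just use T_10
average = sum(temps) / len(temps)         # the sample mean
midrange = (min(temps) + max(temps)) / 2  # halfway between extremes

print(first, last, average, midrange)
```

Each of these values is a function of the same ten numbers, and each is therefore a statistic one could report.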

Statistics is a field which addresses the question of which
potential functions of the data are best for estimating quantities of
interest. The above problem was
straightforward in that one directly measures the quantity of interest. However,
often the measurement is more complex.
For example, a student might measure the position and the time of a
moving object but be interested in the object’s velocity. If the object is moving at a constant
velocity one could plot the data and draw a straight line that passes as close
to the data as possible and then use the slope of this line to estimate the
velocity. It is not obvious that this
approach is related to the method described in the temperature example. Is this slope in any way a function of the
position-time data recorded? Is the slope a statistic, as defined above,
that represents an estimator for velocity?
The answer is yes. **Curve fitting,
although complex, is just a method for finding an estimator and in principle
that estimator will always be some function of the data used in the fit.**

Before pursuing the relationship between parameters extracted using curve fitting and explicit functions of the data, such as an average, let us consider how one judges a statistic. Let us return to the temperature measurement and attempt to evaluate the possible choices. First we introduce some qualities that a good statistic or estimator should have:

| Quality | Meaning |
| --- | --- |
| Consistent | Tends to be close to the true value |
| Efficient | A rating of how close the statistic is, on average, to the true value |
| Unbiased | The statistic is just as likely to be above the true value as below it |

Here we take the reasonable approach that there is a **TRUE **value and that one can imagine
many experiments done in the same manner.
By examining the methods used in an experiment one can build a theory
that predicts the likelihood or probability of getting an experimental
result. Let us imagine a simpler
problem. How likely is it to roll a die
and record five sixes in a row? Knowing
that a die has an equal likelihood of landing on any of its six faces, we can
calculate this probability. In fact,
we can calculate the probability for any combination, and therefore
predict how many times five sixes in a row will appear if we repeat the
five-roll experiment ten thousand
times. We don’t need to perform the
experiment to predict this result. We
also know that if we did perform the experiment the results will vary because a
die produces random results. In the same manner we can evaluate results from an
experiment and evaluate whether an average value or an individual measurement
has superior qualities as an estimator. We can evaluate the properties of each
statistic chosen to estimate the temperature.
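The dice calculation above takes a couple of lines. Assuming a fair die, the chance of five sixes in a row is $(1/6)^5$, and multiplying by the number of repeated five-roll experiments predicts how often it should occur:

```python
# Probability of rolling five sixes in a row with a fair die,
# and the expected count over ten thousand five-roll experiments.
p_five_sixes = (1 / 6) ** 5
expected = 10_000 * p_five_sixes
print(p_five_sixes, expected)
```

The prediction comes out to roughly 1.3 occurrences in ten thousand tries, without ever touching a die; an actual run of the experiment would scatter around that number.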

The definitions given in the table are paraphrased
definitions for quantities that are evaluated in the field of statistics to
judge the choices used to estimate quantities of interest. So if you want to
provide your best guess for the temperature you need to evaluate your choices
as to consistency, efficiency, bias and other characteristics. Informally, you
can probably argue that the average is better than any one measurement of the
temperature. This of course is not based
on a single experiment because it just might be the case that T_{7} is
exactly the true temperature when you performed the experiment. And therefore,
for your measurement, T_{7} would be the best choice (of course you do
not know this). However the recommend statistic is the average which was
evaluated based on the expected outcome of many temperature experiments. Over many trials you will find that the
average temperature is better that T_{7}. It is very unlikely that if
your classmate performs the same experiment as you that their T_{7}
will duplicate your result and be very close to the exact true temperature._{ }This is the sense in which we evaluate
performance of statistics in terms of quality factors such as bias. One needs to imagine many experiments of the
same type performed and characterize each potential statistics by seeing how it
performs.
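This "imagine many experiments" idea is easy to simulate. The sketch below assumes a true temperature and a measurement scatter (both invented for illustration), repeats the ten-measurement experiment many times, and compares the typical error of T_{7} alone against that of the average:

```python
import random

# Simulated "many experiments": TRUE_T and SIGMA are assumed values,
# invented purely for this illustration.
random.seed(0)
TRUE_T = 21.0
SIGMA = 0.5
TRIALS = 10_000

err_single = 0.0   # accumulated squared error of T_7 alone
err_mean = 0.0     # accumulated squared error of the average

for _ in range(TRIALS):
    temps = [random.gauss(TRUE_T, SIGMA) for _ in range(10)]
    err_single += (temps[6] - TRUE_T) ** 2        # T_7 is the 7th reading
    err_mean += (sum(temps) / 10 - TRUE_T) ** 2

rms_single = (err_single / TRIALS) ** 0.5
rms_mean = (err_mean / TRIALS) ** 0.5
print(rms_single, rms_mean)
```

In any single trial T_{7} may happen to land closer to the truth, but over many trials the average misses by roughly $\sqrt{10}$ times less, which is exactly the sense in which it is the better estimator.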

Hopefully, you are convinced that one can **formally introduce qualities that can be
used to evaluate the choice of a statistic and that a good experimenter will choose
estimators that have optimal properties**.
In P140L and P150L the student is usually given the method to use to
estimate values. We typically use
averages and fitting. This discussion only
serves to provide some background and a bit more insight into the problems of
estimation and statistics so that the student has the sense that methods
employed are justified by more rigorous mathematics.

As stated most students do not need to be convinced that using an average value is a reasonable choice for an estimator. Let us return to the question of fitting data as a method for extracting values or estimators.

Given a set of data, a student can draw a line through the points. To accomplish this formally, we define chi squared:

$$\chi^2 = \sum_{i=1}^{N} \frac{\left(y_i - f(x_i)\right)^2}{\sigma_i^2}$$

You take each data point $(x_i, y_i)$ and evaluate your
fitting function at the corresponding point, $f(x_i)$. If the data and fit
agreed perfectly then $y_i = f(x_i)$ for every point and $\chi^2 = 0$. If the data have
uncertainty then there will not be perfect agreement for every data point, but some
data points may be very close in value to the fit. Dividing these differences
by the uncertainties $\sigma_i$ allows you to put more weight on data with smaller
uncertainty.
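A minimal sketch of this computation, using invented $(x_i, y_i, \sigma_i)$ data and a straight line as the hypothetical fit function:

```python
# chi-squared of data against a candidate line y = m*x + b.
# The (x_i, y_i, sigma_i) values below are invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.8]
sigmas = [0.2, 0.2, 0.2, 0.2, 0.2]

def chi_squared(m, b):
    """Sum over points of ((y_i - f(x_i)) / sigma_i)^2 for f(x) = m*x + b."""
    return sum(((y - (m * x + b)) / s) ** 2
               for x, y, s in zip(xs, ys, sigmas))

print(chi_squared(2.0, 0.0))   # chi-squared for the guess y = 2x
```

Trying different values of `m` and `b` changes the result; fitting means searching for the pair that makes it smallest.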

Curve fitting → minimize $\chi^2$

Find those parameters that make $\chi^2$ as small as
possible. In fact, if the fit function
is a straight line,

$$f(x) = mx + b$$

Then the process of finding the smallest value can be carried out mathematically. For the case where all of the data have the same uncertainty,

$$m = \frac{N\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{N\sum_i x_i^2 - \left(\sum_i x_i\right)^2}$$

So one can simply plug their data into the above formula to
find the value for the slope that minimizes $\chi^2$.
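For instance, one could plug a small invented data set into the closed-form least-squares expressions (the standard normal-equation solution for a line when every point has the same uncertainty):

```python
# Closed-form least-squares slope and intercept for y = m*x + b,
# valid when all points share the same uncertainty.
# The data are invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.8]
n = len(xs)

sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - m * sx) / n
print(m, b)
```

Note that `m` is built entirely out of sums over the data, which makes explicit the earlier claim: the fitted slope is itself a statistic, a function of the data.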

If the fitting function is not linear, it is usually not possible to solve the equations for an exact formula, but the parameters can be determined numerically by a computer.
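As a toy illustration of such a numerical search, the sketch below fits an exponential model $y = A e^{-kx}$ to invented data by brute-force scanning a grid of parameter values for the smallest chi squared (real fitting programs use much smarter minimizers than this):

```python
import math

# Toy nonlinear fit: y = A * exp(-k * x) has no closed-form solution,
# so scan a parameter grid and keep the chi-squared minimum.
# The data are invented (generated near A = 2, k = 0.5); SIGMA assumed.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 1.21, 0.74, 0.45]
SIGMA = 0.05

def chi2(a, k):
    return sum(((y - a * math.exp(-k * x)) / SIGMA) ** 2
               for x, y in zip(xs, ys))

# Brute-force grid: A in [1.50, 2.50], k in [0.10, 1.00], step 0.01.
best = min((chi2(a / 100, k / 100), a / 100, k / 100)
           for a in range(150, 251)
           for k in range(10, 101))
print(best)   # (smallest chi2, best A, best k)
```

The grid search recovers parameters close to the values used to generate the data; a coarse grid or a bumpy chi-squared surface is one way such a process can return a false minimum.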

Further analysis of the curve fitting method has shown that
the resulting values for the parameters that minimize $\chi^2$ are good, well-behaved
estimators.

Before concluding, it is useful to look more carefully at the
definition of $\chi^2$. One can argue that
on average a data point will typically be about $\sigma_i$ away from the
fit. For the student this should seem
plausible. Therefore $\chi^2/N \approx 1$. This is called the
normalized value of chi squared and we expect it to be close to 1. If this value is too small then the fit is
too good and you have overestimated your errors. If it is too large then there is a
problem. If the best curve that you can
find misses the data by a large amount then something is wrong. It might be that the computer failed to find
the correct set of parameters and the minimization algorithm is reporting a bad
result; for complicated functions a
false minimum is sometimes returned. Or it
might be that your choice of fitting function is incorrect. You can’t, in general, put a straight line
through data that follow an exponential curve and have all of the data close to
the line.
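The wrong-model symptom is easy to demonstrate: fit a straight line to data that actually follow an exponential (invented here, with a small assumed uncertainty) and the normalized chi squared comes out far above 1:

```python
import math

# Straight-line least-squares fit (equal uncertainties) applied to
# data that actually follow y = exp(-x); SIGMA is an assumed small
# uncertainty, invented for illustration.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [math.exp(-x) for x in xs]
SIGMA = 0.01

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - m * sx) / n

chi2_norm = sum(((y - (m * x + b)) / SIGMA) ** 2
                for x, y in zip(xs, ys)) / n
print(chi2_norm)   # far above 1: the model is wrong, not the data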

**Summary**

- Estimators are statistics, which are some function of the data.
- Methods exist to evaluate the quality of these estimators.
- Fitting data to a curve is one way to find estimators for quantities of interest; although not obvious, these estimators are statistics and have been shown to be very good estimators.
- $\chi^2$ can be used to judge the quality of the fit.

[1] In most
cases the experimenter doesn’t know the true value. It is understood that at the moment you make
your measurement and at the location where you make your measurement there is
indeed an exact value for the temperature.
The **TRUE VALUE **is therefore
based on the premise that in principle one could carry out a measurement to an
accuracy that would reveal a value with an uncertainty approaching zero.