This year my linear algebra class is using regression to model real-world data. The data ranges from climate change to bank interest rates to chemical reactions.
A standard question is: given the data, how do we choose the model? Since most of the data is in two variables, say (x, y), here is the usual process.
First, plot the data using a scatter plot. This will give you an idea of whether to expect a linear or a nonlinear relationship between x and y.
Consider whether smoothing the data will help. If your graph looks like a noisy line or a noisy quadratic, a rolling average will make it smoother.
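As a rough illustration of the smoothing step, here is a minimal Python sketch of a rolling average using NumPy. The helper name `rolling_average`, the window size of 5, and the noisy-line data are all assumptions made up for this example, not part of the class material.

```python
import numpy as np

def rolling_average(values, window=5):
    """Return the moving average of `values` over `window` consecutive points."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits,
    # so the result has len(values) - window + 1 points.
    return np.convolve(values, kernel, mode="valid")

# Hypothetical noisy line: y = 2x + 1 plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Average x the same way so each smoothed y stays paired with the right x.
x_smooth = rolling_average(x, window=5)
y_smooth = rolling_average(y, window=5)
```

A larger window gives a smoother curve but also fewer points and more blurring of genuine features, so it is worth trying a couple of window sizes before settling on one.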
Decide on the model. Looking at the scatter plot you should be able to get an idea if the relationship between x and y is linear, quadratic, cubic and so on.
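Write out your model, for example $y = ax^2 + bx + c$. Each data point $(x_i, y_i)$, when plugged into the model, gives a linear equation in the parameters a, b, c. For the collection of these linear equations you can write the matrix equation $A\mathbf{x} = \mathbf{b}$ (note that here $\mathbf{x}$ contains the parameters as its elements, and the vector $\mathbf{b}$ collects the observed y values).
Use the normal equation $A^T A \hat{\mathbf{x}} = A^T \mathbf{b}$ or the equivalent equation $\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}$ to find the best fit parameter values ($\hat{\mathbf{x}}$).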
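The steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the quadratic model, the synthetic data, and the variable names (`A`, `params`) are chosen for the example, and the right-hand side vector is simply called `y` in the code.

```python
import numpy as np

# Hypothetical data generated from y = 0.5 x^2 - 2 x + 3 with a little noise.
rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 30)
y = 0.5 * x**2 - 2.0 * x + 3.0 + rng.normal(scale=0.1, size=x.size)

# Each data point contributes one row [x_i^2, x_i, 1] to the matrix A,
# so A @ [a, b, c] reproduces the model at every x_i.
A = np.column_stack([x**2, x, np.ones_like(x)])

# Normal equation: (A^T A) params = A^T y.
# Solving the system directly avoids forming an explicit inverse.
params = np.linalg.solve(A.T @ A, A.T @ y)
a, b, c = params

# Cross-check against NumPy's built-in least-squares solver.
params_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
```

For a different model, only the columns of `A` change. A straight line, for instance, would use the columns `x` and `np.ones_like(x)`.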
The easiest way of solving the above equations is to use MATLAB or Mathematica, which have built-in functions for matrix manipulation. (Most programming languages, such as C or Python, also have corresponding libraries.) However, writing the code yourself is better in terms of gaining skills and making your foundations stronger.
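For those who want to try writing the code themselves, here is one possible sketch in plain Python with no libraries. It assumes a quadratic model and solves the normal equation by Gaussian elimination with partial pivoting. The function names (`transpose`, `matmul`, `solve`, `fit_quadratic`) and the small data set are invented for this illustration.

```python
def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = transpose(Y)
    return [[sum(p * q for p, q in zip(row, col)) for col in Yt] for row in X]

def solve(M, v):
    """Solve M p = v by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [v[i]] for i in range(n)]  # augmented matrix [M | v]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    p = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * p[j] for j in range(i + 1, n))
        p[i] = (aug[i][n] - known) / aug[i][i]
    return p

def fit_quadratic(xs, ys):
    """Best-fit [a, b, c] for y = a x^2 + b x + c via the normal equation."""
    A = [[x * x, x, 1.0] for x in xs]
    At = transpose(A)
    AtA = matmul(At, A)
    Aty = [sum(p * y for p, y in zip(row, ys)) for row in At]
    return solve(AtA, Aty)

# Hypothetical data lying exactly on y = 2x^2 - x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 7.0, 16.0, 29.0]
a, b, c = fit_quadratic(xs, ys)
```

Because the sample points lie exactly on a quadratic, the fitted parameters should match the generating values up to rounding error, which makes this a handy sanity check before moving on to noisy data.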
Please note that even if you find splines easy to use for interpolation, regression is a better choice for modeling as the resultant equation is simpler.
Remember that if the parameters are instead found by using calculus to minimize the squared magnitude of the error vector, one gets the same best fit parameter values. That method is known as “the least squares method”.
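To see why the two approaches agree, here is a short derivation. The notation is an assumption for this sketch: the matrix equation is written as $A\mathbf{x} = \mathbf{b}$, with $\mathbf{x}$ the vector of parameters and $\mathbf{b}$ the vector of observed y values. Setting the gradient of the squared error to zero gives back the normal equation.

```latex
\begin{align*}
E(\mathbf{x}) &= \lVert A\mathbf{x} - \mathbf{b} \rVert^2
               = (A\mathbf{x} - \mathbf{b})^T (A\mathbf{x} - \mathbf{b}) \\
              &= \mathbf{x}^T A^T A \mathbf{x} - 2\,\mathbf{x}^T A^T \mathbf{b} + \mathbf{b}^T \mathbf{b} \\
\nabla E(\mathbf{x}) &= 2 A^T A \mathbf{x} - 2 A^T \mathbf{b} = \mathbf{0}
\quad\Longrightarrow\quad
A^T A \hat{\mathbf{x}} = A^T \mathbf{b}
\end{align*}
```

When the columns of $A$ are linearly independent, $A^T A$ is invertible, so this critical point is unique and is the minimum.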
You may post your doubts below, and if you are at Ahmedabad University, catch me after class.