Regression
Correlation
When studying random variables, it is often useful to check their independence or the nature of their relationships. The analysis of relationships between variables is generally carried out with graphical tools (scatterplots), combined with numerical indicators (correlation coefficients).
The best known correlation coefficient is Pearson's. It quantifies the intensity of the linear relationship between two variables. It is constructed by dividing the covariance of the variables X and Y by the product of the standard deviations of X and Y, which yields a result between -1 and 1.
- A correlation coefficient close to 1 indicates a strong positive correlation: the variables move in the same direction and are closely linked.
- A correlation coefficient close to -1 indicates a strong negative correlation: the variables move in opposite directions and are closely linked.
- A correlation coefficient close to 0 shows that the variables are not linearly correlated. This is not sufficient to conclude that the variables are independent.
Since the Pearson correlation coefficient is very sensitive to the presence of outliers, it is strongly recommended to use it in conjunction with a graph to avoid misinterpretation.
The use of correlation coefficients is not limited to continuous variables: the correlation of ranked variables can also be measured. The Kendall coefficient, for example, quantifies the relationship between the ranks (or rankings) of observations.
When we want to study the correlation of several variables, we usually use a matrix of scatterplots.
SOSstat allows you to analyze the correlation of several variables. The correlation (or covariance) results are returned in a matrix, which can also be visualized as an image with a color scale corresponding to the correlation level.
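As a minimal sketch of these ideas outside SOSstat (using NumPy and SciPy, with made-up data; the variable names are illustrative, not part of any SOSstat API), the Pearson coefficient, Kendall's tau, and a correlation matrix for several variables can be computed as follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # strongly linked to x
z = rng.normal(size=200)                       # independent of x

# Pearson coefficient: covariance divided by the product of standard deviations
r_xy = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Kendall's tau works on ranks, so it is less sensitive to outliers
tau, _ = stats.kendalltau(x, y)

# Correlation matrix of several variables at once
corr = np.corrcoef(np.vstack([x, y, z]))
print(np.round(corr, 2))
```

A scatterplot of each pair should always accompany these numbers, since a single outlier can distort the Pearson coefficient considerably.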
Simple regression
Having established the existence of a correlation between two random variables, we may legitimately try to model this relationship. The purpose of regression is to determine this model. Although the notions of correlation and regression are often linked, it is worth recalling their differences.
Correlation quantifies the intensity of the relationship between two variables (their degree of dependence) in order to determine whether they statistically move in the same or in opposite directions. Regression takes a slightly different approach: from the pairs of values (x, y), we build a model that predicts the values of Y knowing the values of X. X is called the "explanatory variable" and Y the "explained variable".
Model Estimation
Let X and Y be two dependent random variables, from which we wish to predict y knowing x.
If we write the linear prediction model in the form:

    ŷ = a·x + b

then applying the least squares method yields (see demonstration below) the values of the coefficients:

    a = cov(X, Y) / var(X)
    b = ȳ − a·x̄
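These closed-form estimates can be checked numerically. The sketch below (NumPy, with simulated data; the true parameters 3 and 5 are chosen arbitrarily for illustration) computes the slope and intercept from the covariance and variance, and cross-checks them against NumPy's least-squares polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 5.0 + rng.normal(scale=1.0, size=50)

# Least-squares estimates: a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
a = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b = y.mean() - a * x.mean()

# Cross-check against NumPy's degree-1 least-squares fit
a_ref, b_ref = np.polyfit(x, y, deg=1)
assert np.allclose([a, b], [a_ref, b_ref])
print(a, b)
```

Both routes give the same coefficients, since a degree-1 polynomial fit minimizes exactly the same sum of squared deviations.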
Quality of the model
Unfortunately, a good application of the least squares method does not guarantee that the model is of good quality. Indeed, the quality of the predictions will depend on:
- The choice of the model. A linear model may be a poor choice when the relationship between the variables is not linear.
- The intensity of the relationship between the two variables. If the dependence of Y on X is weak, it will be very difficult to predict the realizations of Y knowing X.
To verify that the model has a satisfactory "explanatory" character, the decomposition of variance is studied. In regression, the total observed variability (SCT) is assumed to be the sum of the variability explained by the model (SCM) and the residual variability (SCE), i.e. the deviations that the model cannot predict. This relationship is expressed as a sum of squared deviations:

    SCT = SCM + SCE

where
- SCT = Σ(yᵢ − ȳ)² is the total sum of squares, i.e. the variability of the measurements around the mean.
- SCE = Σ(yᵢ − ŷᵢ)² is the sum of the squares of the measurements around the estimates, i.e. the residual variability.
- SCM = Σ(ŷᵢ − ȳ)² is the sum of the squares of the estimates around the mean, i.e. the variability explained by the model.
This decomposition of the sources of variation makes it possible to establish the coefficient of determination:

    R² = SCM / SCT = 1 − SCE / SCT
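The decomposition can be verified numerically. This sketch (NumPy, simulated data chosen for illustration) fits a simple linear model, computes the three sums of squares, and checks that SCT = SCM + SCE holds:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=100)

# Fit the simple linear model and compute the predictions
a, b = np.polyfit(x, y, deg=1)
y_hat = a * x + b

# Variance decomposition: SCT = SCM + SCE
sct = np.sum((y - y.mean()) ** 2)      # total variability
scm = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the model
sce = np.sum((y - y_hat) ** 2)         # residual variability

r2 = scm / sct  # coefficient of determination
assert np.isclose(sct, scm + sce)
print(round(r2, 3))
```

Note that the identity SCT = SCM + SCE holds exactly for a least-squares fit that includes an intercept term; R² then measures the fraction of the total variability captured by the model.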
As can be seen in the animation below, implementing a simple regression is extremely easy in SOSstat. The regression graph can display the confidence interval of the model as well as the prediction interval of the observations. In addition, residuals are analyzed to verify their independence and normality.
SOSstat also provides numerical results, including the model coefficients (several models are available) and the coefficient of determination:
Multiple regression
A natural extension of simple regression, presented above, is multiple or multilinear regression, which allows a model involving several input variables to be identified.
Presentation of the problem
In this section, we present the mathematical formalism used to generalize the least squares method to systems with several variables. Matrix methods are particularly well suited to solving systems of n equations in n unknowns, thanks to their compact notation and straightforward computer implementation.
When we carry out a regression, our aim is to identify the parameters of a mathematical model. Consider the case of a system comprising two factors A and B, which we wish to represent with a first-order model with interaction. This can be written in a generic way:

    y = a₀ + a₁·x₁ + a₂·x₂ + a₁₂·x₁·x₂

This mathematical model applies to each combination of factor levels (4 experiments if both factors have two levels each). We can thus express the responses as:

    y₁ = a₀ + a₁·x₁₁ + a₂·x₂₁ + a₁₂·x₁₁·x₂₁
    y₂ = a₀ + a₁·x₁₂ + a₂·x₂₂ + a₁₂·x₁₂·x₂₂
    y₃ = a₀ + a₁·x₁₃ + a₂·x₂₃ + a₁₂·x₁₃·x₂₃
    y₄ = a₀ + a₁·x₁₄ + a₂·x₂₄ + a₁₂·x₁₄·x₂₄
This system of four equations can be represented in matrix form:

    Y = X·a

by adopting the following notations:
- Y is the vector of the system's responses;
- a is the vector of the model's coefficients, i.e. the effects and interactions (which are organized in a very precise order);
- X is the experiment matrix, which describes the succession of experiments performed, or the observations of the explanatory variables.

In this representation Y and X are known, and we seek to identify the coefficients a.
Solution
The mathematical model we have sought to identify is purely theoretical, since it represents a deterministic relationship between the explained variable Y and the explanatory variables X. To take random phenomena into account, a term ε representing the residuals (the deviations not covered by the model) is added:

    Y = X·a + ε

To solve this system of equations, the multiple (or multilinear) regression technique is used. It seeks the solution that minimizes the sum of the squared differences between the model and the experimental results (least squares). The solution of a linear system under the least-squares criterion is given by the relationship:

    â = (XᵀX)⁻¹·Xᵀ·Y
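The normal-equations solution can be sketched in a few lines of NumPy (the two-factor data and coefficient values below are hypothetical, chosen only to illustrate the technique). The experiment matrix X contains a column of ones for the constant term, the two factors, and their interaction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40

# Hypothetical observations of two explanatory variables
x1 = rng.uniform(-1, 1, size=n)
x2 = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.1, size=n)

# Experiment matrix X: constant term, factors, and interaction
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Least-squares solution of the normal equations: a = (X'X)^-1 X'Y
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same problem with a more stable factorization
a_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(a_hat, a_ref)
print(np.round(a_hat, 2))
```

In practice a QR or SVD based solver such as `lstsq` is preferred over explicitly forming (XᵀX)⁻¹, which can be numerically ill-conditioned.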
SOSstat offers a multiple regression module for building complex models mixing continuous and discrete variables. SOSstat calculates the model coefficients and performs tests to determine whether they are significant. Numerous residual-analysis graphs complete the calculations, helping to identify outliers or to highlight a lack of fit (i.e. an inappropriate model).
SOSstat also offers a prediction module based on the regression model, which allows the user to easily find the configuration of the variables that targets the desired response.