
Multivariate Statistical Analysis: An Introduction. Basic Concepts of Factor Analysis and the Problems It Solves

There are situations in which random variability is represented by one or two random variables (features).

For example, when studying a statistical population of people, we may be interested in height and weight. No matter how many people the population contains, we can always plot a scatterplot of these two features and see the whole picture. But if a third feature is added, say a person's age, the scatterplot must be built in three-dimensional space, and visualizing a set of points in three dimensions is already quite difficult.

In practice, each observation is represented not by one, two, or three numbers, but by a sizable set of numbers describing dozens of features. In this situation, constructing a scatterplot would require working with multidimensional spaces.

The branch of statistics devoted to the study of experiments with multivariate observations is called multivariate statistical analysis.

Measuring several features (properties of an object) at once in one experiment is generally more natural than measuring only one or two. Therefore, multivariate statistical analysis potentially has a wide field of application.

Multivariate statistical analysis includes the following sections:

Factor analysis;

Discriminant analysis;

Cluster analysis;

Multidimensional scaling;

Quality control methods.

Factor analysis

In the study of complex objects and systems (for example, in psychology, biology, sociology), the quantities (factors) that determine the properties of these objects very often cannot be measured directly, and sometimes even their number and substantive meaning are unknown. But other quantities that depend in one way or another on the factors of interest may be available for measurement. When the influence of an unknown factor shows up in several measured features, these features exhibit a close relationship with each other, and the total number of factors can be much smaller than the number of measured variables.

Factor analysis methods are used to identify factors influencing the measured variables.

An example of the use of factor analysis is the study of personality traits based on psychological tests. Personality properties are not amenable to direct measurement, they can only be judged by a person's behavior or the nature of the answers to certain questions. To explain the results of the experiments, they are subjected to factor analysis, which makes it possible to identify those personal properties that influence the behavior of the individuals being tested.


Various models of factor analysis are based on the following hypothesis: the observed or measured parameters are only indirect characteristics of the object or phenomenon under study; in reality there exist internal (hidden, latent, not directly observable) parameters and properties, few in number, that determine the values of the observed parameters. These internal parameters are called factors.

The task of factor analysis is to represent the observed parameters as linear combinations of factors plus, possibly, some additional, insignificant perturbations.

The first stage of factor analysis is, as a rule, the selection of new features that are linear combinations of the old ones and that "absorb" most of the total variability of the observed data, and therefore convey most of the information contained in the original observations. This is usually done with the principal component method, although other techniques, such as the maximum likelihood method, are sometimes used.

The principal component method reduces to choosing a new orthogonal coordinate system in the observation space. The direction along which the array of observations has the greatest scatter is taken as the first principal component; each subsequent principal component is chosen so that the scatter of observations along it is maximal and so that it is orthogonal to the principal components chosen earlier. However, the factors obtained by the principal component method usually do not lend themselves to a sufficiently clear interpretation, so the next step in factor analysis is a transformation (rotation) of the factors to ease interpretation.
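To make the geometry concrete, here is a minimal sketch (Python with NumPy; the data matrix is synthetic, not from any example in this text) of principal components obtained as the orthogonal eigenvectors of the covariance matrix, ordered by the scatter they capture:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # hypothetical data: 100 observations, 5 features

Xc = X - X.mean(axis=0)                 # center the observations
cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # orthogonal directions and their variances

order = np.argsort(eigvals)[::-1]       # sort directions by decreasing scatter
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                   # coordinates in the new orthogonal system
print(eigvals / eigvals.sum())          # share of total variability per component
```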

Discriminant Analysis

Let there be a set of objects divided into several groups, such that for each object we can determine which group it belongs to, and suppose measurements of several quantitative characteristics are available for each object. We need a way to determine, from these characteristics alone, the group to which an object belongs; this makes it possible to assign new objects of the same collection to the groups. Methods of discriminant analysis are applied to solve this problem.

Discriminant analysis is a branch of statistics concerned with developing methods for solving problems of distinguishing (discriminating) between objects of observation on the basis of certain characteristics.

Let's look at some examples.

Discriminant analysis proves handy when processing the test results of individuals in hiring for a particular position: all candidates must be divided into two groups, "suitable" and "not suitable".

Discriminant analysis can be used by a bank's administration to assess the financial state of clients' affairs when issuing them loans: based on a number of features, the bank classifies clients as reliable or unreliable.

Discriminant analysis can also be used to divide a set of enterprises into several homogeneous groups according to the values of indicators of their production and economic activity.

The methods of discriminant analysis make it possible to construct functions of the measured characteristics whose values explain the division of the objects into groups. It is desirable that there be few of these functions (discriminant features); in that case the results of the analysis are easier to interpret meaningfully.

Due to its simplicity, linear discriminant analysis, in which the classifying features are chosen as linear functions of the primary features, plays a special role.
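A minimal sketch of such a linear rule (scikit-learn; the two groups and their three measured features are synthetic, standing in for "suitable"/"not suitable" candidates):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# two groups of objects described by 3 measured quantitative characteristics
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(2.0, 1.0, (30, 3))])
y = np.array([0] * 30 + [1] * 30)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)   # the discriminating feature is linear in X
print(lda.predict(X[:2]))          # group assignment for (here, known) objects
```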

Cluster analysis

Cluster analysis methods make it possible to divide the studied set of objects into groups of "similar" objects, called clusters.

The word "cluster" is of English origin and translates as bunch, bundle, group, swarm, concentration.

Cluster analysis solves the following tasks:

It classifies objects taking into account all the features that characterize them. The very possibility of such a classification advances us toward a deeper understanding of the population under consideration and of the objects in it;

It poses the problem of checking whether an a priori given structure or classification exists in the population at hand. Such verification makes it possible to apply the standard hypothetico-deductive scheme of scientific research.

Most clustering (hierarchical grouping) methods are agglomerative (unifying): they start by creating elementary clusters, each consisting of exactly one initial observation (one point), and at each subsequent step the two closest clusters are merged into one.

The moment of stopping this process can be set by the researcher (for example, by specifying the required number of clusters or the maximum distance at which the union is achieved).

A graphical representation of the cluster-merging process can be obtained with a dendrogram, a tree of cluster merges.
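A minimal sketch of agglomerative clustering and its dendrogram (SciPy and matplotlib; the five two-feature objects are placeholders, not the enterprise data below):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

# each object starts as its own elementary cluster;
# at every step the two closest clusters are merged
Z = linkage(X, method="average")   # inter-group (average) linkage, Euclidean distance
dendrogram(Z)
plt.show()
```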

Consider the following example. Let's classify five enterprises, each of which is characterized by three variables:

x1 – average annual cost of fixed production assets, billion rubles;

x2 – material costs per 1 ruble of manufactured products, kopecks;

x3 – volume of manufactured products, billion rubles.

The textbook was created on the basis of the author's experience teaching courses in multivariate statistical analysis and econometrics. It contains material on discriminant, factor, regression, and correspondence analysis and on time series theory, and it outlines approaches to multidimensional scaling and some other problems of multivariate statistics.

Grouping and censoring.
The researcher first solves the task of forming groups of sample data in such a way that the grouped data provide almost as much information for decision making as the sample before grouping. The goals of grouping are usually to reduce the amount of information, simplify calculations, and make the data more visual. Some statistical tests are designed from the start to work with grouped samples. In certain respects the grouping problem is very close to the classification problem, which is discussed in more detail below. Along with grouping, the researcher also solves the problem of censoring the sample, i.e. excluding outlying data points, which are usually the result of gross observational errors. Naturally, it is desirable to prevent such errors during the observations themselves, but this is not always possible. The simplest methods for solving these two problems are discussed in this chapter.

Table of contents
1 Preliminary information
1.1 Analysis and algebra
1.2 Probability theory
1.3 Mathematical statistics
2 Multivariate distributions
2.1 Random vectors
2.2 Independence
2.3 Numerical characteristics
2.4 Normal distribution in the multivariate case
2.5 Correlation theory
3 Grouping and censoring
3.1 One-dimensional grouping
3.2 One-dimensional censoring
3.3 Contingency tables
3.3.1 Independence hypothesis
3.3.2 Homogeneity hypothesis
3.3.3 Correlation field
3.4 Multidimensional grouping
3.5 Multidimensional censoring
4 Non-numeric data
4.1 Introductory remarks
4.2 Comparison scales
4.3 Expert judgment
4.4 Expert groups
5 Confidence sets
5.1 Confidence intervals
5.2 Confidence sets
5.2.1 Multidimensional parameter
5.2.2 Multivariate sampling
5.3 Tolerance sets
5.4 Small sample
6 Regression analysis
6.1 Problem statement
6.2 Searching for GMS
6.3 Restrictions
6.4 Design matrix
6.5 Statistical forecast
7 Analysis of variance
7.1 Introductory remarks
7.1.1 Normality
7.1.2 Homogeneity of variances
7.2 One factor
7.3 Two factors
7.4 General case
8 Dimensionality reduction
8.1 Why classification is needed
8.2 Model and examples
8.2.1 Principal component analysis
8.2.2 Extreme feature grouping
8.2.3 Multidimensional scaling
8.2.4 Selection of indicators for discriminant analysis
8.2.5 Feature selection in a regression model
9 Discriminant analysis
9.1 Applicability of the model
9.2 Linear predictive rule
9.3 Practical recommendations
9.4 One example
9.5 More than two classes
9.6 Checking the quality of discrimination
10 Heuristic methods
10.1 Extreme grouping
10.1.1 Criterion of squares
10.1.2 Module criterion
10.2 Pleiades method
11 Principal component analysis
11.1 Statement of the problem
11.2 Calculation of principal components
11.3 Example
11.4 Principal component properties
11.4.1 Self-reproducibility
11.4.2 Geometric properties
12 Factor analysis
12.1 Statement of the problem
12.1.1 Connection with principal components
12.1.2 Uniqueness of the solution
12.2 Mathematical model
12.2.1 Conditions on AᵀA
12.2.2 Conditions on the loading matrix. The centroid method
12.3 Latent factors
12.3.1 Bartlett method
12.3.2 Thomson method
12.4 Example
13 Digitization
13.1 Correspondence analysis
13.1.1 Chi-square distance
13.1.2 Digitization for discriminant analysis problems
13.2 More than two variables
13.2.1 Using a binary data matrix as a mapping matrix
13.2.2 Maximum correlations
13.3 Dimension
13.4 Example
13.5 Mixed data case
14 Multidimensional scaling
14.1 Introductory remarks
14.2 Thorgerson model
14.2.1 Stress criterion
14.3 Thorgerson's algorithm
14.4 Individual differences
15 Time series
15.1 General
15.2 Randomness criteria
15.2.1 Peaks and pits
15.2.2 Phase length distribution
15.2.3 Criteria based on rank correlation
15.2.4 Correlogram
15.3 Trend and seasonality
15.3.1 Polynomial trends
15.3.2 Selecting the degree of trend
15.3.3 Smoothing
15.3.4 Estimating seasonal fluctuations
A Normal distribution
B Chi-square distribution
C Student's t-distribution
D Fisher distribution



Analysis of variance.

The purpose of analysis of variance is to test the statistical significance of differences between means (for groups or variables). The test is carried out by splitting the sum of squares into components, i.e. by splitting the total variance (variation) into parts, one of which is due to random error (within-group variability) and another of which is associated with differences between the mean values. The latter component of the variance is then used to analyze the statistical significance of the difference between the means. If this difference is significant, the null hypothesis is rejected and the alternative hypothesis, that there is a difference between the means, is accepted.

Splitting the sum of squares. For a sample of size n, the sample variance is calculated as the sum of squared deviations from the sample mean divided by n-1 (the sample size minus one). Thus, for a fixed sample size n, the variance is a function of the sum of squares (of deviations). Analysis of variance is based on dividing this variance into parts: the sample is split into groups, and within each group the mean and the sum of squared deviations are calculated. Computing the same quantities for the sample as a whole gives a larger value of variance, and the excess is explained by the discrepancy between the group means. Thus, analysis of variance makes it possible to separate between-group variability from within-group variability, something that cannot be detected when the whole sample is studied as a single group.

Significance testing in ANOVA is based on comparing the variance component due to between-group spread with the variance component due to within-group spread (called the mean squared error). If the null hypothesis is correct (the means of the two populations are equal), then we can expect a relatively small difference in the sample means due to purely random variability; under the null hypothesis the within-group variance will nearly coincide with the total variance computed without regard to group membership. The two variance estimates can be compared with the F-test, which checks whether their ratio is indeed significantly greater than 1.
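A one-way illustration of this F-comparison (SciPy; the three group samples are invented):

```python
from scipy import stats

g1 = [23, 25, 21, 22, 24]
g2 = [28, 27, 30, 26, 29]
g3 = [22, 21, 23, 20, 24]

# F = between-group variance estimate / within-group variance estimate
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)   # a small p-value rejects the equality of the group means
```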

Advantages: (1) analysis of variance is much more efficient and, for small samples, more informative; (2) analysis of variance allows one to detect interaction effects between factors and therefore to test more complex hypotheses.

The principal component method performs linear dimensionality reduction: pairwise orthogonal directions of maximum variation of the input data are determined, and the data are then projected onto the lower-dimensional space spanned by the components with the greatest variation.

Principal component analysis is part of factor analysis and consists, in the simplest case, of combining two correlated variables into one factor. If the two-variable example is extended to more variables, the calculations become more complex, but the basic principle of representing two or more dependent variables by a single factor remains valid.

When reducing the number of variables, the decision about when to stop the factor extraction procedure depends mainly on one's view of what counts as small "random" variability. With repeated iterations, factors with less and less variance are extracted.

Centroid method for determining factors.

The centroid method is also used in cluster analysis. In the unweighted centroid method, the distance between two clusters is defined as the distance between their centers of gravity.

The weighted centroid method (median) is identical to the non-weighted method, except that weights are used in the calculations to take into account the difference between cluster sizes (i.e., the number of objects in them). Therefore, if there are (or are suspected) significant differences in cluster sizes, this method is preferable to the previous one.

Cluster analysis.

The term cluster analysis actually covers a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into visual structures, i.e. to identify clusters of similar objects. Indeed, cluster analysis is not so much an ordinary statistical method as a "set" of various algorithms for "distributing objects into clusters". There is a point of view that, unlike many other statistical procedures, cluster analysis methods are mostly used when you have no a priori hypotheses about the classes and are still at the descriptive stage of research. It should be understood that cluster analysis determines the most plausibly meaningful solution.

Tree clustering algorithm. The purpose of this algorithm is to combine objects into sufficiently large clusters using some measure of similarity or distance between objects. A typical result of such clustering is a hierarchical tree, shown as a diagram that starts with each object as its own class (on the left side). Now imagine that, gradually and in very small steps, you relax your criterion for which objects count as distinct: you lower the threshold for the decision to combine two or more objects into one cluster. As a result, you link more and more objects together and aggregate larger and larger clusters of increasingly dissimilar elements. Finally, at the last step, all objects are merged. In these charts the horizontal axis represents the merging distance (in vertical dendrograms the vertical axis does). For each node in the graph (where a new cluster is formed) you can read off the distance at which the corresponding elements were linked into a new single cluster. When the data have a clear "structure" of clusters of mutually similar objects, that structure is likely to be reflected in the hierarchical tree as distinct branches. A successful analysis by the joining method makes it possible to detect clusters (branches) and interpret them.

Discriminant analysis is used to decide which variables distinguish (discriminate) between two or more naturally occurring populations (groups). Its most common application is to include many variables in a study in order to determine those that best separate the populations; in other words, you want to build a "model" that best predicts the population to which a given observation belongs. In what follows, the term "in the model" refers to the variables used to predict population membership; variables not used for this are said to be "outside the model".

In stepwise discriminant function analysis, the discrimination model is built step by step. At each step all variables are reviewed and the one contributing most to the separation of the populations is found; that variable is included in the model, and the procedure moves to the next step.

One can also proceed in the opposite direction: first include all variables in the model and then eliminate, at each step, the variables that contribute little to the predictions. Then, as the result of a successful analysis, only the "important" variables are retained in the model, that is, the variables whose contribution to the discrimination is greater than that of the rest.

This step-by-step procedure is "guided" by the corresponding F value for inclusion and the corresponding F value for exclusion. The F value of a statistic for a variable indicates its statistical significance in discriminating between populations, that is, it is a measure of the variable's contribution to predicting population membership.

For two groups, discriminant analysis can also be considered as a multiple regression procedure. If you code two groups as 1 and 2 and then use these variables as dependent variables in a multiple regression, you will get results similar to those you would get with discriminant analysis. In general, in the case of two populations, you fit a linear equation of the following type:

Group = a + b1*x1 + b2*x2 + ... + bm*xm

where a is a constant and b1...bm are the regression coefficients. The interpretation of the results of the problem with two populations closely follows the logic of applying multiple regression: variables with the largest regression coefficients contribute the most to discrimination.
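A sketch of this regression view of two-group discrimination (NumPy; the data and the 1/2 group coding are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.5, 1.0, (20, 2))])
group = np.array([1.0] * 20 + [2.0] * 20)        # group codes as dependent variable

A = np.column_stack([np.ones(len(X)), X])        # design matrix [1, x1, x2]
coef, *_ = np.linalg.lstsq(A, group, rcond=None) # a, b1, b2 as in Group = a + b1*x1 + b2*x2
print(coef)

pred = np.where(A @ coef < 1.5, 1, 2)            # threshold halfway between the codes
print((pred == group).mean())                    # share of objects classified correctly
```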

If there are more than two groups, then more than one discriminant function can be estimated, similarly to the above. For example, with three populations one can estimate (1) a function for discriminating between population 1 and populations 2 and 3 taken together, and (2) another function for discriminating between population 2 and population 3. For instance, there could be one function discriminating between high-school graduates who go to college and those who do not, and a second function discriminating between those non-college graduates who want to get a job and those who want to continue studying. The coefficients b in these discriminant functions are interpreted as before.

Canonical correlation.

Canonical analysis is designed to analyze dependencies between two sets (lists) of variables. When canonical roots are computed, the eigenvalues of the correlation matrix are calculated. These values equal the proportion of variance explained by the correlation between the corresponding canonical variables. The resulting proportion is computed relative to the variance of the canonical variables, i.e. of the weighted sums over the two sets of variables; thus the eigenvalues do not show the absolute amount of variance explained in the respective canonical variables.

Taking the square root of the obtained eigenvalues gives a set of numbers that can be interpreted as correlation coefficients. Since these are correlations between canonical variables, they are called canonical correlations. Like the eigenvalues, the correlations between the canonical variables extracted at each successive step decrease. However, later canonical variables can also be significantly correlated, and these correlations often allow a quite meaningful interpretation.
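A sketch of extracting canonical variables and their correlations (scikit-learn's CCA; the two variable sets are synthetic and deliberately related):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))                                 # first set of variables
Y = X @ rng.normal(size=(3, 2)) + rng.normal(size=(200, 2))   # second set, related to X

cca = CCA(n_components=2).fit(X, Y)
U, V = cca.transform(X, Y)             # canonical variables: weighted sums of each set

for k in range(2):                     # canonical correlations, decreasing root by root
    print(np.corrcoef(U[:, k], V[:, k])[0, 1])
```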

The significance test for canonical correlations is relatively simple. Canonical correlations are evaluated one after another in decreasing order, and only roots that prove statistically significant are kept for further analysis. In reality the calculations proceed a little differently: the program first evaluates the significance of the whole set of roots, then of the set remaining after removing the first root, then the second root, and so on.

Studies have shown that the test used detects large canonical correlations even with a small sample size (for example, n = 50). Weak canonical correlations (e.g., R = 0.3) require large sample sizes (n > 200) to be detected 50% of the time. Note that canonical correlations of small size are usually of no practical value, since they correspond to little real variability in the original data.

Canonical weights. After determining the number of significant canonical roots, the question arises of the interpretation of each (significant) root. Recall that each root actually represents two weighted sums, one for each set of variables. One way of interpreting the "meaning" of each canonical root is to consider the weights associated with each set of variables. These weights are also called canonical weights.

In interpretation, one usually relies on the rule that the greater the assigned weight (i.e., the absolute value of the weight), the greater the contribution of the corresponding variable to the value of the canonical variable.

If you are familiar with multiple regression, you can interpret canonical weights the same way as the beta weights in a multiple regression equation. Canonical weights are, in a sense, analogous to the partial correlations of the variables corresponding to the canonical root. Thus, examining the canonical weights makes it possible to understand the "meaning" of each canonical root, i.e. to see how the specific variables in each set affect the weighted sum (the canonical variable).

Parametric and non-parametric methods for evaluating results.

Parametric methods are based on the sampling distribution of particular statistics. In short, if you know the distribution of the observed variable, you can predict how the statistic used will "behave" in repeated samples of equal size, i.e. how it will be distributed.

In practice, the use of parametric methods is limited by the sample size available for analysis and by problems with accurately measuring the features of the observed object.

Thus, there is a need for procedures that handle "low-quality" data from small samples with variables about whose distribution little or nothing is known. Non-parametric methods are designed precisely for those situations, common in practice, when the researcher knows nothing about the parameters of the population under study (hence the name: non-parametric). In more technical terms, non-parametric methods do not rely on estimating parameters (such as the mean or standard deviation) when describing the sampling distribution of the quantity of interest. Therefore, these methods are sometimes also called parameter-free or distribution-free.

Essentially, for every parametric test there is at least one non-parametric counterpart. These criteria can be classified into one of the following groups:

criteria for differences between groups (independent samples);

criteria for differences between groups (dependent samples);

criteria for dependence between variables.

Differences between independent groups. Typically, when there are two samples (for example, men and women) to be compared with respect to the mean of some variable of interest, the t-test for independent samples is used. Its non-parametric alternatives are the Wald-Wolfowitz runs test, the Mann-Whitney U test, and the two-sample Kolmogorov-Smirnov test. With more than two groups, ANOVA can be used; its non-parametric counterparts are the Kruskal-Wallis rank analysis of variance and the median test.
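For illustration (SciPy; the two samples are invented), the parametric test and two of its non-parametric counterparts named above:

```python
from scipy import stats

men = [170, 168, 175, 172, 169, 171]
women = [162, 160, 165, 158, 163, 161]

print(stats.ttest_ind(men, women))      # parametric t-test for independent samples
print(stats.mannwhitneyu(men, women))   # Mann-Whitney U test
print(stats.ks_2samp(men, women))       # two-sample Kolmogorov-Smirnov test
```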

Differences between dependent groups. To compare two variables measured on the same sample (for example, students' math performance at the beginning and at the end of the semester), the t-test for dependent samples is usually used. Alternative non-parametric tests are the sign test and the Wilcoxon matched-pairs test. If the variables are categorical or categorized (i.e., represented as frequencies falling into certain categories), McNemar's chi-square test is appropriate. If more than two variables from the same sample are considered, repeated-measures analysis of variance (ANOVA) is usually used; its non-parametric alternatives are Friedman's rank analysis of variance and Cochran's Q test (the latter is used, for example, when the variable is measured on a nominal scale). Cochran's Q test is also used to assess changes in frequencies (shares).

Dependencies between variables. To evaluate the dependence (relationship) between two variables, a correlation coefficient is usually computed. Non-parametric analogues of the standard Pearson correlation coefficient are Spearman's R, Kendall's tau, and the gamma coefficient. In addition, a criterion of dependence between several variables is available: Kendall's coefficient of concordance. It is often used to assess the consistency of the opinions of independent experts (judges), in particular of scores given to the same subject.
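A short sketch of these rank-based measures (SciPy; the two expert rankings are invented):

```python
from scipy import stats

judge1 = [1, 2, 3, 4, 5, 6, 7, 8]        # ranks assigned by one expert
judge2 = [2, 1, 4, 3, 6, 5, 8, 7]        # ranks assigned by another expert

print(stats.pearsonr(judge1, judge2))    # standard Pearson correlation
print(stats.spearmanr(judge1, judge2))   # Spearman's R
print(stats.kendalltau(judge1, judge2))  # Kendall's tau
```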

If the data are not normally distributed and the measurements contain at best ranked information, then computing the usual descriptive statistics (e.g., mean, standard deviation) is not very informative. For example, it is well known in psychometrics that the perceived intensity of a stimulus (e.g., the perceived brightness of light) is a logarithmic function of the actual intensity (luminance measured in objective units, lux). In this example, the usual estimate of the mean (the sum of the values divided by the number of stimuli) does not give a correct idea of the mean actual stimulus intensity; the geometric mean should be computed instead. Non-parametric statistics compute a diverse set of measures of location (mean, median, mode, etc.) and dispersion (variance, harmonic mean, quartile range, etc.) to present a fuller "big picture" of the data.
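A tiny numeric check of the geometric-mean point (the intensity values are made up):

```python
import numpy as np

luminance = np.array([10.0, 100.0, 1000.0])  # objective intensities, lux
print(luminance.mean())                      # arithmetic mean: 370.0
print(np.exp(np.log(luminance).mean()))      # geometric mean: 100.0, the midpoint
                                             # on the logarithmic (perceptual) scale
```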

Social and economic objects are, as a rule, characterized by a fairly large number of parameters forming multidimensional vectors. Problems of studying the relationships between the components of these vectors are of particular importance in economic and social research, and these relationships must be identified from a limited number of multidimensional observations.

Multivariate statistical analysis is a branch of mathematical statistics that studies methods of collecting, systematizing, and processing multivariate statistical data in order to reveal the nature and structure of the relationships between the components of the studied multivariate attribute and to draw practical conclusions.

Note that data collection methods may vary: if the world economy is studied, it is natural to take countries as the objects on which the values of the vector X are observed; if a national economic system is studied, it is natural to observe the values of the vector X in the same country (the one of interest to the researcher) at different points in time.

Statistical methods such as multiple correlation and regression analysis are traditionally studied in the courses of probability theory and mathematical statistics, the discipline "Econometrics" is devoted to the consideration of applied aspects of regression analysis.

This manual is devoted to other methods of studying multivariate general populations based on statistical data.

Methods for reducing the dimension of a multidimensional space allow, without significant loss of information, to move from the original system of a large number of observed interrelated factors to a system of a significantly smaller number of hidden (unobservable) factors that determine the variation of the initial features. The first chapter describes the methods of component and factor analysis, which can be used to identify objectively existing, but not directly observable patterns using principal components or factors.

Multidimensional classification methods are designed to divide collections of objects (characterized by a large number of features) into classes, each of which should include objects that are homogeneous or similar in a certain sense. Such a classification, based on statistical data on the values of the features of the objects, can be carried out by the methods of cluster and discriminant analysis discussed in the second chapter (Multivariate statistical analysis using STATISTICA).

The development of computer technology and software contributes to the widespread introduction of multivariate statistical analysis methods into practice. Application packages with a convenient user interface, such as SPSS, Statistica, SAS, etc., remove the difficulties in applying these methods, which are the complexity of the mathematical apparatus based on linear algebra, probability theory and mathematical statistics, and the cumbersomeness of calculations.

However, the use of programs without understanding the mathematical essence of the algorithms used contributes to the development of the researcher's illusion of the simplicity of using multivariate statistical methods, which can lead to incorrect or unreasonable results. Significant practical results can be obtained only on the basis of professional knowledge in the subject area, supported by the knowledge of mathematical methods and application packages in which these methods are implemented.

Therefore, for each of the methods considered in this book, basic theoretical information is given, including algorithms; the implementation of these methods and algorithms in application packages is discussed. The considered methods are illustrated with examples of their practical application in economics using the SPSS package.

The manual is written on the basis of the experience of teaching the course "Multivariate statistical methods" to students of the State University of Management. For a more detailed study of the methods of applied multivariate statistical analysis, further reading is recommended.

It is assumed that the reader is well acquainted with courses in linear algebra, probability theory, and mathematical statistics (for example, to the extent of standard textbooks and their appendices).

Introduction

Chapter 1 Multiple Regression Analysis

Chapter 2. Cluster analysis

Chapter 3. Factor Analysis

Chapter 4. Discriminant Analysis

Bibliography

Introduction

Initial information in socio-economic studies is most often presented as a set of objects, each characterized by a number of features (indicators). Since the number of such objects and features can reach tens and hundreds, and visual analysis of these data is ineffective, problems arise of reducing and concentrating the initial data and of revealing their structure and interrelationships by constructing generalized characteristics of the set of features and the set of objects. Such problems can be solved by the methods of multivariate statistical analysis.

Multivariate statistical analysis is a branch of statistics devoted to mathematical methods aimed at identifying the nature and structure of the relationships between the components of the studied multivariate attribute and intended for obtaining scientific and practical conclusions.

The main attention in multivariate statistical analysis is paid to mathematical methods for constructing optimal plans for collecting, systematizing and processing data, aimed at identifying the nature and structure of relationships between the components of the studied multivariate attribute and intended to obtain scientific and practical conclusions.

The initial array of multidimensional data for multivariate analysis is usually the results of measuring the components of a multivariate attribute for each object of the studied population, i.e. a sequence of multivariate observations. The multivariate attribute is most often interpreted as a multidimensional random variable, and the sequence of observations as a sample from the general population. In this case, the choice of method for processing the initial statistical data is made on the basis of certain assumptions about the nature of the distribution law of the studied multidimensional attribute.

1. Multivariate statistical analysis of multivariate distributions and their main characteristics covers situations where the processed observations are of a probabilistic nature, i.e. interpreted as a sample from the corresponding general population. The main tasks of this subsection include: statistical estimation of the studied multivariate distributions and their main parameters; study of the properties of the statistical estimates used; study of probability distributions for a number of statistics, which are used to build statistical criteria for testing various hypotheses about the probabilistic nature of the analyzed multivariate data.

2. Multivariate statistical analysis of the nature and structure of the interrelations of the components of the studied multivariate attribute combines the concepts and results inherent in such methods and models as regression analysis, analysis of variance, covariance analysis, factor analysis, etc. The methods belonging to this group include both algorithms based on the assumption of the probabilistic nature of the data and methods that do not fit into the framework of any probabilistic model (the latter are often referred to as data analysis methods).

3. Multidimensional statistical analysis of the geometric structure of the studied set of multivariate observations combines the concepts and results inherent in such models and methods as discriminant analysis, cluster analysis, and multidimensional scaling. Central to these models is the concept of distance, or a measure of proximity, between the analyzed elements viewed as points of some space. Both objects (as points in feature space) and features (as points in object space) can be analyzed in this way.

The applied value of multivariate statistical analysis consists mainly in solving the following three problems:

· the task of statistically studying the dependencies between the indicators under consideration;

· the task of classifying elements (objects or features);

· the task of reducing the dimension of the feature space under consideration and selecting the most informative features.

Multiple regression analysis is designed to build a model that allows one to obtain estimates of the dependent variable from the values of the independent variables.

Logistic regression is used for solving classification problems. It is a type of multiple regression whose purpose is to analyze the relationship between several independent variables and a categorical dependent variable.
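A minimal sketch of such a classification model (scikit-learn; the two-class data are synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)     # class labels: the dependent variable

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X[:3]))     # class probabilities for individual objects
```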

Factor analysis deals with the determination of a relatively small number of hidden (latent) factors, the variability of which explains the variability of all observed indicators. Factor analysis is aimed at reducing the dimension of the problem under consideration.

Cluster and discriminant analysis are designed to divide collections of objects into classes, each of which should include objects that are homogeneous or close in a certain sense. In cluster analysis, it is not known in advance how many groups of objects will turn out and what size they will be. Discriminant analysis divides objects into pre-existing classes.

Chapter 1 Multiple Regression Analysis

Assignment: Research of the housing market in Orel (Soviet and Northern regions).

The table shows data on the price of apartments in Orel and on various factors that determine it:

· total area;

· kitchen area;

· living space;

· type of house;

· number of rooms (Fig. 1).

Fig. 1. Initial data

In the column "Region" the designations are used:

3 - Soviet (elite, belongs to the central regions);

4 - North.

In the column "Type of house":

1 - brick;

0 - panel.

Required:

1. Analyze the relationship of all factors with the "Price" indicator and among themselves. Select the factors most suitable for building a regression model;

2. Construct a dummy variable that reflects the belonging of the apartment to the central and peripheral areas of the city;

3. Build a linear regression model for all factors, including a dummy variable in it. Explain the economic meaning of the parameters of the equation. Evaluate the quality of the model, the statistical significance of the equation and its parameters;

4. Distribute the factors (except for the dummy variable) according to the degree of influence on the “Price” indicator;

5. Build a linear regression model for the most influential factors, leaving a dummy variable in the equation. Evaluate the quality and statistical significance of the equation and its parameters;

6. Justify the expediency or inexpediency of including a dummy variable in the equation of paragraphs 3 and 5;

7. Estimate interval estimates of the parameters of the equation with a probability of 95%;

8. Determine how much an apartment with a total area of 74.5 m² will cost in the elite district and in the peripheral district.

Performance:

1. After analyzing the relationship of all factors with the “Price” indicator and among themselves, the factors most suitable for building a regression model were selected using the “Forward” inclusion method:

a) total area;

b) number of rooms.

Included/excluded variables(a)

a Dependent variable: Price

2. The variable X4 "Region" is a dummy variable, since it takes two values: 3 for the central district ("Soviet") and 4 for the peripheral district ("Severny").

3. Let's build a linear regression model for all factors (including the dummy variable X4).

Resulting model:

Evaluation of the quality of the model.

Standard error = 126.477

Durbin-Watson statistic = 2.136

Checking the Significance of the Regression Equation

F-Fisher test value = 41.687

4. Let's build a linear regression model with all factors (except for the dummy variable X4)

The factors were ranked by their degree of influence on the "Price" indicator:

The most significant factor is the total area (F= 40.806)

The second most important factor is the number of rooms (F= 29.313)

5. Included/excluded variables

a Dependent variable: Price

6. Let's build a linear regression model for the most influential factors with a dummy variable, in our case it is one of the influential factors.

Resulting model:

Y = 348.349 + 35.788*X1 - 217.075*X4 + 305.687*X7

Evaluation of the quality of the model.

Determination coefficient R2 = 0.807

It shows the proportion of variation of the resulting feature explained by the factors studied. Hence, about 81% of the variation of the dependent variable is accounted for by the factors included in the model.

Multiple correlation coefficient R = 0.898

Shows the closeness of the relationship between the dependent variable Y with all explanatory factors included in the model.

Standard error = 126.477

Durbin-Watson statistic = 2.136

Checking the Significance of the Regression Equation

F-Fisher test value = 41.687

The regression equation should be recognized as adequate, the model is considered significant.

The most significant factor is the number of rooms (F = 41.687)

The second most important factor is the total area (F= 40.806)

The third most important factor is the region (F= 32.288)

7. The dummy variable X4 is a significant factor, so it is advisable to include it in the equation.

The interval estimates of the equation parameters show the results of forecasting by the regression model.

With a probability of 95%, the volume of sales in the forecast month will be from 540.765 to 1080.147 million rubles.

8. Determining the cost of an apartment in the elite district:

For 1 room: Y = 348.349 + 35.788*74.5 - 217.075*3 + 305.687*1

For 2 rooms: Y = 348.349 + 35.788*74.5 - 217.075*3 + 305.687*2

For 3 rooms: Y = 348.349 + 35.788*74.5 - 217.075*3 + 305.687*3

and in the peripheral district:

For 1 room: Y = 348.349 + 35.788*74.5 - 217.075*4 + 305.687*1

For 2 rooms: Y = 348.349 + 35.788*74.5 - 217.075*4 + 305.687*2

For 3 rooms: Y = 348.349 + 35.788*74.5 - 217.075*4 + 305.687*3
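These six values can be computed directly from the fitted equation; a small check in plain Python (the coefficients are taken from the model above):

```python
# fitted model: Y = 348.349 + 35.788*X1 - 217.075*X4 + 305.687*X7
def price(total_area, district_code, rooms):
    return 348.349 + 35.788 * total_area - 217.075 * district_code + 305.687 * rooms

for district_code, name in [(3, "elite"), (4, "peripheral")]:
    for rooms in (1, 2, 3):
        print(name, rooms, round(price(74.5, district_code, rooms), 3))
```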

Chapter 2. Cluster analysis

Assignment: Study of the structure of monetary expenditures and savings of the population.

The table shows the structure of cash expenditures and savings of the population by regions of the Central Federal District of the Russian Federation in 2003, for the following indicators:

· PTIOU – purchase of goods and payment for services;

· OPiV – obligatory payments and contributions;

· PN – purchase of real estate;

· PFA – increase in financial assets;

· DR – increase (decrease) of money held by the population.

Fig. 8. Initial data

Required:

1) determine the optimal number of clusters for dividing regions into homogeneous groups according to all grouping characteristics simultaneously;

2) carry out the classification of areas by a hierarchical method with an algorithm of intergroup relations and display the results in the form of a dendrogram;

3) analyze the main priorities of cash spending and savings in the resulting clusters;

Performance:

1) Determine the optimal number of clusters for dividing regions into homogeneous groups according to all grouping characteristics simultaneously;

To determine the optimal number of clusters, run hierarchical cluster analysis and examine the "Coefficients" column of the "Agglomeration Schedule" table.

These coefficients are the distances between the two clusters merged at each step, computed with the chosen distance measure (here, Euclidean distance). At the step where the distance between two merged clusters jumps sharply, the merging process should be stopped.

Accordingly, the optimal number of clusters equals the difference between the number of observations (17) and the number of the step (14) after which the coefficient jumps. Thus, the optimal number of clusters is 3 (Fig. 9).
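Outside SPSS, the same jump rule can be sketched with SciPy's agglomeration schedule (the data here are synthetic, not the Central Federal District figures):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(5)
# 18 observations in 3 well-separated groups, 5 features each
X = np.vstack([rng.normal(c, 0.3, (6, 5)) for c in (0, 3, 6)])

Z = linkage(X, method="average")   # Z[:, 2] plays the role of the "Coefficients" column
d = Z[:, 2]
step = np.argmax(np.diff(d)) + 1   # step after which the coefficient jumps
print(len(X) - step)               # optimal number of clusters (3 here)
```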


Fig. 9. "Agglomeration Schedule" table

2) Carry out the classification of areas by a hierarchical method with an algorithm of intergroup relations and display the results in the form of a dendrogram;

Now, using the optimal number of clusters, we classify the regions by the hierarchical method, and in the output we turn to the "Cluster Membership" table (Fig. 10).

Fig. 10. "Cluster Membership" table

Fig. 10 clearly shows that cluster 3 includes two oblasts (Kaluga, Moscow) plus the city of Moscow; cluster 2 includes nine regions (Bryansk, Voronezh, Ivanovo, Lipetsk, Oryol, Ryazan, Smolensk, Tambov, Tver); and cluster 1 includes Belgorod, Vladimir, Kostroma, Kursk, Tula, and Yaroslavl.

Fig. 11. Dendrogram

3) analyze the main priorities of cash spending and savings in the resulting clusters;

To analyze the resulting clusters, we run a "Comparison of means". The output window displays the following table (Fig. 12).

Fig. 12. Mean values of the variables

In the "Mean values" table we can trace which items are given the highest priority in the distribution of cash expenditures and savings of the population.

First of all, it should be noted that the highest priority in all areas is given to the purchase of goods and payment for services. The parameter takes a larger value in the 3rd cluster.

Second place is occupied by the increase in financial assets, with the highest value in cluster 1.

The smallest coefficient in clusters 1 and 2 is for the purchase of real estate, while cluster 3 shows a noticeable decrease in money held by the population.

In general, the purchase of goods and services and the insignificant purchase of real estate are of particular importance for the population.

4) compare the resulting classification with the results of applying the intragroup relationship algorithm.

With the within-group linkage algorithm the situation hardly changed, except for the Tambov region, which moved from cluster 2 to cluster 1 (Fig. 13).

Fig. 13. Within-group linkage analysis

There were no changes in the "Averages" table.

Chapter 3. Factor Analysis

Task: Analysis of the activities of light industry enterprises.

Survey data are available for 20 light industry enterprises (Fig. 14) according to the following characteristics:

· X1 – level of capital productivity;

· X2 – labor intensity per unit of production;

· X3 – share of purchased materials in total costs;

· X4 – equipment shift coefficient;

· X5 – bonuses and remuneration per employee;

· X6 – share of losses from defects;

· X7 – average annual cost of fixed production assets;

· X8 – average annual wage fund;

· X9 – level of marketability of products;

· X10 – permanent asset index (ratio of fixed assets and other non-current assets to own funds);

· X11 – turnover of working capital;

· X12 – non-production costs.

Fig. 14. Initial data

Required:

1. conduct a factor analysis of the variables 1, 3, 5-7, 9, 11, 12, and identify and interpret the factor features;

2. indicate the most prosperous and promising enterprises.

Performance:

1. Conduct a factor analysis of the variables 1, 3, 5-7, 9, 11, 12, and identify and interpret the factor features.

Factor analysis is a set of methods that, on the basis of real-life relationships of objects (features), make it possible to identify latent (implicit) generalizing characteristics of the organizational structure.

In the factor analysis dialog box, select our variables, specify the necessary parameters.

Fig. 15. Total variance explained

The "Total variance explained" table shows that 3 factors were extracted, explaining 74.8% of the variation of the variables; the constructed model is quite good.
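For readers working outside SPSS, a minimal sketch of the same extract-and-rotate pipeline (scikit-learn's FactorAnalysis; varimax rotation assumes scikit-learn 0.24 or newer, and the data matrix is hypothetical):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 8))            # hypothetical: 20 enterprises, 8 indicators

Z = StandardScaler().fit_transform(X)   # standardize the indicators
fa = FactorAnalysis(n_components=3, rotation="varimax").fit(Z)

loadings = fa.components_.T             # 8 x 3 matrix of rotated loadings
print(np.round(loadings, 2))            # interpret factors by large |loadings|
```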

Now we interpret the factor features using the "Rotated Component Matrix" (Fig. 16).

Fig. 16. Rotated component matrix

Factor 1 is most closely related to the level of product sales and has an inverse relationship with non-production costs.

Factor 2 is most closely related to the share of purchased materials in total costs and the share of losses from defects, and is inversely related to bonuses and remuneration per employee.

Factor 3 is most closely related to the level of capital productivity and turnover of working capital and has an inverse relationship with the average annual cost of fixed assets.

2. Indicate the most prosperous and promising enterprises.

To identify the most prosperous enterprises, we sort the data by the three factor scores in descending order (Fig. 17).

The most prosperous enterprises are 13, 4, and 5: overall, across the three factors, their indicators occupy the highest and most stable positions.

Chapter 4. Discriminant Analysis

Assessment of the creditworthiness of legal entities in a commercial bank

The bank selected six significant indicators characterizing the financial condition of borrower organizations (Table 4.1.1):

QR (X1) - quick liquidity ratio;

CR (X2) - current liquidity ratio;

EQ/TA (X3) - financial independence ratio;

TD/EQ (X4) - total liabilities to equity capital;

ROS (X5) - profitability of sales;

FAT (X6) - turnover of fixed assets.

Table 4.1.1. Initial data


Required:

Based on a discriminant analysis using the SPSS package, determine which of the four categories three borrowers (legal entities) wishing to obtain a loan from a commercial bank belong to:

§ Group 1 - with excellent financial performance;

§ Group 2 - with good financial performance;

§ Group 3 - with poor financial performance;

§ Group 4 - with very poor financial performance.

Based on the calculation results, construct the discriminant functions and evaluate their significance by Wilks' lambda (λ). Build a perceptual map and diagrams of the relative positions of the observations in the space of the three functions. Interpret the results of the analysis.

Progress:

To determine which of the four categories the three borrowers wishing to obtain a loan from the commercial bank belong to, we perform a discriminant analysis, which allows us to determine to which of the previously identified populations (training samples) new clients should be assigned.

As the dependent variable we choose the group to which a borrower may belong, depending on its financial performance. From the task data, each group is assigned a corresponding code: 1, 2, 3, or 4.

The unstandardized canonical coefficients of the discriminant functions shown in Fig. 4.1.1 are used to construct the equations of the discriminant functions D1(X), D2(X), and D3(X); the coefficient values, including the constants, are given in Fig. 4.1.1.
Fig. 4.1.1. Coefficients of the canonical discriminant functions

Fig. 4.1.2. Wilks' lambda

However, since the Wilks' lambda significance level (Fig. 4.1.2) of the second and third functions exceeds 0.001, it is not advisable to use them for discrimination.

The data in the "Classification Results" table (Fig. 4.1.3) indicate that the classification was correct for 100% of the observations, with high accuracy achieved in all four groups (100%).

Fig. 4.1.3. Classification results

Information about the actual and predicted groups for each borrower is given in the table "Point Statistics" (Fig. 4.1.4).

As a result of the discriminant analysis, it was determined that the bank's new borrowers belong to the training subset M1: the first, second, and third borrowers (serial numbers 41, 42, 43) are assigned to subset M1, each with probability 100%.

(The table lists, for each observation, the observation number, the actual group, the most likely group, and the predicted group; the three new borrowers appear as "ungrouped".)

Fig. 4.1.4. Point statistics

The coordinates of the centroids of each group are given in the "Functions at group centroids" table (Fig. 4.1.5). They are used to plot the centroids on the perceptual map (Fig. 4.1.6).

Fig. 4.1.5. Functions at group centroids

Fig. 4.1.6. Perceptual map for the two discriminant functions D1(X) and D2(X) (* denotes a group centroid)

The field of the territorial map is divided by the discriminant functions into four areas: on the left are mainly observations of the fourth group of borrowers (very poor financial performance), on the right the first group (excellent financial performance), and in the middle and lower parts the third and second groups (poor and good financial performance, respectively).

Fig. 4.1.7. Scatterplot for all groups

Fig. 4.1.7 shows the combined plot of the distribution of all groups of borrowers together with their centroids; it can be used for a comparative visual analysis of the relative position of the groups of bank borrowers in terms of financial indicators. On the right side of the plot are borrowers with high indicators, on the left with low ones, and in the middle with average financial indicators. Since, according to the calculation results, the second discriminant function D2(X) turned out to be insignificant, the differences in centroid coordinates along this axis are small.

Assessment of the creditworthiness of individuals in a commercial bank

The credit department of a commercial bank conducted a sample survey of 30 of its clients (individuals). Based on a preliminary analysis of the data, borrowers were evaluated according to six indicators (Table 4.2.1):

· X1 – whether the borrower has previously taken loans from commercial banks;

· X2 – average monthly income of the borrower's family, thousand rubles;

· X3 – loan repayment term (period), years;

· X4 – amount of the loan issued, thousand rubles;

· X5 – size of the borrower's family, persons;

· X6 – age of the borrower, years.

At the same time, three groups of borrowers were identified according to the probability of loan repayment:

§ Group 1 - with a low probability of loan repayment;

§ Group 2 - with an average probability of loan repayment;

§ Group 3 - with a high probability of loan repayment.

Required:

Based on discriminant analysis using the SPSS package, classify the three bank clients (by the probability of loan repayment), i.e. assess which of the three groups each of them belongs to. Based on the calculation results, construct the significant discriminant functions and evaluate their significance by Wilks' lambda (λ). In the space of the two discriminant functions, construct diagrams of the mutual arrangement of the observations for each group and a combined diagram. Assess the location of each borrower on these charts. Interpret the results of the analysis.

Table 4.2.1. Initial data

Progress:

For the discriminant analysis we choose the probability of timely loan repayment by a client as the dependent variable. Since it can be low, medium, or high, each category is assigned a corresponding code: 1, 2, or 3.

The unstandardized canonical coefficients of the discriminant functions shown in Fig. 4.2.1 are used to construct the equations of the discriminant functions D1(X) and D2(X); the coefficient values are given in Fig. 4.2.1.

Fig. 4.2.1. Coefficients of the canonical discriminant functions

Fig. 4.2.2. Wilks' lambda

According to Wilks' lambda (Fig. 4.2.2), the significance level of the second function exceeds 0.001, so it is not advisable to use it for discrimination.

The data in the "Classification Results" table (Fig. 4.2.3) indicate that the classification was correct for 93.3% of the observations; high accuracy was achieved in the first and second groups (100% and 91.7%), with less accurate results in the third group (88.9%).

Fig. 4.2.3. Classification results

Information about the actual and predicted groups for each client is given in the table "Point statistics" (Fig. 4.2.4).

As a result of the discriminant analysis, it was determined that the bank's new clients belong to the training subset M3: the first, second, and third clients (serial numbers 31, 32, 33) are assigned to subset M3 with probabilities 99%, 99%, and 100%, respectively.

(The table lists, for each observation, the observation number, the actual group, the most likely group, and the predicted group; the three new clients appear as "ungrouped".)

Fig. 4.2.4. Point statistics

Fig. 4.2.5. Functions at group centroids

The coordinates of the centroids of each group are given in the "Functions at group centroids" table (Fig. 4.2.5). They are used to plot the centroids on the perceptual map (Fig. 4.2.6).

The territorial map field is divided by the discriminant functions into three areas: on the left are mainly observations of the first group of clients (low probability of repaying the loan), on the right the third group (high probability), and in the middle the second group (average probability).

Fig. 4.2.7 (a-c) shows the location of the clients of each of the three groups on the plane of the two discriminant functions D1(X) and D2(X). These plots allow a detailed analysis of the probability of loan repayment within each group, a judgment about how the clients are distributed, and an assessment of how far each lies from the corresponding centroid.

Fig. 4.2.6. Perceptual map for the two discriminant functions D1(X) and D2(X) (* denotes a group centroid)

Fig. 4.2.7 (d) shows, in the same coordinate system, the combined plot of the distribution of all client groups together with their centroids; it can be used for a comparative visual analysis of the relative position of the groups of bank clients with different probabilities of loan repayment. On the left side of the plot are borrowers with a high probability of repaying the loan, on the right with a low probability, and in the middle with an average probability. Since, according to the calculation results, the second discriminant function D2(X) turned out to be insignificant, the differences in centroid coordinates along this axis are small.

Fig. 4.2.7. Location of observations on the plane of the two discriminant functions for the groups with low (a), medium (b), and high (c) probability of loan repayment, and for all groups (d)

Bibliography

1. Multivariate Statistical Analysis in Economic Problems: Computer Modeling in SPSS. 2009.

2. Orlov A.I. Applied Statistics. M.: Ekzamen, 2004.

3. Fisher R.A. Statistical Methods for Research Workers. 1954.

4. Kalinina V.N., Soloviev V.I. Introduction to Multivariate Statistical Analysis: Textbook. State University of Management, 2003.

5. Bühl A., Zöfel P. SPSS: The Art of Information Processing. DiaSoft, 2005.

6. http://ru.wikipedia.org/wiki
