Experiments in multivariate analysis of proteomic data
We provide background on multivariate analysis of proteomics data and describe methods for converting data between different software packages. The technique can be applied to problems in biology and biochemistry that involve large amounts of data. The workflow comprises digitization of the 2D gels, analysis with image-processing software, data conversion, multivariate data analysis, interpretation of the results, and ultimately answering the biological question. The source for this experiment is the "Guide to Plant Proteomics Experiments" (H. Thiellement, M. Zivy, C. Damerval, V. Méchin, eds.; France).
Procedure
Multivariate analysis of proteomic data
Materials and Instruments
Scanner
Image analysis software
Excel
Software for multivariate data analysis
Multivariate analysis of the 2D gels described here was performed with Progenesis, Excel, and The Unscrambler.
3.1 Create 2D gels of the proteins after the study protocol has been established
Gel preparation is not elaborated in this section, but the staining method must be chosen with quantitative analysis of the gel in mind (see Note 1).
3.2 Digitize the gel using a scanner in transmission mode. Scanning itself is not described in this section, but make sure that the images are scanned at high bit depth and resolution (see Note 2) and that they are saved in a format the image-processing software can read (see Note 3).
3.3 Analysis of the Digitized 2D Gel with Image Analysis Software After digitizing the 2D gel (Fig. 17-1), use the image analysis software Progenesis to detect the protein spots and match them to those on a reference gel. The reference gel can be selected automatically, or a specific 2D gel can be designated as the reference; protein spots that cannot be matched can be added to the reference gel.
3.4 Generate protein spot list
After spot detection, a list of values for the matched protein spots can be generated; usually this is volume data. The list can be found in the Comparison Window of the Progenesis software and can be exported to Excel with Copy to Excel in the Edit menu (Table 17-1). In the binary version of the list, a 1 indicates the presence of a protein spot and a 0 its absence; this binary list can be very useful in certain situations (see Note 4). It is also important that the protein spots are labeled at the same time (see Note 5); otherwise, entering the data into Excel will be problematic in one way or another (see Note 6).
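The conversion from a volume list to the binary presence/absence list can be sketched as follows; the spot-volume values below are hypothetical, and a volume of zero is taken to mean the spot was not detected on that gel.

```python
import numpy as np

# Hypothetical spot-volume table: rows = gels (samples), columns = protein spots.
volumes = np.array([
    [1250.0,    0.0,  310.5],
    [ 980.2,  415.0,    0.0],
    [1100.7,  398.3,  295.1],
])

# Binary presence/absence list: 1 = spot present, 0 = spot absent.
presence = (volumes > 0).astype(int)

print(presence)
```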
3.5 Enter the tabular data into the multivariate analysis software
1. Validation method
The next step is the selection of a validation method, which depends on the number of samples and on whether another data set can be created. If the data set consists of many gels, the preferred validation method is test-set validation, followed by cross-validation.
(1) Test-set validation is based on two separate data sets: one used to calibrate the PCA model (the calibration set) and one used to test the calibrated model (the test set, or validation set).
The test set must satisfy several conditions. First, as with the calibration set, all samples must come from the same population, and the sampling conditions must match those of the calibration set. In addition, both data sets must be representative. A single large data set cannot simply be split in two, because the two halves would be too similar; the difference between the two sets should reflect sampling variance, i.e., the variance between independent samples drawn from the same target population [6]. The calibration set must be large enough to calibrate a model, and the test set must be large enough to test it. Often there are not enough samples for test-set validation, and leverage validation or cross-validation becomes necessary.
(2) Leverage validation can be used when there are few samples but all of them are important. Because the same data set is used both to calibrate the model and later to validate it, leverage validation usually yields overly optimistic results. For this reason, we do not recommend the use of leverage validation.
(3) Cross-validation is used for medium to large data sets. The data set is divided into segments; each segment is omitted in turn, a sub-model is calibrated on the remaining data, and the omitted segment is used to test that sub-model. Every segment is treated this way. Segment size and structure (random, systematic, manual) vary with the type of data set. If each segment represents 25% of the total data set, four sub-models are computed and tested. For smaller data sets, segments of only one sample are often used; this is called full cross-validation: one sample at a time is omitted from the calibration, and the omitted sample is used for testing. Full cross-validation thus constructs as many sub-models as there are samples. Because each sample is omitted exactly once and each omitted sample is used to test the model, full cross-validation often yields a good validation result when testing a balanced data set [6].
(4) In The Unscrambler, the validation options can be selected (see Note 9).
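The segmentation schemes described above can be sketched as follows; the 12-sample data set and the 25% segment size are hypothetical, and model fitting itself is omitted.

```python
# Sketch of segmented cross-validation index handling (model fitting omitted).
# Assumes a hypothetical data set of 12 gels split into 4 segments of 25% each.
n_samples = 12
n_segments = 4
indices = list(range(n_samples))
segment_size = n_samples // n_segments

splits = []
for k in range(n_segments):
    test = indices[k * segment_size:(k + 1) * segment_size]   # omitted segment
    train = [i for i in indices if i not in test]             # calibration sub-model data
    splits.append((train, test))

# Full cross-validation is the special case of one-sample segments:
loo_splits = [([i for i in indices if i != j], [j]) for j in indices]

print(len(splits), len(loo_splits))
```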
Figure 17-2 Example PCA analysis.
3.7 Description of Scores and Score Plots
(1) The principal components (PCs) are linear combinations of the original variables and contain information about the structure of the data. The first principal component covers the largest share of the information, and each higher-order component covers less. A PC is also called a latent variable or score vector.
(2) The score plot shows the positions of the samples along two or three principal components; the more similar two samples are, the closer their scores. Score plots are usually interpreted in terms of clusters: samples with common characteristics form a cluster, which yields information about the samples and about the variables that differentiate one cluster from another. It is also possible to find outliers, i.e., samples that differ from the majority. Outliers should always be examined, since they may be samples of genuine interest, or they may reveal errors in the analysis or in the data collection (in which case the data can be eliminated).
(3) The score plot should be analyzed together with the loading plot for the same principal components; the loading plot helps identify the variables responsible for the differences among samples observed in the score plot. The loading plot (Figure 17-3) depicts the data from a different perspective: each variable has a loading on each PC, which reflects both how much the variable contributes to the PC and how much of the variable's variation the PC accounts for.
(4) Interpreting the loading plot starting with the variables with high loadings can help explain the significance of a particular PC (Figure 17-4). The higher the loadings, the stronger the correlation between two variables. Since a loading is the cosine of the angle between the variable and the PC, its value lies in the interval [-1, +1]. Variables with high loadings on the same side are positively correlated, and variables on opposite sides are negatively correlated. To aid the analysis, a bi-plot can be made, which is a combined scatter plot of the scores and loadings (Figure 17-5).
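A minimal numpy sketch of how scores and loadings arise from a PCA decomposition; the data matrix here is random and purely illustrative, standing in for a gels-by-spots volume table.

```python
import numpy as np

# Minimal PCA sketch on a hypothetical spot-volume matrix X
# (rows = gels/samples, columns = protein spots/variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Mean-center each variable, then decompose with the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

T = U * s        # scores: sample coordinates along the PCs (score plot axes)
P = Vt.T         # loadings: variable contributions to each PC (loading plot axes)

# With all PCs retained, the model reproduces the centered data: Xc = T Pᵀ.
print(np.allclose(Xc, T @ P.T))
```

Plotting the first two columns of T against each other gives the score plot, and the first two columns of P the loading plot.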
3.8 Returning to the Biology
Once the sample distribution has been elucidated with the score plot and the variables that drive it identified with the loading plot, it is time to return to the biology or biochemistry of the samples. In 2D gel electrophoresis, the protein spots are the variables, which means that this analysis can indicate which protein spots are responsible for part of the distribution of the 2D gels (Figure 17-6). The researcher can then formulate a hypothesis to explain this distribution. This approach, also called exploratory data analysis, is a very effective method of proteomic analysis.
3.9 Analyzing the Variables Associated with Protein Samples by Partial Least Squares Regression
Partial least squares regression (PLSR) is a supervised method that relates two data matrices (X and Y) through regression. Like PCA, PLSR seeks the most direct linear relationships among data points in a multidimensional space that explain most of the variability; its goal is to build a linear model that predicts the desired characteristics in one data table from another. Thus, PCA is used to find the hidden information in a single data table (the X matrix), while PLSR is used to uncover the relationship between two data tables (the X and Y matrices). The X matrix is (N × K) and the Y matrix is (N × J), where N is the number of samples and K and J are the numbers of X- and Y-variables, respectively [5].
PLS-R works by performing PCA-like decompositions of the X and Y matrices in such a way that the two decompositions are interdependent.
Similarly to PCA [5], the X-variables are related to the X-scores T through the X-loadings P and the X-residuals E by the model:

X = T Pᵀ + E
Similarly, the Y-variables are related to the X-scores T through the Y-loadings Q and the Y-residuals F [5]:

Y = T Qᵀ + F
The Y-variables can also be obtained directly from the X-variables through the matrix of regression coefficients B [5]:

Y = X B + F
The results can be analyzed by combining the above series of equations.
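The equations above can be combined in a minimal single-component PLS1 sketch (NIPALS-style); the data, sizes, and variable names are hypothetical, and a full multi-component implementation would iterate this deflation.

```python
import numpy as np

# One-component PLS1 sketch on hypothetical data:
# X (N x K) spot volumes, y (N,) a measured trait to predict.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))
y = rng.normal(size=8)

Xc = X - X.mean(axis=0)          # mean-center X
yc = y - y.mean()                # mean-center y

w = Xc.T @ yc                    # weight vector: direction of max covariance
w /= np.linalg.norm(w)
t = Xc @ w                       # X-scores T
p = Xc.T @ t / (t @ t)           # X-loadings P
q = yc @ t / (t @ t)             # y-loading Q

E = Xc - np.outer(t, p)          # X-residuals:  Xc = t pᵀ + E
f = yc - t * q                   # y-residuals:  yc = t q  + f
B = w * q                        # regression coefficients: y is predicted as Xc B

# The decomposition and the direct regression form agree:
print(np.allclose(Xc, np.outer(t, p) + E), np.allclose(Xc @ B, t * q))
```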
Calibration of PLS-R
The quality of the PLS-R calibration is expressed in two parameters. The residual validation Y-variance (RVYV) indicates the difference between the measured and predicted Y-variables; across the whole model, this difference can be expressed as the root mean square of the RVYV (RMSE).
The X-data are used to build the model, and new X-values are inserted into the model to predict the Y-values. The modeling error is the difference between the predicted and measured Y-values [6].
The deviation expresses how similar a predicted sample is to the calibration samples. When the predicted sample resembles the calibration samples, the deviation is small; with a high deviation, the predicted value is not reliable.
The last important parameter is the correlation coefficient (r), which defines the correlation between X and Y. The correlation formula is as follows:

r = cov(X, Y) / (s_X · s_Y)

where cov(X, Y) is the covariance of X and Y, and s_X and s_Y are their standard deviations.
Correlation is a measure of the linear relationship between two variables. A value of ±1 indicates a perfect linear relationship between the variables, while a value of 0 means that there is no linear correlation between them.
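The calibration diagnostics above can be computed directly; the measured and predicted Y-values below are hypothetical.

```python
import numpy as np

# Sketch of the PLS-R calibration diagnostics described above,
# using hypothetical measured vs. predicted Y-values.
y_measured  = np.array([2.0, 3.1, 4.2, 5.0, 6.3])
y_predicted = np.array([2.2, 2.9, 4.0, 5.3, 6.1])

residuals = y_measured - y_predicted      # modeling error per sample
rmse = np.sqrt(np.mean(residuals ** 2))   # root mean square error

# Correlation coefficient r = cov(X, Y) / (s_X * s_Y)
r = np.corrcoef(y_measured, y_predicted)[0, 1]

print(round(rmse, 3), round(r, 3))
```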