Experiments on the quantitative analysis of two-dimensional gels

Summary

Two-dimensional electrophoresis is an important experimental tool for comparative proteomics: it conveniently reveals changes in the relative amounts of proteins between different physiological or genetic backgrounds. The statistical significance of the results depends on many factors, from experimental design to statistical testing. This protocol is taken from the "Guide to Plant Proteomics Experiments", edited by H. Thiellement, M. Zivy, C. Damerval, and V. Méchin (France).

Operation method

Quantitative analysis of two-dimensional gels

Materials and Instruments

Open-source software (OSS)

Procedure

3.1 Experimental design

1. Parallel design

Quantitative variation of some protein spots between gels is uncontrolled. These uncontrolled variations can arise from biological differences between replicate samples or from any step between protein sample preparation and 2D gel staining. It has been shown that batch effects (i.e., variation between series of 2D gels run and stained together) have a significant impact on the results of 2D electrophoresis, so it is important to take them into account in the experimental design. Batches must be composed of gels that differ only by the factor under study. For example, to compare 12 treatments with 3 replicates each in an electrophoresis tank that runs 12 gels simultaneously, 3 successive 2D series should be run, with each treatment represented by one gel in each series. Thus, if an uncontrolled variation affects one of the 3 series (e.g., one silver staining slightly darker than the others), it affects all treatments in the same way. In contrast, if the 3 replicates of a treatment were run in the same tank, it would be impossible to distinguish the batch effect from the true biological effect of the treatment. In practice it is not always possible to keep the design perfectly balanced, since failed 2D gels must be run a second time; however, the experiment should be balanced as far as possible when composing supplementary batches. A balanced layout of this kind is sketched below.
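As an illustration (not from the original text), the following Python sketch builds the balanced assignment described above: replicate r of every treatment goes into batch r, so no treatment has two replicates in the same batch. The treatment names are hypothetical placeholders.

```python
# Balanced batch layout: 12 treatments x 3 replicates in 3 batches of 12 gels.
treatments = [f"T{i:02d}" for i in range(1, 13)]   # hypothetical treatment names

# Each treatment appears exactly once per batch, so a batch effect
# (e.g., darker staining) hits all treatments alike.
design = [(t, r, f"batch-{r}") for t in treatments for r in (1, 2, 3)]

for treatment, replicate, batch in design[:6]:     # show the first few rows
    print(treatment, replicate, batch)
```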

Notably, the magnitude of batch effects can be visualized a posteriori by running a principal component analysis (with gels as observations and spots as variables).

2. Technical and biological replicates

Replicates are extremely important for assessing quantitative variation: they make it possible to estimate the uncontrolled variation, which is essential for evaluating the effect of the factor under study.

When biological replicates are used (i.e., replicates derived from different plant samples), differences between replicates are due to both biological and experimental variation. When technical replicates are used (e.g., different gels derived from the same protein sample), only the experimental variation is involved. The differences between technical replicates are therefore smaller than those between biological replicates, and statistical tests are more easily significant. However, the conclusions drawn may then apply only to the sample under study, because the source of a significant difference cannot be identified: it can be caused by differences between individual plants (e.g., environmental and developmental variation) or by variation introduced during sample preparation, as well as by the factor being tested. With biological replicates, on the contrary, the factor under study is the only possible cause of significant differences, since it is the only one that systematically differs between the compared groups. Biological replicates must therefore be used when the treatment under study is a qualitative variable (e.g., drought versus control).

When the factor under study is a continuous variable (e.g., different doses of the same reagent, or the same treatment at different time points), it is not strictly necessary to replicate every value tested (every dose or time point): a statistical test (e.g., linear regression) computes the residual variance from the differences between observed and predicted values. However, replicates can be omitted only if (i) there are enough data points (a regression line through only 3 points is insufficient), and (ii) the relationship between protein amount and dose or time is linear (linear regression), or the expected response curve is already known (nonlinear regression).

In some cases the consistency of consecutive data points (e.g., 3 time points) can be used to interpret experiments that were not biologically replicated: a coherent, continuous response across consecutive time points supports the hypothesis that the variation is related to the treatment, whereas values that break the continuity of the response are more likely due to individual differences. Consecutive values are thus used, in effect, as biological replicates. In practice, however, this comes at the cost of a loss of resolution in the analysis of the treatment under study.

The best way to take biological variation into account and minimize individual differences is to pool several plants in each biological replicate.

3. Reference gels

Reference gels are not indispensable for quantitative analysis, but they are generally useful. In most 2D software packages, spot matching is based on building a virtual reference gel containing all matched spots. Building such a reference gel is a limiting step, and it is preferable to start from a real gel that already contains almost all spots. This can be achieved by running a co-electrophoresis gel in which the different treatments are represented by equal amounts of protein.

3.2 Image acquisition

1. Digitization

Digitization of gel images is the first step of quantitative analysis. Gels can be scanned with a laser densitometer, a flatbed scanner (e.g., Pharmacia), or a CCD camera (e.g., ProXpress).

Whatever the system, the transmission value must be acquired first, i.e., the detector measures the light transmitted through the gel. The transmission value is the ratio between the intensity of the signal received by the detector in the presence of the gel and the intensity received in its absence (I/I0). When using a flatbed scanner, any feature that enhances contrast (e.g., gamma correction) must therefore be deactivated, as it would distort the true transmission values. Transmission values (ranging from 0 to 1) are generally encoded in 16 bits, i.e., the ratio I/I0 is converted into integers ranging from 0 to 65535. The resulting image is thus a matrix of pixel values between 0 and 65535. The TIFF format in grayscale mode is commonly used, since it can be compressed without distorting pixel values, unlike the JPEG format. A minimal encoding sketch follows.
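As a concrete illustration (not from the original text), the following Python sketch encodes a transmission image to 16 bits; the raw readings I and I0 are assumed to be available as arrays.

```python
import numpy as np

def encode_transmission(I: np.ndarray, I0: np.ndarray) -> np.ndarray:
    """Encode transmission values I/I0 (0..1) as 16-bit integers (0..65535)."""
    T = np.clip(I / I0, 0.0, 1.0)                  # transmission value
    return np.round(T * 65535).astype(np.uint16)   # 16-bit encoding
```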

2. Image Resolution

The higher the resolution (number of pixels per unit length), the better the spots are detected and quantified. Resolution also limits the ability to separate groups of overlapping spots.

Most 2D software packages cannot detect several spots within the same group unless there are "troughs" between the intensity peaks. The accuracy of spot detection therefore depends on the ability to detect these troughs, which in turn depends on the number of pixels representing the intensity variation between two spots. For a gel of about 24 cm × 20 cm, a resolution of 100 μm per pixel is generally used, close to 300 dpi (84.7 μm per pixel). This value is a compromise, because two factors limit the usable image resolution: (i) the speed of image acquisition: when many gels must be scanned on the same day, scanning time becomes the limiting factor; and (ii) the size of the image file: scanning a ~24 cm × 20 cm gel at 100 μm/pixel with 16-bit digitization produces a file of 10 to 14 MB (a worked check is given below). The larger the image file, the longer the 2D software takes to detect, quantify, and match spots (see Note 1).
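For reference, the file size quoted above can be verified with a short calculation (values taken from the text):

```python
# 24 cm = 240,000 um and 20 cm = 200,000 um at 100 um/pixel.
width_px = 240_000 // 100        # 2400 pixels
height_px = 200_000 // 100       # 2000 pixels
bytes_per_px = 2                 # 16-bit grayscale

size_mb = width_px * height_px * bytes_per_px / 1e6
print(size_mb)                   # 9.6 -> ~10 MB before TIFF overhead
```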

3. Image dynamic range

During image acquisition, the 65536 gray levels should be exploited as fully as possible. Depending on the type of scanner, the exposure time, aperture size, and use of filters can be adjusted. Gel images should not contain pure white areas (100% transmission): if the background is saturated, faint spots close to the background level may go undetected.

The image should likewise not contain pure black areas (0% transmission), as all spots darker than the saturation threshold would be encoded with the same value.

The freeware program ImageJ can be used to locate areas with such saturated values. It can also be used to check the dynamic range of the image, i.e., the difference between its minimum and maximum values. The dynamic range should be maximized: the precision of quantification depends directly on the gray levels actually used during acquisition, and it would clearly be wasteful to encode an image on more than 65,000 levels while actually using only a few hundred of them. A sketch of this check is given below.
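As an alternative to the interactive check in ImageJ, the following Python sketch performs the same inspection; the file name is a placeholder, and the image is assumed to be a 16-bit grayscale TIFF.

```python
import numpy as np
from PIL import Image

img = np.asarray(Image.open("gel.tif"))               # hypothetical file name

print("saturated black pixels (0):", int(np.sum(img == 0)))
print("saturated white pixels (65535):", int(np.sum(img == 65535)))
print("dynamic range used:", int(img.min()), "to", int(img.max()))
```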

When a camera is used, the lighting can create a halo: the image is darker at the edges than in the center. Some image acquisition systems (e.g., ProXpress, PerkinElmer) are designed to correct for this halation. The phenomenon can also be largely removed by background subtraction.

4. Conversion of transmission data to optical density

Transmission data must be converted to optical density (this does not apply to fluorescent stains, of course). Most 2D software packages do not perform this conversion by default. Protein amount correlates linearly with optical density, not with transmission. The relationship between optical density (OD) and transmission is: OD = -log10(I/I0).

Since this relationship is not linear, a given increase in transmission corresponds to different increases in OD depending on the starting transmission value. The conversion makes spot volume linear with respect to protein amount: if a given protein is present in two samples A and B with amounts such that B = A + X, then ODB = OD(A+X) = ODA + ODX. This additivity holds for OD but not for transmission values, which is why the conversion must be done before background subtraction (Figure 16-1).

In practice, the conversion to OD can be calibrated by scanning a Kodak density step tablet. 2D software packages include a tool to record the known OD values against the corresponding transmission data and to compute an adjustment curve. The adjustment must account for the logarithmic relationship between OD and transmission; a simple linear regression is not appropriate. A minimal sketch of the ideal conversion follows.
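Ignoring scanner calibration (in practice the step-tablet adjustment curve described above should be used), the ideal conversion can be sketched in Python as follows; the input is assumed to be a 16-bit image encoding I/I0.

```python
import numpy as np

def to_optical_density(pixels: np.ndarray) -> np.ndarray:
    """Convert 16-bit transmission values (0..65535) to OD = -log10(I/I0)."""
    T = np.clip(pixels.astype(float) / 65535.0, 1e-5, 1.0)  # avoid log10(0)
    return -np.log10(T)
```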

3.3 Normalization of spot volumes

As already discussed in Section 3.1, much of the spot variation is related to gel effects: protein losses during sample loading, and uncontrolled variation during 2D electrophoresis or staining, can affect the overall intensity of a gel. Such variations affect, to a greater or lesser extent, all the spots on a given gel, and the purpose of normalization is to correct for these overall differences. Normalization should therefore be performed after the spot volumes have been converted to OD and the background has been subtracted (see Note 2).

1. Definition of the normalization region

The region in which spot detection is performed is defined by the user, and spot normalization is generally performed in the same region. It is important to define an identical region on all gels, because this global normalization is based on the total volume of all spots within the region. Because of gel-to-gel variation (incomplete gels, regions that are hard to delimit for various reasons), it is not always possible to use the same spot-detection region on all gels; in that case it is best to define a separate region for normalization. Indeed, even when the same region of interest is used for spot detection, it can be more practical to define a smaller region for computing the normalization, so that the most variable regions of the gel are discarded. Normalization of spot volumes to the total spot volume in a user-defined region is easy to program around a 2D software package: the exported data include, for each spot on each gel, the spot number, the match number (i.e., the number of the corresponding spot on the reference gel), X, Y, and the non-normalized volume. These data are easily exported from common 2D packages (e.g., from the measurement window of Progenesis).




" firstgel.csv" is a text file where the first row contains the column headings and the following rows contain the number of points, the number of matches, X, Y, and the amount of non-homogenization of all detected points on the first gel. Figure 16-2 shows a program written in the SAS language that normalizes a user-defined region based on the total number of all points in that region and produces a separate fixed table where each point is a variable (a column) and each gel is a row (an observation). Although not represented in Figure 16-2, this method is easily used in conjunction with another method. For example, the calculation of the total number of points can be restricted to points occurring on all gels (see Section 16.3.5). The calculation could also be limited to a specific set of points. However, the final number of points selected for normalization should not be too small: the smaller the number, the more unstable the normalization.

The program in Figure 16-2 looks complicated because of its many comments, but it is in fact relatively clear and simple. Just as there are different ways of analyzing qualitative and quantitative variation, there are different ways of normalizing, and they can be programmed with any common statistical package, in ways that are more flexible and more advanced than the limited statistical tools built into 2D packages. An equivalent sketch in Python is shown below.
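Since the SAS program itself is not reproduced here, the following Python sketch performs the computation described above; the file names, column names, and region bounds are illustrative assumptions.

```python
import pandas as pd

# Hypothetical bounds of the normalization region, in pixel coordinates.
x_min, x_max, y_min, y_max = 100, 2300, 100, 1900

rows = []
for name in ["firstgel.csv", "secondgel.csv"]:       # one exported CSV per gel
    gel = pd.read_csv(name)                          # columns: spot, match, X, Y, volume
    region = gel[gel.X.between(x_min, x_max) & gel.Y.between(y_min, y_max)]
    gel["norm"] = gel["volume"] / region["volume"].sum()
    rows.append(gel.set_index("match")["norm"].rename(name))

# One row per gel, one column per matched spot, as described in the text.
table = pd.DataFrame(rows)
```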




2. Other spot normalization methods

Another spot normalization method is based on the ratio between the volume of a spot on the reference gel and its volume on the gel under study, vol_ref/vol_gel, computed for every spot matched between the two gels. Normalization then consists of multiplying all spot volumes on the gel by the mean (or median) of these ratios. Since not all spots are used, the accuracy of this method does not depend on a precise definition of a normalization region, and the usual region of interest can be used even if it is not very reproducible. Because it relies on spots matched between two gels (the gel being normalized and the reference gel), the normalization involves more spots than a method restricted to spots present on all gels, since the number of spots common to all gels drops markedly as the number of gels in the experiment increases.

This method is also theoretically superior to normalization over the total volume in a region, because it is not biased by spots specific to one treatment. Its accuracy can be improved by computing the volume ratios only for spots within a certain range: very faint spots can be excluded, because small volume changes produce large ratio changes, and very intense spots should also be excluded, because near-saturation values lose their linear relationship to protein amount.

Unmatched spots are of no use in this method, so it suffices to export a single table containing the raw volumes and match numbers of all matched spots on all gels. The program is shown in Figure 16-3, with data taken from the CSV file exported from the "Compare window" of the Progenesis software: the first column contains the match names, and the following columns contain the non-normalized volumes of the spots on the different gels. This form of quantitative export is also the most convenient when the normalization method proposed by the 2D software itself is adequate. The same output file can also be used for selecting reproducible spots and spots showing qualitative variation (see Section 16.3.5). A Python sketch of this ratio-based normalization follows.
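As with the previous method, the Figure 16-3 program is not reproduced here; the following Python sketch implements the median-ratio normalization just described, with hypothetical file and column names.

```python
import pandas as pd

df = pd.read_csv("compare.csv", index_col=0)   # rows: matched spots; columns: gels
ref = df["reference_gel"]                      # hypothetical reference-gel column

for gel in df.columns.drop("reference_gel"):
    both = df[[gel]].join(ref).dropna()        # spots present on both gels
    # Median of vol_ref / vol_gel over matched spots (very faint or
    # near-saturated spots could be filtered out here, as the text suggests).
    ratio = (both["reference_gel"] / both[gel]).median()
    df[gel + "_norm"] = df[gel] * ratio
```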



Burstin et al. [1] established another spot normalization method, based on principal component analysis. It is suitable when the variation due to the factor under study is small relative to the residual variation, or when it involves only a few spots. It is not described further here.

3.4 Linearity of the relationship between protein amount and spot volume

It is worthwhile to analyze the relationship between protein amount (abundance) and measured spot volume. One apparent way to do this would be to compare a series of gels loaded with different amounts of the same sample. However, this does not give a correct quantitative estimate once spot volumes are normalized: gels loaded with different amounts of the same sample cannot be meaningfully normalized, because normalization removes precisely the overall gel effect that the differences in protein loading produce.

A better approach is to use two different samples with sample-specific spots (e.g., the sample of interest and a sample from a different species or organ) and to prepare mixtures in different proportions, e.g., from 1:9 to 9:1, all with the same total protein amount. The 2D gels obtained from these mixtures and from the two pure samples can be normalized as usual, and regressions can then be computed for the spots specific to the sample of interest, since the proportion of each sample in every mixture is known. Avid et al. used this approach to investigate the linearity between spot volume and protein amount under the same conditions as routine experiments. If the response is linear, even modest differences in protein amount can be measured. The sketch below illustrates such a regression.
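The following Python sketch illustrates this linearity check; the mixing proportions and spot volumes are made-up example numbers.

```python
import numpy as np
from scipy.stats import linregress

proportion = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 1.0])  # known share of the sample
volume = np.array([0.9, 2.8, 5.2, 7.1, 8.8, 10.1])     # hypothetical normalized volumes

fit = linregress(proportion, volume)
print(f"slope={fit.slope:.2f}, r^2={fit.rvalue**2:.3f}")  # linear if r^2 is ~1
```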

3.5 Qualitative variation

A qualitative variation, i.e., the presence or absence of a spot, is easier to detect than a quantitative one. It can, however, be harder to define, at least in large-scale experiments where a certain amount of data is missing.

A qualitatively variable spot cannot, by definition, be present on all gels, so its reproducibility cannot be assessed in the usual way. It is better to reason in terms of consistency, considering that a spot should be consistently present or consistently absent. The most stringent consistency criterion is to require that a spot be present in all replicates of one group (treatment, genotype, etc.) and missing in all replicates of the other group. With many gels, however, this criterion can prove too strict, because of possible accidents during the experiment (e.g., a gel stained more lightly than the others, a broken gel, etc.). The sketch below applies the strict criterion.
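The strict criterion above is easy to express in Python; gel and group names are hypothetical, and an absent spot is assumed to be exported as an empty (NaN) value.

```python
import pandas as pd

df = pd.read_csv("compare.csv", index_col=0)        # raw volumes; NaN = spot absent
group_a = ["control_1", "control_2", "control_3"]   # hypothetical gel names
group_b = ["drought_1", "drought_2", "drought_3"]

present_all_a = df[group_a].notna().all(axis=1)
present_all_b = df[group_b].notna().all(axis=1)
absent_all_a = df[group_a].isna().all(axis=1)
absent_all_b = df[group_b].isna().all(axis=1)

# Spots consistently present in one group and consistently absent in the other.
qualitative = df[(present_all_a & absent_all_b) | (present_all_b & absent_all_a)]
```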

3.6 Quantitative changes

The goals of quantitative proteomic analyses vary widely. The interest of a global analysis is to identify the main sources of variation in the protein pattern and to single out the few spots that respond to particular treatments.

Quantitative variation can be used to analyze relationships between proteins, for example to detect groups of co-regulated proteins. Hierarchical classification is generally used to cluster proteins according to their amounts under the different experimental conditions, and the clusters can be visualized with the program "Cluster". A clustering sketch follows.
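The "Cluster" program itself is not shown in the text; as an illustration, the following Python sketch performs an equivalent hierarchical clustering of spots, assuming the normalized table built earlier (gels as rows, spots as columns) has been saved to a hypothetical file.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

table = pd.read_csv("normalized_volumes.csv", index_col=0)  # hypothetical file

Z = linkage(table.T.fillna(0.0), method="average")  # cluster spots across conditions
dendrogram(Z, labels=list(table.columns))
plt.show()
```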

Principal component analysis (PCA), with spots as variables and samples as observations, displays the distribution of the different samples along the main axes of variability of the spot volumes (see Chapter 17). PCA also makes abnormal gels easy to detect, e.g., a gel lying far from all the others in the score plot (see Note 3). A PCA sketch is given below.
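A minimal PCA sketch, again assuming the hypothetical normalized table used above:

```python
import pandas as pd
from sklearn.decomposition import PCA

table = pd.read_csv("normalized_volumes.csv", index_col=0)  # gels x spots

scores = PCA(n_components=2).fit_transform(table.fillna(0.0))
for gel, (pc1, pc2) in zip(table.index, scores):
    print(f"{gel}: PC1={pc1:.2f}  PC2={pc2:.2f}")  # outlier gels stand apart
```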

Another approach to quantitative variation is to look for spots whose variation clearly correlates with the controlled experimental factors (e.g., treatment, genotype) or with other factors measured during the experiment (e.g., hormone dose). Global analyses such as PCA are not well suited to detecting only the significantly varying spots, because such spots are not necessarily numerous and their variation may be small relative to the variation of most spots. To test spots for significant effects of one or several factors, analysis of variance (ANOVA) is generally used. When more than two treatments are compared, ANOVA is preferable to running multiple t-tests, because it provides a better estimate of the residual variance (see Note 4).






After ANOVA, different comparison methods can be used depending on the biological question. For example, the Dunnett test is suitable for comparing each treatment with a common control, while the Duncan or Student-Newman-Keuls tests are suitable for all pairwise comparisons between treatments (see Note 5). Linear regression is appropriate for relating spot volume to a continuous variable (e.g., hormone dose). Figure 16-5 shows the SAS program for selecting all spots showing significant effects, including interactions between factors, in a two-way ANOVA (see Note 6); a Python sketch of the same computation follows.
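The Figure 16-5 program is not reproduced in the text; the following Python sketch runs the same kind of per-spot two-way ANOVA with interaction, assuming a long-format table with hypothetical column names (spot, volume, treatment, genotype).

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.read_csv("long_format_volumes.csv")   # one row per spot x gel

pvalues = {}
for spot, d in data.groupby("spot"):
    model = smf.ols("volume ~ C(treatment) * C(genotype)", data=d).fit()
    anova = sm.stats.anova_lm(model, typ=2)     # two-way ANOVA with interaction
    pvalues[spot] = anova["PR(>F)"]             # p-values: both factors + interaction
```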

In statistical tests, 0.05 or 0.01 is commonly used as the significance level: a variation is declared significant when the probability of observing it by chance alone is below 5% or 1%. In other words, the significance level is the probability of a false positive. Thus, if the 0.01 level is applied to 1000 spots, about 10 of the significant spots can be expected to be false positives. One alternative is to divide the significance level by the number of comparisons (Bonferroni correction); here this gives a level of 0.01/1000 = 10⁻⁵, which keeps the overall probability of even one false positive among the 1000 spots at about 0.01. This is a conservative method, but it reduces sensitivity: at a level of 10⁻⁵, only very large variations are detected, and many truly varying spots may be missed.





For multiple comparisons, Benjamini and Hochberg [3] proposed the False Discovery Rate (FDR) method. Its principle is to allow a small percentage (e.g., 5% or 1%) of the spots declared significant to be false positives. Whereas the Bonferroni correction keeps the probability of a single false positive among 1000 tests at 1%, the FDR method allows 1% of the positive tests to be erroneous; it is less conservative than Bonferroni and therefore more sensitive. It is a compromise between no correction at all (each spot tested at the 1% level) and over-correction (Bonferroni). Figure 16-6 shows the SAS program for selecting significant spots according to the FDR method; a Python sketch is given below.
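The Figure 16-6 program is not reproduced in the text; Benjamini-Hochberg selection can be sketched in Python as follows, with random numbers standing in for the per-spot p-values from the ANOVA step.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvalues = rng.uniform(size=1000)    # stand-in for one p-value per spot

reject, p_adjusted, _, _ = multipletests(pvalues, alpha=0.01, method="fdr_bh")
significant_spots = np.flatnonzero(reject)   # spots kept at a 1% FDR
```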



