REDUCE Suite

General Category => General Discussion => Topic started by: mora on May 24, 2019, 04:36:26 pm

Title: IC from PSAM
Post by: mora on May 24, 2019, 04:36:26 pm

What does the information content (IC) of a logo built from a PSAM mean? How does it differ from the IC inferred from a classic PSSM/PWM?

Also, is there any information about how LogoGenerator calculates the IC in bits from the values in the PSAM? I assume it has to be very different from how it is calculated for a classic PWM, right? (see link)
https://en.wikipedia.org/wiki/Position_weight_matrix

Title: Re: IC from PSAM
Post by: xiangjun on May 25, 2019, 10:09:32 am

Hi Mora,

Thanks for your questions.

Regarding the notation PSAM, please refer to the two MatrixREDUCE-related publications:

"Profiling condition-specific, genome-wide regulation of mRNA stability in yeast. (https://www.ncbi.nlm.nih.gov/pubmed/16317069)" Foat BC, Houshmandi SS, Olivas WM, Bussemaker HJ. Proc Natl Acad Sci U S A. 2005 Dec 6;102(49):17675-80. Epub 2005 Nov 29.
"Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE." Foat BC, Morozov AV, Bussemaker HJ. Bioinformatics. 2006 Jul 15;22(14):e141-9

You may simply take the PSAM logo as an alternative to the classic information content (IC) type. If you want to dig deep into the technical details, please refer to the source code. The LogoGenerator program in the REDUCE Suite is just a tool for users to employ as they see fit. Notably, it has more options than just the IC or PSAM logos.

Hope this helps a bit.

Xiang-Jun

Title: Re: IC from PSAM
Post by: mora on May 26, 2019, 04:56:22 pm

Thanks a lot!! One more question:

How exactly does LogoGenerator calculate the height of each nucleotide at each position?

Your Foat et al. 2005 paper says that "the height of each nucleotide is determined by subtracting the smallest weight for any nucleotide at that position and then dividing by the sum of all four weights".

I replicated this on a PSAM I made and the results were similar, but not identical, to the nucleotide heights (y-axis) in the logo.

Does LogoGenerator perform any step other than subtracting the smallest weight? Maybe some type of normalization?

Oh, I see that your 2006 paper says that for the affinity logo you used the average, right? What about when the option -style=bits_info is used? How is the nucleotide height calculated then to approximate bits?

Title: Re: IC from PSAM
Post by: xiangjun on May 27, 2019, 10:18:22 pm

Quote

How exactly does LogoGenerator calculate the height of each nucleotide at each position?

Check the source code. More specifically, the logo_psam() function in LogoGenerator.c. If you can go over a specific example, step-by-step, I'll be able to help.

As for the specifics in the two articles, Harmen may chime in to make a comment.

Best regards,

Xiang-Jun

Title: Re: IC from PSAM
Post by: mora on May 28, 2019, 04:45:27 pm

Thanks a lot but I do not know how to read C code.

I am posting an example of a matrix I created using Optimize PSAM and two logos from LogoGenerator. Basically, what I want to know is how exactly LogoGenerator translates the PSAM values into nucleotide heights in bits and in ddG.

For example: at position 3 the G value is 1. How does LogoGenerator turn that 1 into a ddG of ~3 and ~0.6 bits?

Title: Re: IC from PSAM
Post by: hjb2004 on June 05, 2019, 08:52:57 am

The most pedagogical and detailed explanation of the relationship between the various logo and matrix representations can be found in a review that we wrote in 2007 (Bussemaker et al., Annu Rev Biophys Biomol Struct. 2007;36:329-47; https://www.ncbi.nlm.nih.gov/pubmed/17311525). I recommend that you read this paper, and some of the other relevant papers that it refers to.

A brief summary:

1. The philosophy of MatrixREDUCE is fundamentally different from that of traditional weight matrix discovery methods such as MEME. We use a matrix representation of DNA binding specificity called a position-specific affinity matrix (PSAM). The entries in the PSAM quantify the relative affinity of sequences that differ from the optimal DNA sequence by a single point mutation. Therefore, the largest value in each column equals one, and all other values are between zero and one. The entries in a PSAM do not have to add up to one at each position.

2. The negative of the natural logarithm of each entry in the PSAM corresponds to a delta-delta-G value in units of RT, as commonly used in the biophysical literature. The energy logo that we introduced in Foat et al. 2006 (https://www.ncbi.nlm.nih.gov/pubmed/16873464) is a graphical representation in which the letter heights are directly related to these ddG/RT values. This is the logo representation that we recommend for visualizing MatrixREDUCE models.

3. It is not possible to construct a traditional information-content logo from a PSAM without making further ad hoc assumptions about the "background frequency" of each base, and therefore we do not recommend it. Traditional PWMs are statistical models of sets of aligned binding sites, in which the frequency of each base is specified at each position. These frequencies add up to one by definition, unlike the entries in a PSAM. While we do not recommend this, you could divide each column of the PSAM by its sum to convert the four relative affinities at each position into a set of four base frequencies. You could then convert these "foreground frequencies" to relative entropies, as is done to construct traditional sequence logos: for each base, divide the foreground frequency by the background frequency, take the base-two logarithm of this ratio, and multiply this logarithm by the foreground frequency; then sum over all four bases to get the relative entropy, which is also known as the information gain and is measured in bits. The total height of the letter stack in the logo will correspond to this relative entropy, and the height of the individual letters will be proportional to the corresponding base frequency.
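Written out as formulas, points 2 and 3 above might look as follows. The symbols are notation introduced here for illustration only and do not come from the papers or from the LogoGenerator source: w_j(b) is the PSAM entry for base b at position j, f_j(b) the derived "foreground frequency", q(b) the assumed background frequency, I_j the relative entropy in bits, and h_j(b) the letter height under the proportionality described in point 3. This is a sketch of the recipe as stated in the text, not a statement of what LogoGenerator computes internally.

```latex
\begin{align*}
\frac{\Delta\Delta G_j(b)}{RT} &= -\ln w_j(b) \\
f_j(b) &= \frac{w_j(b)}{\sum_{b' \in \{A,C,G,T\}} w_j(b')} \\
I_j &= \sum_{b \in \{A,C,G,T\}} f_j(b)\,\log_2 \frac{f_j(b)}{q(b)} \\
h_j(b) &= f_j(b)\, I_j
\end{align*}
```

With a uniform background, q(b) = 1/4 for every base, the third line reduces to 2 bits minus the Shannon entropy of the column.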

Finally, here is a numerical example for how a single position within the binding site could be represented:

                                    A      C      G      T
relative affinity                   1.0    0.5    0.1    0.9
ddG/RT                              0.00   0.69   2.30   0.11
base frequency (not recommended)    0.40   0.20   0.04   0.36
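For readers who do not read C, the numbers in this example can be reproduced with a short Python sketch. It is not LogoGenerator's code and makes no claim about what logo_psam() does; the function names are hypothetical, and the uniform background of 0.25 per base is exactly the kind of ad hoc assumption that point 3 warns about.

```python
import math

BASES = ("A", "C", "G", "T")

def ddg_over_rt(affinities):
    """ddG/RT = -ln(w) for each relative affinity w (requires w > 0)."""
    # log(1/w) is used instead of -log(w) so that w = 1 gives 0.0, not -0.0.
    return [math.log(1.0 / w) for w in affinities]

def base_frequencies(affinities):
    """Divide the column by its sum (the 'not recommended' conversion)."""
    total = sum(affinities)
    return [w / total for w in affinities]

def relative_entropy_bits(freqs, background=(0.25, 0.25, 0.25, 0.25)):
    """Sum of f * log2(f / q) over bases; the background q is an assumption."""
    return sum(f * math.log2(f / q) for f, q in zip(freqs, background) if f > 0)

# Relative affinities for A, C, G, T from the example above.
affinity = [1.0, 0.5, 0.1, 0.9]

ddg = ddg_over_rt(affinity)
freq = base_frequencies(affinity)
info = relative_entropy_bits(freq)           # total stack height in bits
letter_heights = [f * info for f in freq]    # each letter proportional to its frequency

for base, w, g, f, h in zip(BASES, affinity, ddg, freq, letter_heights):
    print(f"{base}: affinity={w:.2f}  ddG/RT={g:.2f}  freq={f:.2f}  height={h:.3f} bits")
print(f"relative entropy of the column: {info:.3f} bits")
```

Because the total stack height depends on the chosen background, a different choice of background frequencies would change the bit values while leaving the ddG/RT row untouched.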

Hope this helps!