Messages - hjb2004

General Discussion / Re: IC from PSAM

« on: June 05, 2019, 08:52:57 am »

The most pedagogical and detailed explanation of the relationship between the various logo and matrix representations can be found in a review that we wrote in 2007 (Bussemaker et al., Annu Rev Biophys Biomol Struct. 2007;36:329-47; https://www.ncbi.nlm.nih.gov/pubmed/17311525). I recommend that you read this paper, and some of the other relevant papers that it refers to.

A brief summary:

1. The philosophy of MatrixREDUCE is fundamentally different from that of traditional weight matrix discovery methods such as MEME. We use a matrix representation of DNA binding specificity called position-specific affinity matrix (PSAM). The entries in the PSAM quantify the relative affinity of sequences that differ from the optimal DNA sequence by a single point mutation. Therefore, the largest value in each column equals one, and all other values are between zero and one. The entries in a PSAM do not have to add up to one at each position.

2. The negative of the natural logarithm of each entry in the PSAM corresponds to a delta-delta-G value in units of RT, which is the quantity commonly used in the biophysical literature. The energy logo that we introduced in Foat et al. 2006 (https://www.ncbi.nlm.nih.gov/pubmed/16873464) is a graphical representation in which the letter heights are directly related to these ddG/RT values. This is the logo representation that we recommend for visualizing MatrixREDUCE models.

3. It is not possible to construct a traditional information-content logo from a PSAM without making further ad hoc assumptions about the "background frequency" of each base, and therefore we do not recommend it. Traditional PWMs are statistical models of sets of aligned binding sites, in which the frequency of each base is specified at each position. These frequencies add up to one by definition, unlike the entries in a PSAM. While we do not recommend this, you could divide each column of the PSAM by its sum to convert the four relative affinities at each position into a set of four base frequencies. You could then convert these "foreground frequencies" to relative entropies as is done to construct traditional sequence logos (for each base, divide foreground frequency by background frequency, take the logarithm base two of this ratio, then multiply this logarithm by the foreground frequency for each base, and finally sum over all four bases to get the relative entropy, which is also known as the information gain, measured in bits). The total height of the letter stack in the logo will correspond to this relative entropy, and the height of the individual letters will be proportional to the corresponding base frequency.

Finally, here is a numerical example for how a single position within the binding site could be represented:

	A	C	G	T
relative affinity	1.0	0.5	0.1	0.9
ddG/RT	0.00	0.69	2.30	0.11
base frequency (not recommended)	0.40	0.20	0.04	0.36
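
The arithmetic behind this table can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the uniform background (0.25 per base) is an assumption of exactly the ad hoc kind mentioned in point 3.

```python
import math

# Relative affinities for one PSAM position, taken from the table above.
psam_column = {"A": 1.0, "C": 0.5, "G": 0.1, "T": 0.9}

def ddg_over_rt(column):
    """ddG/RT per base: -ln(w), computed as ln(1/w) so the best base prints as 0.00."""
    return {base: math.log(1.0 / w) for base, w in column.items()}

def base_frequencies(column):
    """Divide the column by its sum (the conversion that is not recommended)."""
    total = sum(column.values())
    return {base: w / total for base, w in column.items()}

def relative_entropy_bits(freqs, background):
    """Sum over bases of f * log2(f / b); bases with f == 0 contribute nothing."""
    return sum(f * math.log2(f / background[base])
               for base, f in freqs.items() if f > 0)

# Assumption: uniform background frequencies, chosen only for illustration.
uniform_background = {base: 0.25 for base in "ACGT"}

ddg = ddg_over_rt(psam_column)
freqs = base_frequencies(psam_column)
info = relative_entropy_bits(freqs, uniform_background)

# Each letter's height is its base frequency times the total stack height.
letter_heights = {base: f * info for base, f in freqs.items()}

for base in "ACGT":
    print(f"{base}: ddG/RT={ddg[base]:.2f}  frequency={freqs[base]:.2f}  "
          f"letter height={letter_heights[base]:.3f} bits")
print(f"relative entropy: {info:.3f} bits")
```

Changing the background changes the stack height while the PSAM itself stays the same, which illustrates why this conversion involves an ad hoc choice.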

Hope this helps!

General Discussion / Re: Multicollinearity

« on: September 10, 2018, 01:55:26 pm »

Dear Anthony,

Following up on Xiang-Jun's reply, you are correct that the Transfactivity program does not explicitly deal with collinearity. This should not be a problem when Transfactivity is used to infer TF activities for additional expression profiles using one or more PSAMs generated by MatrixREDUCE, as the stepwise PSAM discovery implemented by MatrixREDUCE was explicitly designed to make the PSAMs distinct from each other. In other words, when AffinityProfile is used with a set of PSAMs discovered by MatrixREDUCE to create a matrix containing total affinities for each sequence (which is also the first step performed by Transfactivity), the columns of that matrix will be close to orthogonal. The values of the regression coefficients in a multi-PSAM linear regression will then be close to those obtained in separate single-PSAM fits.

Things are potentially different, however, when Transfactivity is used with a set of PSAMs obtained from another source such as Jaspar. In that case, there is no guarantee that the columns of the affinity matrix created by AffinityProfile are independent of each other, and the behavior of the regression could indeed become unstable due to collinearity. We were dealing with exactly this situation in two of our lab’s previous papers. In one case, we implemented L2-penalized regression in R with a design matrix generated by AffinityProfile to deal with collinearity when inferring protein-level activities for a large number of yeast transcription factors (Lee et al., Mol Syst Biol 2010; https://www.ncbi.nlm.nih.gov/pubmed/20865005). In the second case, when we were doing the same for human transcription factors based on a collection of PWMs from Jaspar, we did some additional preprocessing on the design matrix in R as well (Lee et al., PNAS 2014; https://www.ncbi.nlm.nih.gov/pubmed/24706889; see supplemental methods).
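
To illustrate the general idea, here is a minimal sketch of L2-penalized (ridge) regression on a toy design matrix with two nearly collinear columns. It is written in plain Python rather than R, it is not the code used in the papers above, and the numbers, the function name ridge_fit, and the penalty value are all made up for illustration. No intercept is fitted.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gaussian elimination."""
    p = len(X[0])
    # Normal-equation matrix A = X^T X + lam * I and right-hand side b = X^T y.
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (b[i] - tail) / A[i][i]
    return beta

# Hypothetical affinity matrix: rows are sequences, columns are two PSAMs
# whose total affinities are almost identical (nearly collinear).
X = [[1.0, 1.01],
     [2.0, 1.98],
     [3.0, 3.02],
     [4.0, 3.99]]
y = [1.1, 1.9, 3.2, 3.9]  # hypothetical expression values

ols = ridge_fit(X, y, lam=0.0)    # unpenalized least squares
ridge = ridge_fit(X, y, lam=1.0)  # L2 penalty; the value 1.0 is arbitrary

print("unpenalized coefficients:", [round(v, 2) for v in ols])
print("ridge coefficients:      ", [round(v, 2) for v in ridge])
```

In this toy case the unpenalized fit produces large coefficients of opposite sign, while the penalized fit splits the signal roughly evenly between the two columns. In practice the penalty strength would have to be chosen from the data, for example by cross-validation.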

I hope this is useful.

Best regards,
Harmen

General Discussion / Re: availability of FeatureREDUCE?

« on: July 18, 2018, 04:34:12 pm »

Hello PK,

The latest/only version of FeatureREDUCE is indeed still the 2015 version on GitHub:
https://github.com/FeatureREDUCE/FeatureREDUCE

You may also be interested in No Read Left Behind (NRLB), the latest algorithm from our lab:
https://github.com/BussemakerLab/NRLB
https://www.ncbi.nlm.nih.gov/pubmed/29610332

Best regards,
Harmen Bussemaker

General Discussion / Re: Error when generating logos in PDF format

« on: November 15, 2017, 06:56:27 pm »

Hi Xiang-Jun,

PDF support seems important to keep, and removing the option "-dTextAlphaBits=4" from config/pkg_settings.cfg completely solved the problem for me for this format, so I can now continue with what I was doing.

However, I agree that it will be best to discontinue GIF support. On my Mac at least, "convert" is not installed by default, and indeed GIF generation with "-format=gif" does not work:

$ LogoGenerator -file=results/psam_001.xml -format=gif
sh: line 6: convert: command not found
system('gs -sOutputFile=- \
   -sDEVICE=png16m \
   -dDEVICEWIDTHPOINTS=340 \
   -dDEVICEHEIGHTPOINTS=213 \
   -q -r96 -dSAFER -dBATCH -dNOPAUSE \
   ./temp_logo.eps \
   | convert png:- ./psam_001.gif
') returns nonzero (32512)
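
As an aside, a quick way to check up front whether the external commands used in this pipeline are available is sketched below in Python. This is only an illustration and not part of LogoGenerator; the function name missing_tools is hypothetical.

```python
import shutil

def missing_tools(tool_names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in tool_names if shutil.which(name) is None]

# "gs" and "convert" are the two external commands in the pipeline above.
needed = ["gs", "convert"]
missing = missing_tools(needed)

if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```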

Thanks for the quick response!

Harmen

General Discussion / Error when generating logos in PDF format

« on: November 15, 2017, 05:36:00 pm »

Hi Xiang-Jun,

When I run these commands:

cd examples/MatrixREDUCE/spellman-alpha
sh commandline.sh
LogoGenerator -file=results/psam_001.xml -format=pdf

I get the following error:

GPL Ghostscript 9.19:

ERROR:
Can't set GraphicsAlphaBits or TextAlphaBits with a vector device.
Unrecoverable error: rangecheck in .putdeviceprops
system('gs -sOutputFile=./psam_001.pdf \
   -sDEVICE=pdfwrite \
   -dPDFSETTINGS=/printer \
   -dEmbedAllFonts=true \
   -dDEVICEWIDTHPOINTS=340 \
   -dDEVICEHEIGHTPOINTS=213 \
   -q -r96 -dTextAlphaBits=4 -dSAFER -dBATCH -dNOPAUSE \
   ./temp_logo.eps
') returns nonzero (65280)

The error seems to be triggered by the "-dTextAlphaBits=4" option. When I run "gs" manually without it, it works fine.

It seems that this is a known problem with more recent versions of "gs". I use version 9.19 on Mac OS 10.11.6.
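
One possible workaround, sketched here in Python, is to add the anti-aliasing option only when the output device is a raster device. This is a hypothetical illustration of the idea, not how LogoGenerator builds its command; the function name gs_arguments and the list of vector devices are my own assumptions, and the list is not exhaustive.

```python
# Assumption: an illustrative, non-exhaustive set of Ghostscript vector devices.
VECTOR_DEVICES = {"pdfwrite", "ps2write", "eps2write"}

def gs_arguments(device, output_path, input_path, width_pt=340, height_pt=213):
    """Build a gs argument list, skipping -dTextAlphaBits for vector devices."""
    args = ["gs", f"-sOutputFile={output_path}", f"-sDEVICE={device}"]
    if device == "pdfwrite":
        # PDF-specific options copied from the failing command above.
        args += ["-dPDFSETTINGS=/printer", "-dEmbedAllFonts=true"]
    args += [f"-dDEVICEWIDTHPOINTS={width_pt}",
             f"-dDEVICEHEIGHTPOINTS={height_pt}",
             "-q", "-r96"]
    if device not in VECTOR_DEVICES:
        # Anti-aliasing is only accepted by raster devices such as png16m.
        args.append("-dTextAlphaBits=4")
    args += ["-dSAFER", "-dBATCH", "-dNOPAUSE", input_path]
    return args

args_pdf = gs_arguments("pdfwrite", "./psam_001.pdf", "./temp_logo.eps")
args_png = gs_arguments("png16m", "./psam_001.png", "./temp_logo.eps")

print(" ".join(args_pdf))
print(" ".join(args_png))
```

The resulting lists could be passed to subprocess.run, but the sketch only prints them so that it runs without Ghostscript installed.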

Thanks,
Harmen