**General Discussion / What is the proper input and reporting score for Transfactivity?**

« **on:** March 11, 2020, 09:55:13 pm »

Hello,

I have been using Transfactivity on my data and am very pleased with the results! They look coherent and seem to give the "right" answer. I am close to publishing, but I wanted to run my current process by you and ask whether it is valid given the statistical inference Transfactivity uses.

I calculated effect sizes for my RNA-seq dataset using sleuth (same idea as DESeq2 or edgeR) and fed the raw effect sizes into Transfactivity to infer TF motif activities. I subset the results to motifs that passed an arbitrary significance threshold in at least one sample, then reported the actual coefficients (f value) of the significant motifs. Because the raw coefficients vary so much in magnitude, I rescaled them by dividing each row by its largest absolute coefficient. As a result, at least one sample in each row has a value of either -1 or +1, and everything else is relative to that.
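To make the rescaling step concrete, here is a minimal Python sketch of what I mean. The function name `rescale_rows` and the numbers are made up for illustration; this is not Transfactivity code, just the row-wise division described above, assuming rows are motifs and columns are samples.

```python
def rescale_rows(coefs):
    """Divide each row (one motif across samples) by its largest absolute value."""
    scaled = []
    for row in coefs:
        peak = max(abs(v) for v in row)
        # Leave an all-zero row untouched to avoid dividing by zero.
        scaled.append([v / peak for v in row] if peak > 0 else list(row))
    return scaled


# Hypothetical coefficients: rows are motifs, columns are samples.
example = [
    [0.5, -2.0, 1.0],
    [3.0, 1.5, -0.75],
]
rescaled = rescale_rows(example)
for row in rescaled:
    print(row)
```

Each row keeps its signs and relative proportions, but the absolute scale is discarded, so rows are no longer comparable to each other in magnitude.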

You can find an illustrated example of this process at this link: https://drive.google.com/open?id=1T6wy3ho5nml7f5tq83u0UDvmsVPw4DF3

However, there are many alternative ways to input the data and ways to report it, and I was just hoping to get your opinions on the options.

Input:

1. Input the effect sizes for every gene, as I did above. The problem here is that many of the largest effect sizes are not actually significant (usually from lowly expressed genes), and this will throw off the TF activity inferences.

2. Input the effect sizes only for genes that passed significance. I presume this will throw off the predictions, though, since I would be feeding it data for only ~500 genes.

3. Input the effect sizes, but use some arbitrary process to remove the signal from non-significant genes, such as discarding only the ~200 genes with large effect sizes but no significant p-value, or alternatively setting the effect size to 0 for any gene that did not pass significance.

4. Input the p-values directly (signed -log10). I tried this and it works decently well, but Transfactivity presumably expects to model the magnitude of the gene expression change, not its significance, which can vary wildly based on things like the number of replicates I used. So this seems wrong.

5. Input the row-normalized TPM matrix (or normalized count matrix) directly, and then, to figure out which motifs associate with my statistical covariates, feed the Transfactivity coefficients into a second regression to identify the motifs that track with my covariates.
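In case it helps to see options 2 to 4 side by side, here is a rough Python sketch of how I would prepare each input. Everything in it is hypothetical: the gene names, the values, the function names, and the 0.05 cutoff are placeholders, and in practice the threshold would probably be applied to adjusted p-values. None of this calls Transfactivity itself.

```python
import math


def keep_significant(effects, pvals, alpha=0.05):
    """Option 2: drop every gene that does not pass the threshold."""
    return {g: b for g, b in effects.items() if pvals[g] < alpha}


def zero_nonsignificant(effects, pvals, alpha=0.05):
    """Option 3 (second variant): keep all genes, zero out the non-significant ones."""
    return {g: (b if pvals[g] < alpha else 0.0) for g, b in effects.items()}


def signed_neg_log10(effects, pvals, floor=1e-300):
    """Option 4: -log10(p), carrying the sign of the effect size."""
    return {
        # The floor guards against log10(0) for p-values that underflow to zero.
        g: math.copysign(-math.log10(max(pvals[g], floor)), b)
        for g, b in effects.items()
    }


# Made-up per-gene effect sizes and p-values.
effects = {"geneA": 2.0, "geneB": -4.0, "geneC": 0.5}
pvals = {"geneA": 0.001, "geneB": 0.5, "geneC": 0.01}

print(keep_significant(effects, pvals))
print(zero_nonsignificant(effects, pvals))
print(signed_neg_log10(effects, pvals))
```

Here geneB has the largest effect size but is not significant, so it disappears under option 2, becomes 0 under option 3, and ends up with the smallest magnitude under option 4.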

Which one would you recommend?

Reporting the output:

1. Report row-normalized coefficients as I did.

2. Report the signed -log p-values. As you can see in the PDF example, some motifs are MUCH more significant than others, and this distinction is lost when reporting the coefficients.

Thanks again for writing and maintaining such a great tool. I hope to hear your opinions.
