Messages - xiangjun

Pages: [1] 2
1
General Discussion / Re: Error when generating logos in PDF format
« on: February 11, 2018, 09:53:30 am »
Quote
I can't directly read LogoGenerator-created EPS file on Linux. Am I missing something?

Could you be more specific? Please provide a concrete example so that I (and others) can reproduce the problem.

Best regards,

Xiang-Jun

2
General Discussion / Re: Error converting PWM to PSAM
« on: December 19, 2017, 10:42:12 pm »
Hi Jason,

I've updated the REDUCE Suite to version 2.2.5-2017dec19, in which Convert2PSAM has an additional source option for PFM. An example run on YDR146C (the one you posted) is shown below:

Code: [Select]
Convert2PSAM -source=pf -inp=$REDUCE_SUITE/data/formats/pfm_YDR146C.dat -psam=stdout
Please give it a try, and let me know if you have any further problems.

Xiang-Jun

3
General Discussion / Re: Error converting PWM to PSAM
« on: December 19, 2017, 01:42:50 pm »
The PFM is row-wise, while the PWM format accepted by Convert2PSAM is column-wise, in order of A, C, G, and T. See $REDUCE_SUITE/data/formats/pwm_ex.dat for an example.
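
To make the difference in orientation concrete, here is a minimal Python sketch (not part of the REDUCE Suite) that transposes a row-wise matrix into the column-wise layout. The function name and the toy counts are made up for illustration, and the A, C, G, T row order of the input is an assumption to check against your own file.

```python
def rows_to_columns(rows):
    """Transpose a row-wise matrix (one row per base) into a
    column-wise one (one row per position, columns A, C, G, T)."""
    lengths = {len(r) for r in rows}
    if len(rows) != 4 or len(lengths) != 1:
        raise ValueError("expected 4 rows (A, C, G, T) of equal length")
    return [list(col) for col in zip(*rows)]


# Hypothetical toy counts; rows are assumed to be in A, C, G, T order.
pfm = [
    [5, 0, 1],  # A
    [0, 5, 1],  # C
    [0, 0, 2],  # G
    [0, 0, 1],  # T
]

for position in rows_to_columns(pfm):
    print("\t".join(str(v) for v in position))
```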

I'll revise Convert2PSAM to accept the PFM format, so you do not need to do extra work.

Xiang-Jun

4
General Discussion / Re: Error converting PWM to PSAM
« on: December 19, 2017, 01:12:49 pm »
Hi Jason,

I've looked into the issue. As expected, it is indeed yet another PWM variant that needs special attention to be converted to PSAM.

One example (YDR146C_569.pwm) from Expert_PWMs.tar.gz is shown below:

Code: [Select]
A 0.381537584575116 1.07668300283655 -800 -800 1.68965987938785 -800 -800 0.989220160345073
T -0.31034012061215 -3.01077985606558 -800 -800 -800 -800 -800 -2.01077983731055
G 0.395928676331139 -2.30451105912229 -800 -800 -800 2.39592867633114 -800 -1.30451104036726
C -0.982582949230903 0.502843879011055 2.39592867633114 2.39592867633114 -800 -800 2.39592867633114 0.280451460353898

It has negative values, including -800, which is presumably a cutoff value. On the other hand, entries in a PSAM should all be positive. So we need a way to convert the negative values to positive ones.

In a similar case, the TAMO (as in the MacIsaac dataset) format distributed with the REDUCE Suite looks like the following:

Code: [Select]
Log-odds matrix for Motif   0 rGAA..TtctrGAA (0)
#        0         1         2         3         4         5         6         7         8         9        10        11        12        13
#A     0.743    -1.052     1.647     1.443    -1.558    -0.374    -3.255    -5.001    -0.793    -2.480     1.175    -3.678     1.635     1.629
#C    -1.105   -10.336    -8.324    -3.641     0.691     0.311    -1.463    -0.208     1.931    -1.053    -2.000   -10.641    -2.819    -4.350
#T    -3.868    -3.114    -4.032    -2.297     0.288    -0.426     1.428     1.320    -2.576     1.393    -5.066    -3.566    -5.030    -3.764
#G     0.967     2.100    -4.305    -1.267     0.140     0.632    -1.563    -1.879    -2.088    -2.285     0.357     2.321    -8.368    -4.399

Here, Convert2PSAM performs a 2**score transformation so that the scores become positive.
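
As a rough illustration of the 2**score idea, here is a minimal Python sketch. It is not the Convert2PSAM source code: the function name is hypothetical, and the optional rescaling so that the best base at a position becomes 1 is my assumption about how one would arrive at PSAM-like values.

```python
def log_odds_to_positive(column_scores, normalize=True):
    """Map the log-odds scores at one position to positive values via 2**score."""
    values = [2.0 ** s for s in column_scores]
    if normalize:
        # Assumption: rescale so that the largest entry at this position is 1.
        top = max(values)
        values = [v / top for v in values]
    return values


# Position 1 of the TAMO example above, in the file's A, C, T, G row order.
scores = [-1.052, -10.336, -3.114, 2.100]
print(log_odds_to_positive(scores))

# Even a very negative score still maps to a positive double-precision number.
print(2.0 ** -800 > 0.0)
```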

Should we apply a similar transformation to the Expert_PWMs.tar.gz data? Harmen, what's your take?

Please let me know your opinions.

Xiang-Jun


PS. For the record, it is worth noting that the $REDUCE_SUITE/data/formats/ folder contains several other commonly used PWM-like files that can be handled by Convert2PSAM.

Code: [Select]
#transfac.dat
ID any_old_name_for_motif_1
BF species_name_for_motif_1
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
03      3      0      1      1      A
04      0      5      0      0      C
05      5      0      0      0      A
06      0      0      4      1      G
07      0      1      4      0      G
08      0      0      0      5      T
09      0      0      5      0      G
10      0      1      2      2      K
11      0      2      0      3      Y
12      1      0      3      1      G
XX
//

Code: [Select]
#jaspar_ex1.dat
 1  6  1  0 13  0  6  0 13 15  2  5
 4  0  0  0  1 15  0  9  4  0  3  5
 8 12  0  3  2  1 12  0  1  1  1  3
 5  0 17 15  2  2  0  9  0  2 12  5

Code: [Select]
#jaspar_ex2.dat
A  [ 1  6  1  0 13  0  6  0 13 15  2  5 ]
C  [ 4  0  0  0  1 15  0  9  4  0  3  5 ]
G  [ 8 12  0  3  2  1 12  0  1  1  1  3 ]
T  [ 5  0 17 15  2  2  0  9  0  2 12  5 ]

Convert2PSAM was created explicitly to handle such messy real-world cases.

5
General Discussion / Re: Error converting PWM to PSAM
« on: December 18, 2017, 02:14:05 pm »
Hi Jason,

Thanks for using the REDUCE Suite and for posting on the Forum.

The error message seems to hint at a PWM format variant that Convert2PSAM cannot handle. I'll look into the details, and revise Convert2PSAM as necessary. I'll post back on the Forum, probably by tomorrow.

Best regards,

Xiang-Jun

6
General Discussion / Re: Error when generating logos in PDF format
« on: November 15, 2017, 11:09:40 pm »
As a follow-up, the REDUCE Suite has been updated to v2.2.4-2017nov16. The LogoGenerator bug for PDF output has been fixed. The obsolete GIF output has been removed to avoid a dependency on the convert program from ImageMagick. The default PNG format is the one to use with HTMLSummary-generated webpages. The LogoGenerator documentation has also been revised.

Some examples:

Code: [Select]
# By default, the output is in PNG format
LogoGenerator -file=$REDUCE_SUITE/data/formats/psam_ex.dat -logo=sample.png
# Using the -format=pdf option for PDF output
LogoGenerator -file=$REDUCE_SUITE/data/formats/psam_ex.dat -logo=sample.pdf -format=pdf
# Output in the raw EPS format with -format=eps
LogoGenerator -file=$REDUCE_SUITE/data/formats/psam_ex.dat -logo=sample.eps -format=eps

The LogoGenerator utility in the REDUCE Suite is a general-purpose, robust logo generator for DNA or RNA base sequences. It creates a logo in the vector EPS format, which can be easily converted to other vector or raster image formats using numerous third-party tools. Internally, LogoGenerator takes advantage of the widely available gs program (Ghostscript).

It is worth noting that on Mac OS X, the Preview program can directly read a LogoGenerator-created EPS file and convert it to PDF format. On Linux and Windows, the situation should be similar.
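
Where no viewer opens the EPS file directly, Ghostscript itself can do the conversion. Below is a minimal Python sketch that only assembles such a gs command line. The helper name is hypothetical, gs is assumed to be on the PATH (on Windows the executable usually has a different name, so it is a parameter), and this is not necessarily the call LogoGenerator makes internally.

```python
import shutil


def gs_eps_to_pdf_command(eps_path, pdf_path, gs="gs"):
    """Build a Ghostscript command line that converts an EPS file to PDF."""
    return [
        gs,
        "-dBATCH", "-dNOPAUSE", "-dQUIET",  # run non-interactively
        "-dEPSCrop",                        # crop the page to the EPS bounding box
        "-sDEVICE=pdfwrite",                # vector PDF output device
        f"-sOutputFile={pdf_path}",
        eps_path,
    ]


cmd = gs_eps_to_pdf_command("sample.eps", "sample.pdf")
print(" ".join(cmd))
if shutil.which("gs") is None:
    print("note: gs was not found on PATH")
```

To actually run the conversion, pass the list to subprocess.run(cmd, check=True) once sample.eps exists.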

Xiang-Jun

7
General Discussion / Re: Error when generating logos in PDF format
« on: November 15, 2017, 07:04:41 pm »
Hi Harmen,

Thanks for your quick feedback.

I'll update the software code with 'gif' output removed, but keep the PDF option. A new release will be made available on the download page late tonight.

Best regards,

Xiang-Jun


8
General Discussion / Re: Error when generating logos in PDF format
« on: November 15, 2017, 06:07:43 pm »
Hi Harmen,

Thanks for posting on the Forum!

Yes, I can reproduce the error message with regard to generating the logos in PDF format. It is indeed due to the Ghostscript "-dTextAlphaBits=4" option you reported. I am using Ghostscript 9.21.

I remember taking the "-dTextAlphaBits=4" option from docs/examples I read somewhere. Now that we know the problem, we have the following options:

  • Simply remove the "-dTextAlphaBits=4" option from the system call.
  • Or we can remove the support of the PDF output format (from the documentation).

While we are here, I'd also like to remove the largely out-of-date GIF output format. By doing so, we also get rid of the dependency on the convert program from ImageMagick.

What's your take? Please let me know, and I will update the code for a new release late tonight (or tomorrow).

Xiang-Jun

9
Documentation / Re: Set up the REDUCE Suite
« on: June 19, 2017, 04:39:58 pm »
Hi Rahul,

Thanks for your feedback. Step #5 should work as is if step #4 has been performed as advertised, which adds the bin/ directory to PATH. I've slightly refined the instruction for step #4 to make it clearer.

Executing 'bin/MatrixREDUCE -h' assumes one is at the $REDUCE_SUITE root directory.

Xiang-Jun

10
General Discussion / Re: Affinity score calculation
« on: January 05, 2017, 01:30:05 pm »
Hi,

Thanks for using the REDUCE Suite and for posting your question(s) on the Forum.

The concept of affinity in the REDUCE Suite is simple, but technical. As is often the case, the idea can be best illustrated with a concrete example.

Let's suppose we have a PSAM (sample-psam.xml) as shown below:

Code: [Select]
<matrix_reduce>

<directionality>forward</directionality>
<psam_length>6</psam_length>

<psam>
# A            C            G            T
# +============+============+============+=======
  1            0.25         0.1          0.1   #1
  0.1          0.5          0.2          1     #2
  0.1          1            0.1          0.1   #3
  1            0.1          0.6          0.1   #4
  0.2          0.6          1            0.1   #5
  0.1          0.1          1            0.3   #6
</psam>
</matrix_reduce>

And a short base sequence (sample-seq.txt) as below:

Code: [Select]
>sample
GTCATGGT

Since the PSAM has a length of 6, and the single sequence has 8 bases, there are three sliding windows, as detailed below:

Code: [Select]
w1: GTCATG      --- affinity of w1: 0.1 * 1 * 1 * 1 * 0.1 * 1 = 0.01
w2:   TCATGG    --- affinity of w2: 0.1 * 0.5 * 0.1 * 0.1 * 1 * 1 = 0.0005
w3:     CATGGT  --- affinity of w3: 0.25 * 0.1 * 0.1 * 0.6 * 1 * 0.3 = 0.00045
---- sum of affinity = 0.01095

If you run:
Code: [Select]
AffinityProfile -seq=sample-seq.txt -psam=sample-psam.xml
you will find the following content in the default output file seq_psam.dat:
Code: [Select]
        sample-psam.xml
sample  0.01095
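
The hand calculation above can be cross-checked with a few lines of Python. This is only a sketch of the forward-strand product-and-sum described in this post, not the AffinityProfile source code; the function and variable names are mine.

```python
# Rows are PSAM positions 1..6; columns are A, C, G, T (values from sample-psam.xml).
PSAM = [
    [1.0, 0.25, 0.1, 0.1],
    [0.1, 0.5, 0.2, 1.0],
    [0.1, 1.0, 0.1, 0.1],
    [1.0, 0.1, 0.6, 0.1],
    [0.2, 0.6, 1.0, 0.1],
    [0.1, 0.1, 1.0, 0.3],
]
COLUMN = {"A": 0, "C": 1, "G": 2, "T": 3}


def window_affinities(seq, psam):
    """Return the affinity of every forward-strand window of width len(psam)."""
    width = len(psam)
    affinities = []
    for start in range(len(seq) - width + 1):
        affinity = 1.0
        for pos, base in enumerate(seq[start:start + width]):
            affinity *= psam[pos][COLUMN[base]]
        affinities.append(affinity)
    return affinities


affs = window_affinities("GTCATGGT", PSAM)
for i, a in enumerate(affs, 1):
    print(f"w{i}: {a:.6g}")
print(f"sum of affinity: {sum(affs):.6g}")
```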

There are quite a few variations for the calculation of affinity in AffinityProfile, but the above example covers the essence. Since the REDUCE Suite is open source, you can and are encouraged to dive into the details.

Hope this helps,

Xiang-Jun


11
General Discussion / Re: What do the bases mean in PSAM?
« on: December 01, 2016, 10:56:56 am »
Dear Pan Shen,

Thanks for using the REDUCE Suite and for asking your questions on the Forum.

The W in the converted PSAM notation means A or T (Weak, since the A-T Watson-Crick pair has two H-bonds, compared to three in a G-C pair). Not surprisingly, S (for Strong) represents G or C.

More details on "Nucleic acid notation" can be found on Wikipedia, among many other online resources.
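
For quick reference, here is a small Python sketch of the standard IUPAC nucleotide codes, with a hypothetical helper that expands a degenerate motif into the plain sequences it stands for. The example motif GAWS is made up.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(motif):
    """List every plain A/C/G/T sequence matched by an IUPAC motif."""
    choices = (IUPAC[symbol] for symbol in motif.upper())
    return ["".join(p) for p in product(*choices)]


print(expand("GAWS"))
```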

Hope this helps.

Xiang-Jun

12
Documentation / Other utility programs
« on: September 29, 2016, 01:27:04 pm »
The REDUCE Suite distribution also includes the following auxiliary programs. Simply typing the corresponding program name with -h (e.g., Convert2PSAM -h) should provide sufficient information to get started.

Convert2PSAM
As its name suggests, Convert2PSAM is a utility program that converts other commonly used motif (pattern) representations in nucleic acid sequences to PSAM, which is unique to the REDUCE Suite. It can also be used to standardize the various formats to a simplified PWM format for easy communication.

Topo2Dictfile
The default topological pattern mechanism can be used to specify sequence motifs in a compact, convenient, and flexible way. However, it defines the motifs implicitly, has a length limit (15 non-gap positions), and does not take the IUPAC degenerate symbols into consideration. As an example, X6 stands for exactly 4^6 = 4096 combinations, from AAAAAA, AAAAAC, ... to TTTTTT. Sometimes, we may need more control by specifying the motifs explicitly in a dictionary file, with arbitrary length and IUPAC symbols. Topo2Dictfile facilitates this by first generating a motif dictionary according to user-specified topological patterns, which can then be edited as needed, e.g., by deleting some motifs, adding more, or introducing IUPAC degeneracy symbols.
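
The X6 count can be verified with a short Python sketch that enumerates all 6-mers. It only illustrates the combinatorics; it does not reproduce the Topo2Dictfile output format, and the function name is mine.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Enumerate every sequence of length k over the alphabet, in sorted order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


kmers = all_kmers(6)
print(len(kmers), kmers[0], kmers[1], kmers[-1])
```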

ProcessFASTA
ProcessFASTA is a simple utility program to process a sequence file in FASTA format, e.g., to select a list of sequences based on ids, convert sequences to their reverse complement, or combine id and sequence into one line. While such functionalities are surely available in various heavy-duty toolboxes/environments (BioPerl, EMBOSS, BioConductor etc.), none fits our needs perfectly. We have thus developed this simple utility program mainly for our own convenience.

ProcessTdat
This is a simple utility program to process a tab-delimited text file, e.g., to extract a subset, perform a log transformation, or sort entries by id. It was created following the same idea as for ProcessFASTA.

ExtractWindows
A simple utility program to extract sequence fragments from a sequence file, probably of a chromosome.

psamdir2list
A Perl utility program to generate a list of the PSAMs in a given directory. The resultant list can be fed into AffinityProfile or Transfactivity.

13
Documentation / Transfactivity
« on: September 29, 2016, 01:04:45 pm »
Transfactivity is a utility program that performs multiple-linear regressions of measurements (gene expression or binding data) against affinities. As with AffinityProfile, the affinities can be deduced either from a list of PSAMs or IUPAC motifs, or a single PSAM (-psam=one_PSAM_file) or an IUPAC motif (-motif=one_IUPAC_motif) specified directly on the command-line. The PSAMs can be from a MatrixREDUCE or MotifREDUCE run, or a collection of pseudo-PSAMs from literature (as in the $REDUCE_SUITE/data/PSAMs/ directory).
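
To picture what regressing measurements against affinities means, here is a minimal pure-Python sketch of the single-predictor case (one PSAM, one experiment). It is an ordinary least-squares fit written for illustration only, not the Transfactivity implementation; the function name and the toy numbers are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ssy = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - sse / ssy


# Hypothetical per-sequence affinity sums and measurements (toy data).
affinities = [0.0, 1.0, 2.0, 3.0]
measurements = [1.0, 3.0, 5.0, 7.0]
slope, intercept, r2 = linear_fit(affinities, measurements)
print(f"slope={slope:.3f} intercept={intercept:.3f} r2={r2:.3f}")
```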

Transfactivity [options] -sequence=seqfile -measurement=measfile \
                         -psam=one_PSAM_file | -psam_list=list_of_PSAMs |
                         -motif=one_IUPAC_motif | -motif_list=list_of_motifs

  Required parameters:
    -sequence=seqfile      --- sequence file in FASTA format
    -measurement=measfile  --- measurement data file in tab-delimited format
    -psam=one_PSAM_file    --- file name of one PSAM
    -psam_list=list_of_PSAMs --- file name containing a list of PSAMs
    -motif=IUPAC_motif     --- one IUPAC motif
    -motif_list=list_of_motifs --- file name containing a list of IUPAC motifs

  Optional parameters:
    [-damid]               --- short-hand form for -motif=GATC
    [-output=dir_name]     --- path to the output directory (./)
    [-copy]                --- copy CSS, JavaScript and image files to the above
                               output directory to make the HTML self-contained
    [-univariate]          --- switch to run univariate fit only
    [-acgt]                ---  i.e., -motif_list=$REDUCE_SUITE/data/acgt.dat
    [-resid_file=file_name] --- name of residuals
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                   with -motif, default to leading strand
                                   with -psam, default to PSAM setting
    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                               stdout or a specific file (stderr)
    [-help]                --- print out this help message

Usage:
    Transfactivity \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -psam_list=psams.list

    Transfactivity \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -motif_list=motifs.list


Note:

Given a directory that contains all the PSAMs one is interested in, the PSAM-list file can be conveniently generated with the Perl script "psamdir2list". This trick applies to Transfactivity here as well as to AffinityProfile.

For example, the PSAM list in $REDUCE_SUITE/examples/Transfactivity/MacIsaac.list was generated as:

Code: [Select]
# Within directory $REDUCE_SUITE/examples/Transfactivity
psamdir2list ../../data/PSAMs/MacIsaac MacIsaac.list

As another example, the Jaspar PSAM list can be generated as:

Code: [Select]
psamdir2list $REDUCE_SUITE/data/PSAMs/Jaspar jaspar_psam.lst

14
Documentation / AffinityProfile
« on: September 29, 2016, 12:54:18 pm »
AffinityProfile is designed to scan a sequence file against a list of PSAMs or IUPAC motifs to get single-base-resolution affinity profiles. For convenience, it also allows a single PSAM (-psam=one_PSAM_file) or an IUPAC motif (-motif=one_IUPAC_motif) to be specified directly on the command line. The PSAMs can be from a MatrixREDUCE or MotifREDUCE run, or a collection of pseudo-PSAMs from literature (as in the $REDUCE_SUITE/data/PSAMs/ directory).

AffinityProfile [options] -sequence=seqfile \
                         -psam=one_PSAM_file | -psam_list=list_of_PSAMs |
                         -motif=one_IUPAC_motif | -motif_list=list_of_motifs

  Required parameters:
    -sequence=file_name  --- name of sequence file in FASTA format
    -psam=one_PSAM_file    --- file name of one PSAM
    -psam_list=list_of_PSAMs --- file name containing a list of PSAMs
    -motif=IUPAC_motif     --- one IUPAC motif
    -motif_list=list_of_motifs --- file name containing a list of IUPAC motifs

  Optional parameters:
    [-threshold=float]   --- threshold of affinity for output (0.0)
    [-output=dir_name]   --- path to the output directory (./)
    [-prefix=string]     --- prepended to output profile name (aff_)
    [-affsum=string]     --- file of total affinity per sequence (seq_psam.dat)
    [-detail]            --- also output detailed affinity along each sequence
    [-ids=string]        --- a ',' or ';' delimited list of IDs
    [-column]            --- used with -ids, set profile column-wise for each id
    [-normalize]         --- linear re-scale (per PSAM) the maximum profile to 1.0
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                   with -motif, default to leading strand
                                   with -psam, default to PSAM setting

  Usage:
      (1) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -psam_list=psams.list
      (2) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -motif=aaaccct
      (3) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -motif=aaaga -ids='YAL001C;YAL002W' -column


Notes:

  • By default, the reported affinities take into account the threshold (default 0) specified on the command line. Thus, for each sliding window, if the affinity is less than the threshold, it is neither output for that window nor added to the per-sequence sum (as in file seq_psam_thr.dat, see below).
  • AffinityProfile also outputs two files with fixed names: seq_psam.dat (which sums up all affinities, even those below the threshold, for reference and comparison) and seq_psam_thr.dat (which takes the threshold into account). Both contain the sum of affinities per sequence in a tab-delimited format that can be fed directly into Transfactivity.
  • Following each run of MatrixREDUCE or MotifREDUCE, two files with fixed names, psams.list and motifs.list, are also available. They can be fed into AffinityProfile, as shown in the first example above.
  • This is yet another example illustrating how the REDUCE Suite has been designed with tools that are seamlessly inter-connected to allow for great flexibility.
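
The difference between the two per-sequence sums in the notes above can be sketched in Python as follows. This only mirrors the rule as described (windows below the threshold are left out of the thresholded sum); the function name, the window values, and the threshold of 0.001 are hypothetical.

```python
def affinity_sums(window_affinities, threshold=0.0):
    """Return (sum over all windows, sum over windows at or above threshold)."""
    total = sum(window_affinities)
    total_thr = sum(a for a in window_affinities if a >= threshold)
    return total, total_thr


# Hypothetical window affinities along one sequence.
total, total_thr = affinity_sums([0.01, 0.0005, 0.00045], threshold=0.001)
print(f"all windows: {total:.6g}   at or above threshold: {total_thr:.6g}")
```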

15
Documentation / OptimizePSAM
« on: September 29, 2016, 12:41:18 pm »
This program performs a single point optimization of either an initial (pseudo-) PSAM or a seed motif against the measurement file and sequence. Internally, it uses exactly the same Levenberg-Marquardt non-linear least squares fitting algorithm as in MatrixREDUCE.

OptimizePSAM [options] -sequence=seqfile -measurement=measfile \
                       -psam=PSAM_file | -motif=IUPAC_Motif

  Required parameters:
    -sequence=file_name    --- name of a FASTA sequence file
    -meas=measfile         --- measurement (expression/binding) file in tab-delimited format
    -psam=PSAM_file        --- PSAM file to be optimized
    -motif=seed_motif      --- Seed IUPAC motif sequence to be optimized

  Optional parameters:
    [-output=dir_name]     --- path to the output directory (./)
    [-p_value=float]       --- threshold to decide if optimized PSAM is significant (0.001)
    [-filename=file]       --- name of the optimized PSAM
    [-strand=integer]      ---  1 |+1 |F | L for leading strand (1);
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                0 | A |D     auto-detection (check 1 and 2)
    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                               stdout or a file (stderr)
    [-help]                --- print out this help message

  Usage:
    OptimizePSAM \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -motif=ACGCGT -file=ACGCGT.psam


Notes:

  • In PSAM format, the initial seed motif ACGCGT is expressed as follows:
    # A            C            G            T             # no. opt
    # +============+============+============+============ # ==+===+==
      1            0            0            0             #   1   A
      0            1            0            0             #   2   C
      0            0            1            0             #   3   G
      0            1            0            0             #   4   C
      0            0            1            0             #   5   G
      0            0            0            1             #   6   T

  • The optimized PSAM in file ACGCGT.psam is as follows. Note in particular how the Ws change from the 1s and 0s of the initial seed motif (above) to fractions, with a maximum of 1 at each position of the optimized PSAM.
    # A            C            G            T             # no. opt
    # +============+============+============+============ # ==+===+==
      1            0.143386     0.156974     0.332267      #   1   A
      2.38995e-06  1            0.133257     6.623e-18     #   2   C
      0.203947     0.0136109    1            6.43632e-11   #   3   G
      3.39323e-14  1            4.38946e-14  3.19148e-17   #   4   C
      0.0655988    0.122631     1            1.13119e-13   #   5   G
      0.422826     0.221149     0.182984     1             #   6   T

  • As shown in the following diagnostic message from running OptimizePSAM, this optimization step increases the fitted R2 from 0.0414328 to 0.0552883, and the PSAM is significant.
    Best seed experiment:
       number of tested candidate experiments: 18
       intercept: coef=-0.12248   t-value=-18.4713   p-value=5.77026e-74
       slope:     coef=+0.363975   t-value=+15.4353   p-value=1.18198e-52
       r2=0.0414328   SSY=1323.85   SSE=1269   SSR=54.8506
       matches[matched-ids/total-ids]: 348[307/5514]   experiment: alpha_factor_release_sample016 [4]
           and of sequence on forward strand
    Optimizing:
         20 (1250.65): converged with gradient: 0.0349271 <= 0.05
    PSAM linear fit statistics:
       intercept: coef=-0.186363   t-value=-23.1987   p-value=1.12291e-113
       slope:     coef=+0.325095   t-value=+17.9606   p-value=3.83258e-70
       r2=0.0552883   SSY=1323.85   SSE=1250.65   SSR=73.1932
    Checking PSAM significance:
       |r|=0.235135   r0=0.0688157   sigma=0.00888813   t_value=18.7125
       E-value=4.19638e-76
       This PSAM is significant (E-value smaller than specific cutoff of 0.001)
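
As a sanity check on the diagnostic output above, the reported r2 values follow from the sums of squares (r2 = SSR/SSY = 1 - SSE/SSY). The Python sketch below recomputes them from the rounded numbers printed in the log, so agreement is only to about four decimal places; the function name is mine.

```python
import math


def r_squared(ssy, sse):
    """Coefficient of determination from total (SSY) and residual (SSE) sums of squares."""
    return 1.0 - sse / ssy


seed_r2 = r_squared(1323.85, 1269.0)        # "Best seed experiment" line
optimized_r2 = r_squared(1323.85, 1250.65)  # "PSAM linear fit statistics" line
print(f"seed r2 = {seed_r2:.4f}")
print(f"optimized r2 = {optimized_r2:.4f}")
print(f"|r| = {math.sqrt(optimized_r2):.4f}")
```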


Created and maintained by Dr. Xiang-Jun Lu [律祥俊]. See also http://forum.x3dna.org and http://x3dna.org