Recent Posts

General Discussion / Re: What do the bases mean in PSAM?

« Last post by melodypluto on December 05, 2016, 09:33:10 pm »

Thank you very much!

General Discussion / Re: What do the bases mean in PSAM?

« Last post by xiangjun on December 01, 2016, 10:56:56 am »

Dear Pan Shen,

Thanks for using the REDUCE Suite and for asking your questions on the Forum.

The W in the converted PSAM notation means A or T (Weak, since the A-T Watson-Crick pair has two H-bonds, compared to three in a G-C pair). Not surprisingly, S (for Strong) represents G or C.

More details on "Nucleic acid notation" can be found on Wikipedia, among many other online resources.
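
For quick reference, the standard IUPAC nucleotide codes can be written down as a small lookup table. This is only an illustrative Python sketch of the notation itself; the names IUPAC and expand are made up here and are not part of the REDUCE Suite.

```python
# Standard IUPAC nucleotide codes and the plain bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand(symbol):
    """Return the set of plain bases matched by one IUPAC symbol."""
    return set(IUPAC[symbol.upper()])

print(sorted(expand("W")))  # ['A', 'T']
print(sorted(expand("S")))  # ['C', 'G']
```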

Hope this helps.

Xiang-Jun

General Discussion / What do the bases mean in PSAM?

« Last post by melodypluto on December 01, 2016, 01:20:13 am »

Dear administrator,
I used Convert2PSAM to convert a PWM into a PSAM.
CODE: Convert2PSAM -source=PW -inpfile=/mnt/tools/REDUCE-Suite-v2.2/data/formats/pwm_ex.dat -psamfile=psam=pw2psam.xml
However, some bases are named W instead of one of A, C, G, or T.
So I wonder what W means in a PSAM. Are there other types of bases in a PSAM? If so, what do they mean?
Could you show me the details about the bases in a PSAM?

Thank you,
Pan Shen

Documentation / Other utility programs

« Last post by xiangjun on September 29, 2016, 01:27:04 pm »

The REDUCE Suite distribution also includes the following auxiliary programs. Simply typing the corresponding program name with -h (e.g., Convert2PSAM -h) should provide sufficient information to get started.

Convert2PSAM

As its name suggests, Convert2PSAM is a utility program that converts other commonly used motif (pattern) representations in nucleic acid sequences to PSAM, which is unique to the REDUCE Suite. It can also be used to standardize the various formats to a simplified PWM format for easy communication.

Topo2Dictfile

The default topological pattern mechanism can be used to specify sequence motifs in a compact, convenient, and flexible way. However, it defines the motifs implicitly, has a length limit (15 non-gap positions), and does not take the IUPAC degenerate symbols into account. As an example, X6 stands for exactly 4^6 = 4096 combinations, from AAAAAA, AAAAAC, ... to TTTTTT. Sometimes, we may need more control by specifying the motifs explicitly in a dictionary file, with arbitrary length and IUPAC symbols. Topo2Dictfile facilitates this by first generating a motif dictionary according to user-specified topological patterns, which can then be edited as needed, e.g., by deleting some motifs, adding more, or introducing IUPAC degeneracy symbols.
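
To make the X6 count concrete, here is a minimal Python sketch that enumerates every motif implied by a gap-free block of k positions. It only illustrates the combinatorics; the function name expand_block is hypothetical, and the actual dictionary file format written by Topo2Dictfile is not shown here.

```python
from itertools import product

def expand_block(k, alphabet="ACGT"):
    """Yield every k-mer over the alphabet, in A < C < G < T order."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)

hexamers = list(expand_block(6))
print(len(hexamers))                           # 4096
print(hexamers[0], hexamers[1], hexamers[-1])  # AAAAAA AAAAAC TTTTTT
```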

ProcessFASTA

ProcessFASTA is a simple utility program to process a sequence file in FASTA format, e.g., to select a list of sequences based on ids, convert to the reverse complement, or combine id and sequence into one line. While such functionalities are surely available in various heavy-duty toolboxes/environments (BioPerl, EMBOSS, BioConductor etc.), none fits our needs perfectly. We have thus developed this simple utility program mainly for our own convenience.
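
As an illustration of one of the operations mentioned above, the reverse complement of a plain ACGT sequence can be sketched in a few lines of Python. This is not how ProcessFASTA is implemented; the function name is made up, and IUPAC degenerate symbols are deliberately left unhandled.

```python
# Translation table for plain bases; case is preserved.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("AAACCCT"))  # AGGGTTT
print(reverse_complement("ACGCGT"))   # ACGCGT (its own reverse complement)
```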

ProcessTdat

This is a simple utility program to process a tab-delimited text file, e.g., to extract a subset, perform a log transformation, or sort entries by id. It was created following the same idea as ProcessFASTA.

ExtractWindows

A simple utility program to extract sequence fragments from a sequence file, typically that of a chromosome.

psamdir2list

A Perl utility program to generate a list of the PSAMs in a given directory. The resultant list can be fed into AffinityProfile or Transfactivity.

Documentation / Transfactivity

« Last post by xiangjun on September 29, 2016, 01:04:45 pm »

Transfactivity is a utility program that performs multiple-linear regressions of measurements (gene expression or binding data) against affinities. As with AffinityProfile, the affinities can be deduced either from a list of PSAMs or IUPAC motifs, or a single PSAM (-psam=one_PSAM_file) or an IUPAC motif (-motif=one_IUPAC_motif) specified directly on the command-line. The PSAMs can be from a MatrixREDUCE or MotifREDUCE run, or a collection of pseudo-PSAMs from literature (as in the $REDUCE_SUITE/data/PSAMs/ directory).

Transfactivity [options] -sequence=seqfile -measurement=measfile \
                         -psam=one_PSAM_file | -psam_list=list_of_PSAMs |
                         -motif=one_IUPAC_motif | -motif_list=list_of_motifs

  Required parameters:
    -sequence=seqfile      --- sequence file in FASTA format
    -measurement=measfile  --- measurement data file in tab-delimited format
    -psam=one_PSAM_file    --- file name of one PSAM
    -psam_list=list_of_PSAMs --- file name containing a list of PSAMs
    -motif=IUPAC_motif     --- one IUPAC motif
    -motif_list=list_of_motifs --- file name containing a list of IUPAC motifs

  Optional parameters:
    [-damid]               --- short-hand form for -motif=GATC
    [-output=dir_name]     --- path to the output directory (./)
    [-copy]                --- copy CSS, JavaScript and image files to the above
                               output directory to make the HTML self-contained
    [-univariate]          --- switch to run univariate fit only
    [-acgt]                ---  i.e., -motif_list=$REDUCE_SUITE/data/acgt.dat
    [-resid_file=file_name] --- name of the residuals file
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                   with -motif, default to leading strand
                                   with -psam, default to PSAM setting
    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                               stdout or a specific file (stderr)
    [-help]                --- print out this help message

Usage:
    Transfactivity \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -psam_list=psams.list

    Transfactivity \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -motif_list=motifs.list

Note:

Given a directory that contains all the PSAMs one is interested in, the PSAM-list file can be conveniently generated with the Perl script "psamdir2list". This trick applies to Transfactivity here as well as to AffinityProfile.

For example, the PSAM list in $REDUCE_SUITE/examples/Transfactivity/MacIsaac.list was generated as:

Code:

# Within directory $REDUCE_SUITE/examples/Transfactivity
psamdir2list ../../data/PSAMs/MacIsaac MacIsaac.list

As another example, the Jaspar PSAM list can be generated as:

Code:

psamdir2list $REDUCE_SUITE/data/PSAMs/Jaspar jaspar_psam.lst

Documentation / AffinityProfile

« Last post by xiangjun on September 29, 2016, 12:54:18 pm »

AffinityProfile is designed to scan a sequence file against a list of PSAMs or IUPAC motifs to get affinity profiles at single-base resolution. For convenience, it also allows for a single PSAM (-psam=one_PSAM_file) or an IUPAC motif (-motif=one_IUPAC_motif) to be specified directly on the command line. The PSAMs can be from a MatrixREDUCE or MotifREDUCE run, or a collection of pseudo-PSAMs from literature (as in the $REDUCE_SUITE/data/PSAMs/ directory).

AffinityProfile [options] -sequence=seqfile \
                         -psam=one_PSAM_file | -psam_list=list_of_PSAMs |
                         -motif=one_IUPAC_motif | -motif_list=list_of_motifs

  Required parameters:
    -sequence=file_name  --- name of sequence file in FASTA format
    -psam=one_PSAM_file    --- file name of one PSAM
    -psam_list=list_of_PSAMs --- file name containing a list of PSAMs
    -motif=IUPAC_motif     --- one IUPAC motif
    -motif_list=list_of_motifs --- file name containing a list of IUPAC motifs

  Optional parameters:
    [-threshold=float]   --- threshold of affinity for output (0.0)
    [-output=dir_name]   --- path to the output directory (./)
    [-prefix=string]     --- prepended to output profile name (aff_)
    [-affsum=string]     --- file of total affinity per sequence (seq_psam.dat)
    [-detail]            --- also output detailed affinity along each sequence
    [-ids=string]        --- a ',' or ';' delimited list of IDs
    [-column]            --- used with -ids, set profile column-wise for each id
    [-normalize]         --- linear re-scale (per PSAM) the maximum profile to 1.0
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                   with -motif, default to leading strand
                                   with -psam, default to PSAM setting

  Usage:
      (1) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -psam_list=psams.list
      (2) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -motif=aaaccct
      (3) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -motif=aaaga -ids='YAL001C;YAL002W' -column

Notes:

By default, the reported affinities take into account the threshold (default 0) specified on the command line. Thus, for each sliding window, an affinity below the threshold is neither output for that window nor added to the per-sequence sum (as in file seq_psam_thr.dat, see below).
AffinityProfile also outputs two files with fixed names, each containing the sum of affinities per sequence in a tab-delimited format that can be fed directly into Transfactivity: seq_psam.dat (which sums all affinities, even those below the threshold, for reference and comparison) and seq_psam_thr.dat (which takes the threshold into account).
Following each run of MatrixREDUCE or MotifREDUCE, two files with fixed names, psams.list and motifs.list, are also available; they can be fed into AffinityProfile, as shown in the first example above.
This is yet another example illustrating how the REDUCE Suite has been designed with tools that are seamlessly inter-connected to allow for great flexibility.
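
The thresholding described in the notes can be sketched as follows. This is a simplified Python illustration, not the AffinityProfile code: it assumes the usual PSAM model in which the relative affinity of a window is the product of the per-position weights, scans the forward strand only, and uses a made-up 3-position PSAM. All names (PSAM, window_affinity, affinity_profile) are hypothetical.

```python
# Hypothetical 3-position PSAM (not from any REDUCE run): one row of
# relative weights per position, with a maximum of 1 in each row.
PSAM = [
    {"A": 1.0, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.25, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.5},
]

def window_affinity(window, psam):
    """Relative affinity of one window: product of the per-position weights."""
    affinity = 1.0
    for base, weights in zip(window.upper(), psam):
        affinity *= weights.get(base, 0.0)  # assumption: unknown bases score 0
    return affinity

def affinity_profile(seq, psam, threshold=0.0):
    """Return (windows kept, sum over all windows, sum over kept windows)."""
    width = len(psam)
    profile = [window_affinity(seq[i:i + width], psam)
               for i in range(len(seq) - width + 1)]
    # Windows below the threshold are neither reported nor summed.
    kept = [(i, a) for i, a in enumerate(profile) if a >= threshold]
    return kept, sum(profile), sum(a for _, a in kept)

kept, total, total_thr = affinity_profile("ACGTCCT", PSAM, threshold=0.1)
print(kept)       # [(0, 1.0), (4, 0.25)]
print(total)      # 1.3125  (analogous to seq_psam.dat)
print(total_thr)  # 1.25    (analogous to seq_psam_thr.dat)
```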

Documentation / OptimizePSAM

« Last post by xiangjun on September 29, 2016, 12:41:18 pm »

This program performs a single point optimization of either an initial (pseudo-) PSAM or a seed motif against the measurement file and sequence. Internally, it uses exactly the same Levenberg-Marquardt non-linear least squares fitting algorithm as in MatrixREDUCE.

OptimizePSAM [options] -sequence=seqfile -measurement=measfile \
                       -psam=PSAM_file | -motif=IUPAC_Motif

  Required parameters:
    -sequence=file_name    --- name of a FASTA sequence file
    -meas=measfile         --- measurement (expression/binding) file in tab-delimited format
    -psam=PSAM_file        --- PSAM file to be optimized
    -motif=seed_motif      --- Seed IUPAC motif sequence to be optimized

  Optional parameters:
    [-output=dir_name]     --- path to the output directory (./)
    [-p_value=float]       --- threshold to decide if optimized PSAM is significant (0.001)
    [-filename=file]       --- name of the optimized PSAM
    [-strand=integer]      ---  1 |+1 |F | L for leading strand (1);
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                0 | A |D     auto-detection (check 1 and 2)
    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                               stdout or a file (stderr)
    [-help]                --- print out this help message

  Usage:
    OptimizePSAM \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -motif=ACGCGT -file=ACGCGT.psam

Notes:

In PSAM format, the initial seed motif ACGCGT is expressed as follows:

# A            C            G            T             # no. opt
# +============+============+============+============ # ==+===+==
  1            0            0            0             #   1   A
  0            1            0            0             #   2   C
  0            0            1            0             #   3   G
  0            1            0            0             #   4   C
  0            0            1            0             #   5   G
  0            0            0            1             #   6   T
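
The table above is simply a one-hot encoding of the seed motif. A minimal Python sketch, assuming a plain ACGT motif (no IUPAC degenerate symbols) and a hypothetical helper name, reproduces the same rows:

```python
BASES = "ACGT"

def motif_to_psam(motif):
    """One row per position: 1 for the motif base, 0 for the other three."""
    return [[1 if base == b else 0 for b in BASES] for base in motif.upper()]

seed = "ACGCGT"
for number, (row, base) in enumerate(zip(motif_to_psam(seed), seed), start=1):
    print(*row, "#", number, base)
```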

The optimized PSAM in file ACGCGT.psam is as follows. Note in particular how the weights (Ws) change from the 1s and 0s of the initial sequence motif (above) to fractional values in the optimized PSAM, with a maximum of 1 at each position.

# A            C            G            T             # no. opt
# +============+============+============+============ # ==+===+==
  1            0.143386     0.156974     0.332267      #   1   A
  2.38995e-06  1            0.133257     6.623e-18     #   2   C
  0.203947     0.0136109    1            6.43632e-11   #   3   G
  3.39323e-14  1            4.38946e-14  3.19148e-17   #   4   C
  0.0655988    0.122631     1            1.13119e-13   #   5   G
  0.422826     0.221149     0.182984     1             #   6   T

As shown in the following diagnostic message from running OptimizePSAM, this optimization step increases the fitted r2 from 0.0414328 to 0.0552883, and the resulting PSAM is significant (E-value 4.19638e-76, below the cutoff of 0.001).

Best seed experiment:
   number of tested candidate experiments: 18
   intercept: coef=-0.12248   t-value=-18.4713   p-value=5.77026e-74
   slope:     coef=+0.363975   t-value=+15.4353   p-value=1.18198e-52
   r2=0.0414328   SSY=1323.85   SSE=1269   SSR=54.8506
   matches[matched-ids/total-ids]: 348[307/5514]   experiment: alpha_factor_release_sample016 [4]
       and of sequence on forward strand
Optimizing:
     20 (1250.65): converged with gradient: 0.0349271 <= 0.05
PSAM linear fit statistics:
   intercept: coef=-0.186363   t-value=-23.1987   p-value=1.12291e-113
   slope:     coef=+0.325095   t-value=+17.9606   p-value=3.83258e-70
   r2=0.0552883   SSY=1323.85   SSE=1250.65   SSR=73.1932
Checking PSAM significance:
   |r|=0.235135   r0=0.0688157   sigma=0.00888813   t_value=18.7125
   E-value=4.19638e-76
   This PSAM is significant (E-value smaller than specific cutoff of 0.001)
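
The r2 values in the log can be cross-checked from the reported sums of squares, since for a least-squares fit with an intercept r2 = SSR/SSY and SSY = SSE + SSR. The short Python check below uses only the numbers printed above; small differences come from rounding in the log.

```python
def r_squared(ssr, ssy):
    """Fraction of the total sum of squares explained by the fit."""
    return ssr / ssy

SSY = 1323.85
seed_r2 = r_squared(54.8506, SSY)  # seed motif fit
opt_r2 = r_squared(73.1932, SSY)   # optimized PSAM fit

print(f"{seed_r2:.4f}")  # 0.0414
print(f"{opt_r2:.4f}")   # 0.0553
```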

Documentation / HTMLSummary

« Last post by xiangjun on September 27, 2016, 02:51:44 pm »

Following a MatrixREDUCE or MotifREDUCE run, a number of output files (model parameters, PSAMs etc.) are generated. The HTMLSummary utility program is provided to summarize the main results in an intuitive, easy-to-follow HTML page so that users (even non-experts) can quickly make sense of their findings, e.g., the top list of significant PSAMs (motifs). Internally, HTMLSummary makes system calls to the utility program LogoGenerator to create the logo images. These inter-connected programs, plus a few others, constitute the REDUCE Suite.

HTMLSummary [options] [-file=HTMLFile]

  Required parameters:
    none

  Optional parameters:
    [-output=dir_name]  --- path to a MatrixREDUCE/MotifREDUCE run directory,
                            used both as input and output for HTML summary (./)
    [-copy]             --- copy CSS, JavaScript and associated image files to
                            the output directory to make the HTML self-contained
    [-rc]               --- logo based on reverse complementary strand
    [-width=ThumbnailImageWidthInPixel]   (145)
    [-height=ThumbnailImageHeightInPixel] (90)
    [-psam_list=list_of_PSAMs] --- generate a summary LOGO page for the list
                                   of PSAMs in the file
    [-file=HTMLFile]    --- HTML output file name (index.html)

  Usage: (following a MatrixREDUCE run)
    HTMLSummary
    HTMLSummary -psam=psams.list -file=psams.html

Documentation / MatrixREDUCE

« Last post by xiangjun on September 27, 2016, 02:43:34 pm »

MatrixREDUCE uses genome-wide occupancy data for a transcription factor (e.g. ChIP-chip or mRNA expression data) and associated nucleotide sequences to discover the sequence-specific binding affinity of the transcription factor. The sequence specificity of the transcription factor's DNA-binding domain is modeled using a position-specific affinity matrix (PSAM), representing the change in the binding affinity (Kd) whenever a specific position within a reference binding sequence is mutated.

MatrixREDUCE [options] -sequence=seqfile -measurement=measfile

  Required parameters:
    -sequence=seqfile     --- sequence file in FASTA format
    -meas=measfile        --- measurement (expression/binding) file in tab-delimited format

  Optional parameters:
    [-topo_list=topofile]  --- name of topology file (up_to_octamers)
    [-topo=topology]       --- single topology pattern, e.g., X3--X4
    [-multifit]            --- switch to seed/optimize using all experiments
                                   [added based on code from Pilar -- thanks!]
    [-dicfile=file]        --- list of motifs to check against. IUPAC wild cards
                                   allowed; no length limit
    [-ntop=integer]        --- number of top seed motifs to print out (10)
    [-iupac_pos=integer]   --- number of positions to check for IUPAC degeneracy (0)
    [-iupac_sym=string]    --- IUPAC symbols to check against ('KMRSWYBDHVN')

    [-output=dir_name]     --- path to the output directory (./)
    [-p_value=float]       --- threshold to stop looking for new PSAMs (0.001)
    [-max_motif=integer]   --- maximum # of PSAMs to search (20)
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                0 | A |D     auto-detection (check 1 and 2)

    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                                   stdout or a specific file (stderr)
    [-help]                --- print out this help message

  Usages:
    mkdir -p results
    MatrixREDUCE \
       -meas=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -topo_list=$REDUCE_SUITE/data/topology/up_to_octamers -o=results
    HTMLSummary -o=results

    mkdir -p X6
    MatrixREDUCE \
       -meas=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -topo=X6 -o=X6
    HTMLSummary -c -o=X6

Documentation / MotifREDUCE

« Last post by xiangjun on September 27, 2016, 02:39:42 pm »

REDUCE is an acronym that stands for Regulatory Element Detection Using Correlation with Expression. Based on a simple model for transcriptional regulation by independently acting transcription factors, REDUCE makes it possible to find regulatory elements based on a single microarray experiment. MotifREDUCE in the REDUCE Suite is a more robust and efficient reimplementation of the "original REDUCE algorithm" by Bussemaker et al. (2001).

MotifREDUCE [options] -sequence=seqfile -measurement=measfile

  Required parameters:
    -sequence=seqfile     --- sequence file in FASTA format
    -meas=measfile        --- measurement (expression/binding) file in tab-delimited format

  Optional parameters:
    [-topo_list=topofile]  --- name of topology file (up_to_octamers)
    [-topo=topology]       --- single topology pattern, e.g., X3--X4
    [-dicfile=file]        --- list of motifs to check against. IUPAC wild cards
                                   allowed; no length limit
    [-ntop=integer]        --- number of top seed motifs to print out (10)
    [-iupac_pos=integer]   --- number of positions to check for IUPAC degeneracy (0)
    [-iupac_sym=string]    --- IUPAC symbols to check against ('KMRSWYBDHVN')

    [-output=dir_name]     --- path to the output directory (./)
    [-p_value=float]       --- threshold to stop looking for new motifs (0.001)
    [-max_motif=integer]   --- maximum # of motifs to search (20)
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                0 | A |D     auto-detection (check 1 and 2)

    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                                   stdout or a specific file (stderr)
    [-help]                --- print out this help message

  Usage:
    mkdir -p results   # use topology file (up_to_heptamers)
    MotifREDUCE \
        -meas=$REDUCE_SUITE/examples/MotifREDUCE/yeast_sample.csv \
        -sequence=$REDUCE_SUITE/examples/MotifREDUCE/genome5pns600.fasta \
        -topo_list=$REDUCE_SUITE/examples/MotifREDUCE/up_to_heptamers \
        -o=results
    HTMLSummary -c -o=results

Notes:

The command-line user interface of MotifREDUCE is identical to that of MatrixREDUCE, but MotifREDUCE skips the Levenberg-Marquardt non-linear least-squares optimization of the weight matrix (Ws). The result is a list of motifs, which are expressed in matrix form with 1s and 0s.
On a contemporary laptop computer, the above example dataset takes ~10 s and finds 10 significant motifs.
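
The core idea, correlating motif occurrences with measurements, can be sketched for a single motif as below. This is a toy Python illustration under stated assumptions, not the MotifREDUCE implementation: it counts exact matches on the forward strand only, fits one univariate line, and uses invented sequences and expression values. The real program also screens many candidate motifs and tests significance, which is omitted here.

```python
def count_motif(seq, motif):
    """Count (possibly overlapping) exact matches on the given strand."""
    seq, motif = seq.upper(), motif.upper()
    return sum(seq.startswith(motif, i)
               for i in range(len(seq) - len(motif) + 1))

def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns r2 too."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ssy = sum((b - mean_y) ** 2 for b in y)
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, 1.0 - sse / ssy

# Invented toy data: upstream sequences and one expression value per gene.
sequences = {"g1": "TTACGCGTAA", "g2": "ACGCGTACGCGT",
             "g3": "TTTTTTTT", "g4": "GGACGCGTCC"}
expression = {"g1": 1.0, "g2": 2.0, "g3": 0.0, "g4": 1.0}

ids = sorted(sequences)
counts = [count_motif(sequences[g], "ACGCGT") for g in ids]
intercept, slope, r2 = linear_fit(counts, [expression[g] for g in ids])
print(counts)  # [1, 2, 0, 1]
```

In this contrived dataset the expression values equal the motif counts, so the fit is exact; real data give far smaller r2 values, as in the OptimizePSAM log.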
