Messages - xiangjun

16
General Discussion / Re: What do the bases mean in PSAM?
« on: December 01, 2016, 10:56:56 am »
Dear Pan Shen,

Thanks for using the REDUCE Suite and for asking your questions on the Forum.

The W in the converted PSAM notation means A or T (Weak, since the A-T Watson-Crick pair has two H-bonds, compared to three in a G-C pair). Not surprisingly, S (for Strong) represents G or C.

More details on "Nucleic acid notation" can be found on Wikipedia, among many other online resources.
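
For quick reference, the standard IUPAC nucleotide codes can be written out as a small lookup table. This is only an illustrative Python sketch of the notation itself, not code from the REDUCE Suite; the helper name expand is made up for this example.

```python
# Standard IUPAC nucleotide codes (DNA alphabet). W and S are the two
# discussed above: Weak = A/T, Strong = G/C.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand(symbol):
    """Return the sorted list of bases that one IUPAC symbol stands for."""
    return sorted(IUPAC[symbol.upper()])

print(expand("W"))  # ['A', 'T']
print(expand("S"))  # ['C', 'G']
```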

Hope this helps.

Xiang-Jun

17
Documentation / Other utility programs
« on: September 29, 2016, 01:27:04 pm »
The REDUCE Suite distribution also includes the following auxiliary programs. Simply typing the corresponding program name with -h (e.g., Convert2PSAM -h) should provide sufficient information to get one started.

Convert2PSAM
As its name suggests, Convert2PSAM is a utility program that converts other commonly used motif (pattern) representations in nucleic acid sequences to PSAM, which is unique to the REDUCE Suite. It can also be used to standardize the various formats to a simplified PWM format for easy communication.

Topo2Dictfile
The default topological pattern mechanism can be used to specify sequence motifs in a compact, convenient, and flexible way. However, it defines the motifs implicitly, has a length limit (15 non-gap positions), and does not take the IUPAC degenerate symbols into consideration. As an example, X6 stands for exactly 4^6 = 4096 combinations, from AAAAAA, AAAAAC, ... to TTTTTT. Sometimes, we may need more control by specifying the motifs explicitly in a dictionary file, with arbitrary length and IUPAC symbols. This can be facilitated by Topo2Dictfile: first generate a motif dictionary according to user-specified topological patterns, and then edit it as needed, e.g., deleting some motifs, adding more, or introducing IUPAC degeneracy symbols.
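
To make the X6 count concrete, here is a small Python sketch that enumerates every hexamer over A, C, G, T. It only illustrates what an ungapped pattern such as X6 implicitly stands for; it is not how Topo2Dictfile is implemented, and the function name all_kmers is invented for this example.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Yield every length-k word over the alphabet in lexicographic order."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)

hexamers = list(all_kmers(6))
print(len(hexamers))                           # 4096
print(hexamers[0], hexamers[1], hexamers[-1])  # AAAAAA AAAAAC TTTTTT
```

An explicit list like this is the kind of starting point one could then prune or extend by hand, which is the workflow described above.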

ProcessFASTA
ProcessFASTA is a simple utility program to process a sequence file in FASTA format, e.g., to select a list of sequences based on ids, convert to the reverse complement, or combine id and sequence into one line. While such functionality is surely available in various heavy-duty toolboxes/environments (BioPerl, EMBOSS, BioConductor etc.), none fits our needs perfectly. We have thus developed this simple utility program mainly for our own convenience.
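
As an illustration of one of the operations mentioned above, the reverse complement of a DNA sequence can be sketched in a few lines of Python. This is a generic sketch, not ProcessFASTA's code; it handles only A, C, G, T (upper or lower case) and leaves any other character unchanged.

```python
# Complement table for plain DNA; IUPAC degenerate symbols are left out
# to keep the sketch short.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("AAACCCT"))  # AGGGTTT
print(reverse_complement("ACGCGT"))   # ACGCGT (this motif is palindromic)
```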

ProcessTdat
This is a simple utility program to process a tab-delimited text file, e.g., to extract a subset, perform a log transformation, or sort entries by id. It was created following the same idea as for ProcessFASTA.

ExtractWindows
A simple utility program to extract sequence fragments from a sequence file, probably of a chromosome.

psamdir2list
A Perl utility program to generate a list of the PSAMs in a given directory. The resultant list can be fed into AffinityProfile or Transfactivity.

18
Documentation / Transfactivity
« on: September 29, 2016, 01:04:45 pm »
Transfactivity is a utility program that performs multiple-linear regressions of measurements (gene expression or binding data) against affinities. As with AffinityProfile, the affinities can be deduced either from a list of PSAMs or IUPAC motifs, or a single PSAM (-psam=one_PSAM_file) or an IUPAC motif (-motif=one_IUPAC_motif) specified directly on the command-line. The PSAMs can be from a MatrixREDUCE or MotifREDUCE run, or a collection of pseudo-PSAMs from literature (as in the $REDUCE_SUITE/data/PSAMs/ directory).

Transfactivity [options] -sequence=seqfile -measurement=measfile \
                         -psam=one_PSAM_file | -psam_list=list_of_PSAMs |
                         -motif=one_IUPAC_motif | -motif_list=list_of_motifs

  Required parameters:
    -sequence=seqfile      --- sequence file in FASTA format
    -measurement=measfile  --- measurement data file in tab-delimited format
    -psam=one_PSAM_file    --- file name of one PSAM
    -psam_list=list_of_PSAMs --- file name containing a list of PSAMs
    -motif=IUPAC_motif     --- one IUPAC motif
    -motif_list=list_of_motifs --- file name containing a list of IUPAC motifs

  Optional parameters:
    [-damid]               --- short-hand form for -motif=GATC
    [-output=dir_name]     --- path to the output directory (./)
    [-copy]                --- copy CSS, JavaScript and image files to the above
                               output directory to make the HTML self-contained
    [-univariate]          --- switch to run univariate fit only
    [-acgt]                ---  i.e., -motif_list=$REDUCE_SUITE/data/acgt.dat
    [-resid_file=file_name] --- name of residuals
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                   with -motif, default to leading strand
                                   with -psam, default to PSAM setting
    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                               stdout or a specific file (stderr)
    [-help]                --- print out this help message

Usage:
    Transfactivity \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -psam_list=psams.list

    Transfactivity \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -motif_list=motifs.list


Note:

Given a directory that contains all the PSAMs one is interested in, the PSAM-list file can be conveniently generated with the Perl script "psamdir2list". This trick applies to Transfactivity here as well as to AffinityProfile.

For example, the PSAM list in $REDUCE_SUITE/examples/Transfactivity/MacIsaac.list was generated as:

Code:
  # Within directory $REDUCE_SUITE/examples/Transfactivity
  psamdir2list ../../data/PSAMs/MacIsaac MacIsaac.list

As another example, the Jaspar PSAM list can be generated as:

Code:
  psamdir2list $REDUCE_SUITE/data/PSAMs/Jaspar jaspar_psam.lst
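
The idea behind Transfactivity, regressing measurements on affinities, can be illustrated for the simplest one-predictor case. The Python sketch below is a textbook ordinary least-squares fit on made-up toy numbers; it is not Transfactivity's implementation, which fits several predictors at once, and the function name linear_fit is hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ssy = sum((yi - mean_y) ** 2 for yi in y)      # total sum of squares
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2   # residual sum of squares
              for xi, yi in zip(x, y))
    return intercept, slope, 1.0 - sse / ssy

# Toy data (made up): total affinity per sequence and one measurement per sequence.
affinity = [0.0, 1.0, 2.0, 3.0]
measurement = [0.1, 0.9, 2.1, 2.9]
intercept, slope, r2 = linear_fit(affinity, measurement)
print(intercept, slope, r2)
```

In this reading, the slope plays the role of the activity attributed to the factor in that experiment, and r2 measures how much of the variation in the measurements the affinities explain.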

19
Documentation / AffinityProfile
« on: September 29, 2016, 12:54:18 pm »
AffinityProfile is designed to scan a sequence file against a list of PSAMs or IUPAC motifs to get single-base-resolution affinity profiles. For convenience, it also allows for a single PSAM (-psam=one_PSAM_file) or an IUPAC motif (-motif=one_IUPAC_motif) to be specified directly on the command line. The PSAMs can be from a MatrixREDUCE or MotifREDUCE run, or a collection of pseudo-PSAMs from the literature (as in the $REDUCE_SUITE/data/PSAMs/ directory).

AffinityProfile [options] -sequence=seqfile \
                         -psam=one_PSAM_file | -psam_list=list_of_PSAMs |
                         -motif=one_IUPAC_motif | -motif_list=list_of_motifs

  Required parameters:
    -sequence=file_name  --- name of sequence file in FASTA format
    -psam=one_PSAM_file    --- file name of one PSAM
    -psam_list=list_of_PSAMs --- file name containing a list of PSAMs
    -motif=IUPAC_motif     --- one IUPAC motif
    -motif_list=list_of_motifs --- file name containing a list of IUPAC motifs

  Optional parameters:
    [-threshold=float]   --- threshold of affinity for output (0.0)
    [-output=dir_name]   --- path to the output directory (./)
    [-prefix=string]     --- prepended to output profile name (aff_)
    [-affsum=string]     --- file of total affinity per sequence (seq_psam.dat)
    [-detail]            --- also output detailed affinity along each sequence
    [-ids=string]        --- a ',' or ';' delimited list of IDs
    [-column]            --- used with -ids, set profile column-wise for each id
    [-normalize]         --- linear re-scale (per PSAM) the maximum profile to 1.0
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                   with -motif, default to leading strand
                                   with -psam, default to PSAM setting

  Usage:
      (1) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -psam_list=psams.list
      (2) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -motif=aaaccct
      (3) AffinityProfile -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
                          -motif=aaaga -ids='YAL001C;YAL002W' -column


Notes:

  • By default, the reported affinities take into account the threshold (default 0) specified on the command line. Thus, for each sliding window, an affinity below the threshold is neither output for that window nor added to the per-sequence sum (as in file seq_psam_thr.dat, see below).
  • AffinityProfile also writes two files with fixed names, each holding the sum of affinities per sequence in a tab-delimited format that can be fed directly into Transfactivity: seq_psam.dat (which sums all affinities, even those below the threshold, for reference and comparison) and seq_psam_thr.dat (which takes the threshold into account).
  • Following each run of MatrixREDUCE or MotifREDUCE, two files with fixed names, psams.list and motifs.list, are also available; either can be fed into AffinityProfile, as shown in the first example above.
  • This is yet another example illustrating how the REDUCE Suite has been designed with tools that are seamlessly inter-connected to allow for great flexibility.
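
The threshold behavior described in the notes above can be sketched in Python. Two assumptions are made here that the post itself does not spell out: the affinity of a window is taken to be the product of the PSAM weights of the bases observed at each position, and only the forward strand is scanned. All function names are invented for this illustration, and sequences are assumed to contain only A, C, G, T.

```python
BASES = "ACGT"

# Seed PSAM for ACGCGT (columns A, C, G, T), as listed in the OptimizePSAM post.
SEED_PSAM = [
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 0, 0, 1),
]

def window_affinity(window, psam):
    """Assumed model: multiply the weights of the observed base at each position."""
    affinity = 1.0
    for base, weights in zip(window.upper(), psam):
        affinity *= weights[BASES.index(base)]
    return affinity

def affinity_profile(seq, psam, threshold=0.0):
    """Slide the PSAM along seq; return (profile, total, total_thr)."""
    width = len(psam)
    profile, total, total_thr = [], 0.0, 0.0
    for start in range(len(seq) - width + 1):
        aff = window_affinity(seq[start:start + width], psam)
        total += aff              # analogous to seq_psam.dat: every window counts
        if aff >= threshold:      # below threshold: neither reported nor summed
            profile.append((start, aff))
            total_thr += aff      # analogous to seq_psam_thr.dat
    return profile, total, total_thr

toy_psam = [(1, 0.5, 0, 0), (0, 0.2, 1, 0)]  # made-up two-position PSAM
print(affinity_profile("AGCG", toy_psam, threshold=0.6))
# ([(0, 1.0)], 1.5, 1.0)
```

The two totals differ only by the windows that fall below the threshold, which mirrors the distinction between the two summary files.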

20
Documentation / OptimizePSAM
« on: September 29, 2016, 12:41:18 pm »
This program performs a single point optimization of either an initial (pseudo-) PSAM or a seed motif against the measurement file and sequence. Internally, it uses exactly the same Levenberg-Marquardt non-linear least squares fitting algorithm as in MatrixREDUCE.

OptimizePSAM [options] -sequence=seqfile -measurement=measfile \
                       -psam=PSAM_file | -motif=IUPAC_Motif

  Required parameters:
    -sequence=file_name    --- name of a FASTA sequence file
    -meas=measfile         --- measurement (expression/binding) file in tab-delimited format
    -psam=PSAM_file        --- PSAM file to be optimized
    -motif=seed_motif      --- Seed IUPAC motif sequence to be optimized

  Optional parameters:
    [-output=dir_name]     --- path to the output directory (./)
    [-p_value=float]       --- threshold to decide if optimized PSAM is significant (0.001)
    [-filename=file]       --- name of the optimized PSAM
    [-strand=integer]      ---  1 |+1 |F | L for leading strand (1);
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                0 | A |D     auto-detection (check 1 and 2)
    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                               stdout or a file (stderr)
    [-help]                --- print out this help message

  Usage:
    OptimizePSAM \
       -measurement=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -motif=ACGCGT -file=ACGCGT.psam


Notes:

  • In PSAM format, the initial seed motif ACGCGT is expressed as follows:
    # A            C            G            T             # no. opt
    # +============+============+============+============ # ==+===+==
      1            0            0            0             #   1   A
      0            1            0            0             #   2   C
      0            0            1            0             #   3   G
      0            1            0            0             #   4   C
      0            0            1            0             #   5   G
      0            0            0            1             #   6   T

  • The optimized PSAM in file ACGCGT.psam is as follows. Note in particular how the weights (Ws) change from the 1s and 0s of the initial seed motif (above) to fractional values, with a maximum of 1 at each position, in the optimized PSAM.
    # A            C            G            T             # no. opt
    # +============+============+============+============ # ==+===+==
      1            0.143386     0.156974     0.332267      #   1   A
      2.38995e-06  1            0.133257     6.623e-18     #   2   C
      0.203947     0.0136109    1            6.43632e-11   #   3   G
      3.39323e-14  1            4.38946e-14  3.19148e-17   #   4   C
      0.0655988    0.122631     1            1.13119e-13   #   5   G
      0.422826     0.221149     0.182984     1             #   6   T

  • As shown in the following diagnostic message from running OptimizePSAM, this optimization step increases the fitted R2 from 0.0414328 to 0.0552883, and the PSAM is significant.
    Best seed experiment:
       number of tested candidate experiments: 18
       intercept: coef=-0.12248   t-value=-18.4713   p-value=5.77026e-74
       slope:     coef=+0.363975   t-value=+15.4353   p-value=1.18198e-52
       r2=0.0414328   SSY=1323.85   SSE=1269   SSR=54.8506
       matches[matched-ids/total-ids]: 348[307/5514]   experiment: alpha_factor_release_sample016 [4]
           and of sequence on forward strand
    Optimizing:
         20 (1250.65): converged with gradient: 0.0349271 <= 0.05
    PSAM linear fit statistics:
       intercept: coef=-0.186363   t-value=-23.1987   p-value=1.12291e-113
       slope:     coef=+0.325095   t-value=+17.9606   p-value=3.83258e-70
       r2=0.0552883   SSY=1323.85   SSE=1250.65   SSR=73.1932
    Checking PSAM significance:
       |r|=0.235135   r0=0.0688157   sigma=0.00888813   t_value=18.7125
       E-value=4.19638e-76
       This PSAM is significant (E-value smaller than specific cutoff of 0.001)
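
The numbers in the diagnostic message above can be cross-checked with a few lines of Python. The relations r2 = SSR/SSY and SSY = SSE + SSR are standard for a linear fit. The last relation, t_value = (|r| - r0)/sigma, is only inferred here from the printed values and is not taken from the OptimizePSAM source code.

```python
import math

# (label, r2, SSY, SSE, SSR) copied from the log above.
fits = [
    ("seed", 0.0414328, 1323.85, 1269.0, 54.8506),
    ("optimized", 0.0552883, 1323.85, 1250.65, 73.1932),
]
for label, r2, ssy, sse, ssr in fits:
    # r2 should equal SSR / SSY, and SSE + SSR should add up to SSY
    # (up to the rounding of the printed values).
    print(label, round(ssr / ssy, 6), round(sse + ssr, 2))

# Values from the "Checking PSAM significance" lines.
abs_r, r0, sigma, t_reported = 0.235135, 0.0688157, 0.00888813, 18.7125
print(round(math.sqrt(0.0552883), 6))    # compare with |r|
print(round((abs_r - r0) / sigma, 4))    # compare with t_value (inferred relation)
```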


21
Documentation / HTMLSummary
« on: September 27, 2016, 02:51:44 pm »
Following a MatrixREDUCE or MotifREDUCE run, a number of output files (model parameters, PSAMs etc.) are generated. The HTMLSummary utility program summarizes the main results in an intuitive, easy-to-follow HTML page so that users (even non-experts) can quickly make sense of their findings, e.g., the top list of significant PSAMs (motifs). Internally, HTMLSummary makes system calls to the utility program LogoGenerator to create the logo images. These inter-connected programs, plus a few others, constitute the REDUCE Suite.

HTMLSummary [options] [-file=HTMLFile]

  Required parameters:
    none

  Optional parameters:
    [-output=dir_name]  --- path to a MatrixREDUCE/MotifREDUCE run directory,
                            used both as input and output for HTML summary (./)
    [-copy]             --- copy CSS, JavaScript and associated image files to
                            the output directory to make the HTML self-contained
    [-rc]               --- logo based on reverse complementary strand
    [-width=ThumbnailImageWidthInPixel]   (145)
    [-height=ThumbnailImageHeightInPixel] (90)
    [-psam_list=list_of_PSAMs] --- generate a summary LOGO page for the list
                                   of PSAMs in the file
    [-file=HTMLFile]    --- HTML output file name (index.html)

  Usage: (following a MatrixREDUCE run)
    HTMLSummary
    HTMLSummary -psam=psams.list -file=psams.html

22
Documentation / MatrixREDUCE
« on: September 27, 2016, 02:43:34 pm »
MatrixREDUCE uses genome-wide occupancy data for a transcription factor (e.g. ChIP-chip or mRNA expression data) and associated nucleotide sequences to discover the sequence-specific binding affinity of the transcription factor. The sequence specificity of the transcription factor's DNA-binding domain is modeled using a position-specific affinity matrix (PSAM), representing the change in the binding affinity (Kd) whenever a specific position within a reference binding sequence is mutated.

MatrixREDUCE [options] -sequence=seqfile -measurement=measfile

  Required parameters:
    -sequence=seqfile     --- sequence file in FASTA format
    -meas=measfile        --- measurement (expression/binding) file in tab-delimited format

  Optional parameters:
    [-topo_list=topofile]  --- name of topology file (up_to_octamers)
    [-topo=topology]       --- single topology pattern, e.g., X3--X4
    [-multifit]            --- switch to seed/optimize using all experiments
                                   [added based on code from Pilar -- thanks!]
    [-dicfile=file]        --- list of motifs to check against. IUPAC wild cards
                                   allowed; no length limit
    [-ntop=integer]        --- number of top seed motifs to print out (10)
    [-iupac_pos=integer]   --- number of positions to check for IUPAC degeneracy (0)
    [-iupac_sym=string]    --- IUPAC symbols to check against ('KMRSWYBDHVN')

    [-output=dir_name]     --- path to the output directory (./)
    [-p_value=float]       --- threshold to stop looking for new PSAMs (0.001)
    [-max_motif=integer]   --- maximum # of PSAMs to search (20)
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                0 | A |D     auto-detection (check 1 and 2)

    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                                   stdout or a specific file (stderr)
    [-help]                --- print out this help message

  Usages:
    mkdir -p results
    MatrixREDUCE \
       -meas=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -topo_list=$REDUCE_SUITE/data/topology/up_to_octamers -o=results
    HTMLSummary -o=results

    mkdir -p X6
    MatrixREDUCE \
       -meas=$REDUCE_SUITE/data/mRNA_expression/Spellman1998AlphaTimeCourse.tsv \
       -sequence=$REDUCE_SUITE/data/sequence/YeastUpstream.fasta \
       -topo=X6 -o=X6
    HTMLSummary -c -o=X6


23
Documentation / MotifREDUCE
« on: September 27, 2016, 02:39:42 pm »
REDUCE is an acronym that stands for Regulatory Element Detection Using Correlation with Expression. Based on a simple model for transcriptional regulation by independently acting transcription factors, REDUCE makes it possible to find regulatory elements based on a single microarray experiment. MotifREDUCE in the REDUCE Suite is a more robust and efficient reimplementation of the "original REDUCE algorithm" by Bussemaker et al (2001).

MotifREDUCE [options] -sequence=seqfile -measurement=measfile

  Required parameters:
    -sequence=seqfile     --- sequence file in FASTA format
    -meas=measfile        --- measurement (expression/binding) file in tab-delimited format

  Optional parameters:
    [-topo_list=topofile]  --- name of topology file (up_to_octamers)
    [-topo=topology]       --- single topology pattern, e.g., X3--X4
    [-dicfile=file]        --- list of motifs to check against. IUPAC wild cards
                                   allowed; no length limit
    [-ntop=integer]        --- number of top seed motifs to print out (10)
    [-iupac_pos=integer]   --- number of positions to check for IUPAC degeneracy (0)
    [-iupac_sym=string]    --- IUPAC symbols to check against ('KMRSWYBDHVN')

    [-output=dir_name]     --- path to the output directory (./)
    [-p_value=float]       --- threshold to stop looking for new motifs (0.001)
    [-max_motif=integer]   --- maximum # of motifs to search (20)
    [-strand=integer]      ---  1 |+1 |F | L for leading strand;
                                2 |+2 |B     for both strands;
                               -1 | R |C     for reverse complementary;
                                0 | A |D     auto-detection (check 1 and 2)

    [-runlog=[stderr|stdout|file]]
                           --- direct running diagnostics message to stderr,
                                   stdout or a specific file (stderr)
    [-help]                --- print out this help message

  Usage:
    mkdir -p results   # use topology file (up_to_heptamers)
    MotifREDUCE \
        -meas=$REDUCE_SUITE/examples/MotifREDUCE/yeast_sample.csv \
        -sequence=$REDUCE_SUITE/examples/MotifREDUCE/genome5pns600.fasta \
        -topo_list=$REDUCE_SUITE/examples/MotifREDUCE/up_to_heptamers \
        -o=results
    HTMLSummary -c -o=results

Notes:
  • The command-line user-interface of MotifREDUCE is identical to that of MatrixREDUCE, but skips the Levenberg-Marquardt non-linear least-squares optimization of weight matrix (Ws). The result is a list of motifs, which are expressed in matrix form with 1s and 0s.
  • The above example dataset takes ~10s and finds 10 significant motifs on a contemporary laptop computer.
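
The "matrix form with 1s and 0s" mentioned in the first note can be sketched as follows. This Python snippet only illustrates the representation for a plain A/C/G/T motif; it is not MotifREDUCE code, the name motif_to_matrix is made up, and degenerate IUPAC symbols are not handled.

```python
def motif_to_matrix(motif, bases="ACGT"):
    """One row per motif position, one column per base; 1 marks the motif base."""
    return [[1 if base == b else 0 for b in bases] for base in motif.upper()]

for row in motif_to_matrix("ACGCGT"):
    print(*row)
# 1 0 0 0
# 0 1 0 0
# 0 0 1 0
# 0 1 0 0
# 0 0 1 0
# 0 0 0 1
```

The rows match the seed PSAM for ACGCGT shown in the OptimizePSAM post.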

24
General Discussion / Re: Generate RNA logos with LogoGenerator
« on: September 26, 2016, 11:18:16 pm »
Hi Kate,

Yes. You need to edit file $REDUCE_SUITE/html/LogoGenerator_PS.def slightly, as shown below.

Current (default settings):
Code:
/colorDict <<
    (A) green       (a) m_green
    (C) blue        (c) m_blue
    (G) orange      (g) m_orange
    (T) red         (t) m_red
    (U) cyan        (u) m_cyan
    (X) white
>> def

Changing (U) cyan to (U) red will achieve what you asked. Of course, you can completely redefine the coloring scheme as you see fit in section "default color set".

A bit of background on the default colors: the A green, C blue, G orange, and T red coloring scheme follows WebLogo for DNA. On the other hand, the 3DNA software for 3D structures of DNA/RNA uses another convention: A red, C yellow, G green, T blue, and U cyan. See http://x3dna.org/articles/seeing-is-understanding-as-well-as-believing for some examples. So I ended up with cyan for U in RNA logos. Moreover, the REDUCE Suite Forum is based on the 3DNA Forum.

Best regards,

Xiang-Jun

25
General Discussion / Re: Generate RNA logos with LogoGenerator
« on: September 22, 2016, 10:29:37 am »
Hi Kate,

Thanks for using the REDUCE Suite and for posting your question on the Forum.

Yes, LogoGenerator is applicable to RNA as well as DNA. You just need to specify an additional -rna option on the command-line, and everything else should work as expected. Please have a try and report back any issues you may have.

Best regards,

Xiang-Jun


PS. The REDUCE Suite has more features than those documented (with the command-line "-h" option). The default settings are designed to work for its advertised purpose only.

26
General Discussion / Re: Defining Color in LogoGenerator
« on: August 11, 2016, 01:40:58 pm »
Hi,

Thanks for being the first to ask questions on the Forum! I'd like to clarify that the LogoGenerator utility is within the REDUCE Suite, which is a completely separate package from FeatureReduce.

As to your question of coloring some Cs (or other bases) in a separate color, the answer is no, at least in the current version of LogoGenerator. DNA/RNA logos, as far as I am aware, consist of 4 different bases, as shown in the 4 columns of a PWM or PSAM. To color certain bases differently, such info must be specified in some way in the file, which will break the format.

There are so many logo generators available. Maybe some of them are more advanced than the functionality LogoGenerator currently provides, and can do what you are asking for.

If you want to generate only a few logos, you could use LogoGenerator to generate the logos in EPS format. Then, using Adobe Illustrator, you can edit/modify them as you see fit.

Hope this helps.

Xiang-Jun


27
Documentation / LogoGenerator
« on: July 29, 2016, 04:20:35 pm »
The LogoGenerator utility program in the REDUCE Suite is a versatile and robust command-line tool that generates logo images in a variety of styles (raw data, frequency, conventional bit information, or affinity logo in ΔΔG). The input can be either a PSAM, or a multiple sequence alignment file in either FASTA or flat format. The output logo image is in EPS format and is converted to PNG format by default for display in a web page (as from HTMLSummary), using gs, the widely and freely available GhostScript tool. Other supported image formats include PDF, JPEG, and GIF (further utilizing the convert utility program from ImageMagick).

LogoGenerator [options] -file=PSAM_File

  Required parameters:
    -file=PSAM_file     --- name of the PSAM matrix file to get a logo

  Optional parameters:
    [-output=dir_name]  --- path to the output directory (./)
    [-logo=IMAGE_file]  --- name of output logo image file (base-filename.png)
    [-format=IMAGE_FMT] --- image format of the logo: eps|pdf|jpeg|png (png)
                            Note: o pdf, jpeg and png formats make use of "gs",
                                  o check 'pkg_settings.cfg' for settings
    [-style=LOGO_STYLE] --- style of the logo: raw|freq|bits_info|ddG (bits_info)
    [-type=INPUT_TYPE]  --- input type for generating the logo: PSAM|fasta|flat
                            If in 'fasta' or 'flat' format, the sequences must be
                            already aligned and of the same length (PSAM)
    [-title=string]     --- title string of the logo image
    [-width=float]      --- logo width in cm (12)
    [-height=float]     --- logo height in cm (7.5)
    [-ymin=float]       --- minimum value on the y-axis
    [-ymax=float]       --- maximum value on the y-axis
    [-frame]            --- switch for drawing a bounding box around the logo
    [-bw]               --- switch for black-and-white the logo image
    [-reverse_comp]     --- draw logo based on reverse complementary strand
    [-rna]              --- draw RNA logo (i.e., using U instead of T)

  Usage:
      LogoGenerator -file=$REDUCE_SUITE/data/formats/psam_ex.dat -logo=sample.png
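
For readers unfamiliar with the "conventional bit information" style, the classic sequence-logo calculation can be sketched in Python. This follows the textbook definition (column information = 2 bits minus the Shannon entropy, without any small-sample correction). How LogoGenerator derives its values from a PSAM is not stated in this post, so normalizing each column to sum to 1 is an assumption, and the function names are hypothetical.

```python
import math

def column_information(freqs):
    """Information content in bits of one logo column: 2 - Shannon entropy."""
    total = sum(freqs)
    probs = [f / total for f in freqs]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy

def letter_heights(freqs):
    """Letter heights in A, C, G, T order: probability times column information."""
    total = sum(freqs)
    info = column_information(freqs)
    return [f / total * info for f in freqs]

print(column_information([1, 0, 0, 0]))              # 2.0 (fully conserved)
print(column_information([0.25, 0.25, 0.25, 0.25]))  # 0.0 (no preference)
print(letter_heights([2, 2, 0, 0]))                  # [0.5, 0.5, 0.0, 0.0]
```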

28
Documentation / Summary of the REDUCE Suite v2.2 programs
« on: July 29, 2016, 04:08:51 pm »
The REDUCE Suite v2.2 contains a total of 12 programs, as outlined below. The software is distributed with full source code in ANSI C. For the benefit of users, precompiled binaries for the most common Linux, Mac OS X, and Windows operating systems are also provided. A command-line help message, which also includes sample usages to get you started, is available for each program via the -h option (e.g., LogoGenerator -h). If you have any questions about using the Suite, please do not hesitate to post them on the Forum.

Motif discovery and model building:

  • MotifREDUCE — An algorithm that builds a motif-based multivariate linear model. REDUCE is an acronym that stands for Regulatory Element Detection Using Correlation with Expression. Based on a simple model for transcriptional regulation by independently acting transcription factors (Bussemaker et al, 2001), REDUCE makes it possible to discover regulatory motifs based on a single microarray experiment. MotifREDUCE is a robust and efficient reimplementation of the original REDUCE algorithm. Required inputs are (i) a genome-wide set of measurements (mRNA expression log-ratios or ChIP fold-enrichments) and (ii) a nucleotide sequence associated with each measurement (e.g., upstream promoter sequence). Output are (i) a set of cis-regulatory oligonucleotide motifs, and (ii) the corresponding regression coefficients.
  • MatrixREDUCE — A more sophisticated algorithm that builds a multivariate linear model based on weight matrices (Foat et al., 2005, 2006). Required inputs are the same as for MotifREDUCE: (i) a genome-wide set of measurements (mRNA expression log-ratios or ChIP fold-enrichments) and (ii) a nucleotide sequence associated with each measurement (e.g., upstream promoter sequence). Outputs include (i) the binding specificity, in the form of a position-specific affinity matrix (PSAM), and (ii) the condition-specific concentration/activity for each of a set of trans-acting factors (TF).
  • OptimizePSAM — Fits PSAM parameters and coefficients for a single-TF model. MatrixREDUCE makes iterative calls to this program to build a multivariate model.
  • Transfactivity — Fits a multivariate linear model to one or more genome-wide sets of measurements. In contrast to MotifREDUCE/MatrixREDUCE, motifs/PSAMs are not inferred from the data, but instead are provided as inputs. This is useful for inferring changes in the (hidden) regulatory activity of one or more TFs of known binding specificity. Transfactivity is a contraction of "trans-factor" and "activity".

Visualization of results:
  • HTMLSummary — A utility for visualizing the result of a MatrixREDUCE or MotifREDUCE run in HTML format.
  • LogoGenerator — A versatile and robust command-line tool that generates logo images in a variety of styles (raw data, frequency, conventional bit information, or affinity logo in ΔΔG). The input can be a PSAM or a multiple sequence alignment file in either FASTA or flat format. The output logo image is in EPS format and is converted to PNG format by default for display in a web page (as from HTMLSummary), using gs, the widely and freely available GhostScript tool. Other supported image formats include PDF, JPEG, and GIF (further utilizing the convert utility program from ImageMagick).

Affinity-based sequence analysis:
  • AffinityProfile — Convert one or more DNA/RNA sequences to single-nucleotide resolution affinity profiles or a total regional affinity. A set of motifs and/or PSAMs is required as input.

Miscellaneous utilities:
  • Convert2PSAM — Convert commonly used motif (pattern) representations of nucleic acid sequences to PSAM format, which is unique to the REDUCE Suite. It also serves to standardize the various formats to a simplified PWM format for easy communication.
  • Topo2Dictfile — Generate a motif dictionary file according to user-specified topological patterns, allowing for easy user manipulation (deleting/adding specific motifs, introducing IUPAC degeneracy symbols, etc).
  • ProcessFASTA — Process a sequence file in FASTA format to select a list of sequences based on their IDs, convert to reverse complement, combine ID and sequence into a single line, etc.
  • ProcessTdat — Manipulate tab-delimited measurement files (extract a subset of experiments, perform log-transformation, sort entries by ID, etc).
  • ExtractWindows — Extract subsequences from larger sequences (e.g., a chromosome), based on a set of start/end coordinates.

29
Documentation / Set up the REDUCE Suite
« on: April 21, 2016, 11:13:30 am »
Starting from REDUCE Suite v2.2, we've streamlined the downloading process. Users just need to register on this Forum, and log in to see the member-only download section (at the upper-left corner). The Suite is distributed with the source code (in ANSI C) available. Moreover, for user convenience, we have compiled the Suite on common operating systems, including Mac OS X, Linux (64-bit), and Windows (via Cygwin).

Assuming you have at least basic knowledge of Linux (Unix/Mac OS X) and know how to use the shell, getting the REDUCE Suite up and running is a simple process. It involves setting an environment variable, REDUCE_SUITE, so the system knows where the Suite has been installed, and updating your command PATH so that you can run the associated programs conveniently. The whole process is further facilitated by the REDUCE_Suite_setup script, as shown below:

Code:
To install the REDUCE Suite, do as follows:
  (0) Download the Suite from http://reducesuite.bussemakerlab.org/.
      Note that you *must* register and log in to see the download page.
      Assuming your downloaded tarball is on Mac OS X:
          REDUCE-Suite-v2.2-macosx-intel.tar.gz

  (1) tar zxvf REDUCE-Suite-v2.2-macosx-intel.tar.gz
        This will create a directory named REDUCE-Suite-v2.2/

  (2) cd REDUCE-Suite-v2.2/
        You are now in the REDUCE-Suite-v2.2/ directory

  (3) >>> [optional] ONLY IF you compile REDUCE Suite from source <<<
        (3a) cd src/
        (3b) make
        (3c) cd ../   # back to REDUCE-Suite-v2.2/, as with step (2)

  (4) ./bin/REDUCE_Suite_setup    # assuming you are at directory: REDUCE-Suite-v2.2/
        To run the REDUCE Suite, you need to set up the followings:
          o the environment variable REDUCE_SUITE
          o add $REDUCE_SUITE/bin to your command line search path

        for your 'bash' shell, please add the following into ~/.bashrc:
          --------------------------------------------------------------
            export REDUCE_SUITE='/Users/xiangjun/Luxes/REDUCE_Suite'
            export PATH='/Users/xiangjun/Luxes/REDUCE_Suite/bin':$PATH
          --------------------------------------------------------------

         and then logout and login again, or run the following command:
              source ~/.bashrc

  (5) type MatrixREDUCE -h
           LogoGenerator -h
           HTMLSummary -h
      etc for command-line help and worked examples

  (6) Note: to use HTMLSummary for the summary page, you need to install
            GhostScript. See $REDUCE_SUITE/config/pkg_settings.cfg for
            setting path to the command 'gs'. LogoGenerator generates
            the logo image in EPS format, and uses 'gs' to convert EPS
            into PDF, PNG, or JPG. Additionally, with ImageMagick, you
            can also get the logo image in GIF.
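
If the programs are not found after step (4), it may help to check the two settings directly. The following Python sketch is not part of the Suite; the function name check_reduce_suite_env is hypothetical. It only tests that REDUCE_SUITE is set and that its bin/ subdirectory appears on PATH, which are the two requirements stated above.

```python
import os


def check_reduce_suite_env(environ):
    """Return a list of problems found in the given environment mapping.

    Hypothetical helper for illustration; not shipped with the REDUCE Suite.
    """
    problems = []
    root = environ.get("REDUCE_SUITE")
    if not root:
        problems.append("REDUCE_SUITE is not set")
        return problems

    # The setup instructions ask for $REDUCE_SUITE/bin on the search path.
    bin_dir = os.path.join(root, "bin")
    path_dirs = environ.get("PATH", "").split(os.pathsep)
    normalized = [os.path.normpath(d) for d in path_dirs if d]
    if os.path.normpath(bin_dir) not in normalized:
        problems.append(bin_dir + " is not on PATH")
    return problems


if __name__ == "__main__":
    # Check the environment of the current shell session.
    for message in check_reduce_suite_env(os.environ) or ["settings look fine"]:
        print(message)
```

The function takes the environment as an argument (rather than reading os.environ internally) so that it can be tried on a plain dictionary without changing your shell settings.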

The above instructions should help you get started with the Suite. If you have any questions, please do not hesitate to ask. The Forum has been created for any related questions, comments, or suggestions.

Xiang-Jun

30
Announcements / REDUCE Suite v2.2 is available
« on: February 15, 2016, 12:47:46 pm »
We are pleased to announce the release of the REDUCE Suite v2.2, a set of software tools to model the regulation of gene expression by transcription factors (TFs). By directly correlating genome-wide mRNA expression or TF binding data (e.g., ChIP-chip) with associated nucleotide sequences, the REDUCE Suite can discover the sequence-specific binding affinity of a TF from a single experiment, using all measurements simultaneously, and without using any "background" sequence model.

The REDUCE Suite of software programs has been developed and actively maintained by the Laboratory of Dr. Harmen Bussemaker at the Department of Biological Sciences, Columbia University in the City of New York. The suite has its origin in the REDUCE algorithm of Bussemaker et al. in Nature Genetics (2001), which pioneered the use of a motif-based linear regression model to discover cis-regulatory elements (motifs) and infer condition-specific transcription factor activities from a single genome-wide mRNA expression profile. Dr. Barrett Foat, a former graduate student in the Bussemaker Lab, extended REDUCE by adding an optimization procedure to obtain a so-called Position Specific Affinity Matrix (PSAM). He implemented his algorithm in a new program, MatrixREDUCE, using Perl and the GNU Scientific Library (GSL). Dr. Xiang-Jun Lu has completely rewritten and significantly enhanced the code using pure ANSI C to make each component program efficient and the whole package self-contained.
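
To give a rough feel for what "motif-based linear regression" means, here is a toy Python sketch. It is not the REDUCE or MatrixREDUCE implementation: it counts exact occurrences of a single motif in each sequence and fits a straight line between the counts and the measured values by ordinary least squares. The sequences, the measurements, and the function names are all made up for illustration. The actual programs do considerably more (e.g., the PSAM optimization mentioned above).

```python
# Toy illustration only: NOT the REDUCE/MatrixREDUCE code.
# All names and data below are invented for this sketch.


def count_motif(seq, motif):
    """Count (possibly overlapping) exact occurrences of motif in seq."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1)
               if seq[i:i + width] == motif)


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("motif counts do not vary across sequences")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up upstream sequences and made-up log-ratio measurements.
sequences = ["ACGCGT", "TTTTTT", "ACGCGTACGCGT", "GGGACG"]
log_ratios = [1.0, 0.0, 2.0, 0.0]

counts = [count_motif(s, "ACGCGT") for s in sequences]
intercept, slope = fit_line(counts, log_ratios)
```

In this toy picture, the slope measures how strongly the motif count tracks the measurement across all sequences at once, which is the sense in which all measurements are used simultaneously.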

Following the release of MatrixREDUCE v1.0 in late 2006, we have made many significant additions and improvements to the software based on extensive feedback from within and outside the lab. Specifically, we have greatly improved the calculation of P-values based on a heuristic null model described in Foat et al. (2008), and developed a versatile topology-based approach to specify motif patterns. Moreover, we have implemented MotifREDUCE as a standalone, yet more robust and efficient, replacement of the original REDUCE program, and created a command-line driven, general-purpose DNA/RNA-related LogoGenerator. Overall, the suite now consists of more than ten standalone, yet interconnected, programs. To better reflect both its roots and its new versatile functionality, the package has been renamed the "REDUCE Suite", currently at version 2.2.

We understand that getting a scientific software tool published is just the beginning; in the long run, it is the continuous refinement and adaptation to the changing world that keep a software suite alive. As a matter of fact, the REDUCE Suite contains many unpublished features. Moreover, while the standard "no warranty" applies, we stand firmly behind the software. We strive to get back to your questions, suggestions, and bug reports quickly and concretely on the Forum. Browsing the Forum should convince you of our dedication to the REDUCE Suite!

The C source code is in the src/ directory of each tarball. Please refer to the post titled "Set up the REDUCE Suite" on how to compile the code yourself. All REDUCE Suite related questions are welcome on the Forum (only). Do not be shy in sharing openly, but CONCRETELY, any difficult/negative experiences you may have in installing or using the software. By asking your questions on the public Forum, you're benefiting not only yourself but also the user community.

Welcome to the REDUCE Suite, and we look forward to communicating with you on the Forum.

Xiang-Jun Lu & Harmen Bussemaker


PS: The REDUCE Suite v2.2 is stable: its key code components and functionality, without material changes, have been extensively tested and used in real-world applications for nearly a decade. Due to my commitment to the NIH-funded 3DNA/DSSR project, I am supporting the REDUCE Suite in maintenance mode. No new features are planned, but I will promptly address any REDUCE Suite related questions/bugs, exclusively, via this open Forum.


Note added on July 17, 2018: FeatureREDUCE is not included in the suite and it is not supported on the Forum.

Created and maintained by Dr. Xiang-Jun Lu [律祥俊]. See also http://forum.x3dna.org and http://x3dna.org