IDP

  • IDP is a gene Isoform Detection and Prediction tool
    that uses Second Generation Sequencing and PacBio sequencing data.
    It offers very reliable gene isoform identification
    with high sensitivity.

Latest News: IDP 0.1.6 major update is released.

Manual

Installation

No explicit installation is required for IDP; however, it does have some prerequisites.

Long reads should be corrected with LSC first, and the corrected.fa and full.fa files should be concatenated into a single FASTA file for use in IDP, i.e.:
$ cat full.fa corrected.fa > LR.fa

Short reads should be aligned to the reference genome using SpliceMap. An alignment (.sam) file and a SpliceMap-format bed file are required.

Please see IDP requirements for more details about required software.

Using IDP

Firstly, see the tutorial on how to use IDP on some example data.

In order to use IDP on your own data:

  1. Create an empty directory; this will be the working directory. An example directory with a manifest of the necessary data files can be downloaded here: Example working directory.
  2. Copy "run.cfg" from the IDP package to the working directory.
  3. Edit run.cfg to include the paths to the data files, the temp folder, and the output folder. You may also want to adjust the default settings. A reference annotation can be used in IDP for constructing isoform candidates, and this construction process needs a few parameters of its own. For these parameters, the example run.cfg in the bin/ directory of the program contains the optimal settings for the human transcriptome. If you want to run IDP on another species, please contact Kin Fai for suggestions.
  4. Edit "constants.py" to set a few "data parameters". Please see the section "Module: constants.py" below for more details.
  5. Execute "/home/user/IDP_path/runIDP.py run.cfg mode_number" while in your working directory, or execute "runIDP.py run.cfg mode_number" if all IDP executable files are in the default bin. "mode_number" can be 1, 2, or 3. Please see how to select mode_number in the section "Module: runIDP.py" below.
  6. Execution will conclude after some time. You can find the results in the "output" directory.
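
Step 5 can also be scripted. The following is a minimal Python sketch, not part of the IDP package: the helper names build_idp_command and run_idp are hypothetical, and the only things taken from this manual are the script name runIDP.py, the argument order (configuration file, then mode_number), and the allowed mode numbers 1, 2, and 3.

```python
import os
import subprocess

VALID_MODES = (1, 2, 3)  # mode numbers listed in this manual


def build_idp_command(idp_bin_dir, cfg_path, mode_number):
    """Return the argument list for runIDP.py, using the full path to the script.

    Hypothetical helper for illustration; it only assembles the command.
    """
    if mode_number not in VALID_MODES:
        raise ValueError("mode_number must be 1, 2 or 3")
    script = os.path.join(os.path.abspath(idp_bin_dir), "runIDP.py")
    return [script, cfg_path, str(mode_number)]


def run_idp(idp_bin_dir, working_dir, mode_number, cfg_name="run.cfg"):
    """Run IDP from inside the working directory, as step 5 requires."""
    cmd = build_idp_command(idp_bin_dir, cfg_name, mode_number)
    # check=True raises CalledProcessError if runIDP.py exits with an error.
    return subprocess.run(cmd, cwd=working_dir, check=True)
```

For example, run_idp("/home/user/IDP_path", "/path/to/working_dir", 1) would start IDP in mode 1 with the run.cfg found in that working directory (both paths are placeholders).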

Module: runIDP.py

"runIDP.py" is the main program in the IDP package. It calls other modules to run the full isoform identification and quantification on your data. Output is written to the "output" folder. Details of the output are described in file formats. Its options are described in run.cfg. You just need to run "runIDP.py" with a configuration file "run.cfg" and a mode_number:

/home/user/IDP_path/runIDP.py run.cfg mode_number
Please always type the full path of the bin folder in the command line. In this example, you need "/home/user/IDP_path/runIDP.py" instead of "./runIDP.py" or "runIDP.py".

Input files: long reads

IDP accepts long-read sequence or alignment files and short-read alignment files as input.
The file locations and their formats should be set in the run.cfg file.


Using long reads that have first been corrected by LSC is advised.
Please see our guide on creating a long read set.
The corrected.fa and full.fa files should be combined because corrected.fa loses some flanking sequences of the long reads that were not corrected by short reads, and those regions may still contain informative junctions. If we used only corrected.fa, we could lose this information. full.fa includes those flanking regions in addition to the corrections that were made. However, if we used only full.fa, the IDP algorithm would likely throw out many of those long reads for lacking short read support for junctions in those regions. If you combine the two datasets, you will not suffer any loss of information, and any redundancies will be handled by IDP.


Three types of long read data can be input: 1. alignment in gpd format, 2. alignment in psl format and 3. sequences of raw data and reference genome:

1. alignment in gpd format. If the long read alignment was done by BLAT and is in psl format, you can select the best alignment of each long read with the IDP module blat_best.py and then use psl2genephed.py to convert the psl file to a gpd file. "skip_Nine" in both modules can help you skip the header lines. Alternatively, if you have already run IDP on a given long read dataset, you can copy the file "LR.gpd" from the temp folder for use in another run.

2. alignment in psl format. You can BLAT your long reads to the reference genome; the output is in psl format. However, you need to select one (the best) alignment for each long read. The IDP module blat_best.py can help you select the alignment with the most bases mapped. "skip_Nine" in blat_best.py can help you skip the header lines.

3. sequences of raw data and reference genome. You can input the long read sequences (FASTA format) and the reference genome (FASTA format). IDP can run the BLAT alignment and the remaining steps for you. This is much handier and only takes a bit more running time. If you know the adapter/primer sequences at the 5'/3' ends, IDP can trim them, but you need to input their homopolymer-compressed sequences.
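
To illustrate what "select one (the best) alignment for each long read" means for a psl file, here is a minimal Python sketch. It is not the code of blat_best.py. It assumes the standard 21-column PSL layout (column 1 is the number of matching bases, column 10 is the query name) and treats the match count as "bases mapped". The function name best_psl_alignments is hypothetical.

```python
def best_psl_alignments(lines):
    """Keep, for each long read, the PSL line with the most matching bases.

    Illustrative sketch only; ties are resolved in favour of the first
    alignment seen, and reads are returned in order of first appearance.
    """
    best = {}  # query name -> (matches, line without trailing newline)
    for line in lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        # Header lines lack 21 tab-separated columns or a leading integer.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches = int(fields[0])  # PSL column 1: matches
        qname = fields[9]         # PSL column 10: qName
        if qname not in best or matches > best[qname][0]:
            best[qname] = (matches, stripped)
    return [kept for _, kept in best.values()]
```

A file handle can be passed directly, for example best_psl_alignments(open("LR.fa.psl")), where the file name is a placeholder.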
##
# Long reads files
# You can input one of three types of long-read data: 
	# 1. Long-read alignments on reference genome in GPD format. 
	# If you have run IDP on the same data, you can use LR.gpd in the previous temp folder.
	# For more info on the GPD format, please check http://www.healthcare.uiowa.edu/labs/au/IDP/IDP_gpd_format.asp

LR_gpd_pathfilename = /home/kinfai/3seq/IDP_0.1/test_data/no53primer_LR.gpd

	# or
	# 2. Long-read alignments on reference genome in PSL format (BLAT output format). 
	# If the PSL file only contains a unique alignment for each long read, then set psl_type = 1. Otherwise, set psl_type = 0

LR_psl_pathfilename = /home/kinfai/3seq/IDP_0.1/test_data/no53primer_LR.fa.psl
psl_type = 1
     
	# or
	# 3. Long reads in FASTA format (LSC corrected data is preferred) and reference genome in FASTA format. 
	# IDP will run BLAT to align long reads to reference genome.
	# (Optional) If primer sequence at 5' end (five_primer) or 3' end (three_primer) is input, IDP can trim the primer sequences off.
	# Primer sequence must be homopolymer compressed. E.g. if the original sequence is AACCCTTGGGG, then you should input ACTG
	
LR_pathfilename = /home/kinfai/3seq/IDP_0.1/test_data/test.fa
genome_pathfilename = /home/kinfai/3seq/IDP_0.1/test_data/genome.fa
five_primer = AGTACTCTG
three_primer = CGCAGAGTAC
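
Homopolymer compression, as required for five_primer and three_primer, collapses every run of identical bases to a single base. The following Python sketch shows one way to compute it. The function name homopolymer_compress is hypothetical and is not an IDP module. The comparison is case-sensitive, so mixed-case input should be normalised first.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to a single base."""
    # groupby yields one (base, run) pair per run of consecutive equal characters.
    return "".join(base for base, _run in groupby(seq))


print(homopolymer_compress("AACCCTTGGGG"))  # ACTG, the example given in run.cfg
```

The example primers above (AGTACTCTG and CGCAGAGTAC) contain no repeated adjacent bases, so they are unchanged by this function.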
	

Input files: short reads

Two short read data files are REQUIRED: 1. alignment in sam format, and 2. junction detection in bed format.

The file locations and their formats should be set in the run.cfg file.

You can generate these two files very easily with SpliceMap or other RNA-seq aligners, such as TopHat and MapSplice. At the moment, however, IDP only supports junctions in the format produced by SpliceMap.

##
# Short reads files
# You need to input two short-read data files:

SR_jun_pathfilename = /home/kinfai/3seq/IDP_0.1/test_data/junction_color.bed
SR_sam_pathfilename = /home/kinfai/3seq/IDP_0.1/test_data/good_hits.sam
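
Before a long run it can help to check that both short-read entries are present in run.cfg. The sketch below is a simplified stand-in, not IDP's own configuration reader. It assumes only what the excerpts above show: "key = value" lines, with "#" starting a comment line. The function names parse_run_cfg and missing_short_read_keys are hypothetical.

```python
REQUIRED_SHORT_READ_KEYS = ("SR_jun_pathfilename", "SR_sam_pathfilename")


def parse_run_cfg(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comment lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            settings[key.strip()] = value.strip()
    return settings


def missing_short_read_keys(settings):
    """Return the required short-read keys that are absent or empty."""
    return [key for key in REQUIRED_SHORT_READ_KEYS if not settings.get(key)]
```

For example, missing_short_read_keys(parse_run_cfg(open("run.cfg").read())) returns an empty list when both paths are set.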

Output files

There are four output files in the output folder: isoform.gpd, isoform.exp, isoform_detection.gpd, and isoform_prediction.gpd.

Execution Time

The following execution times are rough estimates based on the running times on our servers with 20 threads. These figures will differ greatly depending on your data size and your system configuration.