- RNA-seq and Third Generation Sequencing
- Stem cell transcriptome analysis
- We are interested in methodology research for Third Generation Sequencing (TGS), especially PacBio and Oxford Nanopore sequencing.
The Au lab works on both hybrid sequencing (Second Generation Sequencing (SGS) + TGS) and TGS-alone methodology research.
Our research interests include, but are not limited to, alternative splicing, isoform construction, gene fusion and quantitative analysis.
- The Au lab is applying the hybrid sequencing method to ESCs, iPSCs and preimplantation embryos to study in depth the transcriptome differences between stem cells.
- Protein identification and novel splice detection from tandem Mass Spec are among our research interests. The Au lab is developing statistical methods for the integration of Mass Spec and sequencing data in order to solve difficult proteomics problems.
Collaborators
Currently, we have very close ongoing collaborations with:
- Wing H. Wong's Lab at Stanford
- Pacific Biosciences (PacBio)
- Renee Reijo Pera's Lab at Stanford
- Jack H. Wong's Lab at Houston Methodist Research Institute
Human tRNA synthetase catalytic nulls with diverse functions.
Lo, W.S., Gardiner E., Xu, Z., Lau C.F., Wang, F., Zhou, J.J., Mendlein, J.D., Nangle, L.A., Chiang, K.P., Yang, X.L., Au, Kin Fai, Wong, W.H., Guo, M., Zhang, M., Schimmel, P.
Science. 2014 Jul 18. 345 (6194), 328-32 [Manuscript]
ALS-associated mutation FUS-R521C causes DNA damage and RNA splicing defects
Qiu H., Lee S., Shang Y., Wang W.Y., Au, Kin Fai, Kamiya S., Barmada S.J., Finkbeiner S., Lui H., Carlton C.E., Tang A.A., Oldham M.C., Wang H., Shorter J., Filiano A.J., Roberson E.D., Tourtellotte W.G., Chen B., Tsai L.H., Huang E.J.
The Journal of clinical investigation. 2014 Feb 10. 124 (3), 0-0 [Manuscript]
Characterization of the human ESC transcriptome by hybrid sequencing
Au, Kin Fai, Sebastiano V., Afshar P.T., Durruthy J.D., Lee L., Williams B.A., Bakel H.V., Schadt E., Pera R.A.R., Underwood J., Wong W.H.
Proc. Natl. Acad. Sci. USA 2013 110 (50) E4821-E4830 [Manuscript]
Oct4-Sall4-Nanog network controls developmental progression in the preimplantation mouse embryo
Tan, M.H.*, Au, Kin Fai*, Leong, D.E., Foygel, K., Wong, W.H., Yao, M.W.M.
Molecular Systems Biology. 2013. 9:632. [Manuscript] * These authors contributed equally to this work
02-02-2015: Minor updates and bug fixes to LSC 1.beta
Updated the aligners' default command options for better performance.
- Updated Novoalign command options (tested with novocraftV3.02.04 or later)
- Updated RazerS3 command options to be compatible with the latest version (razers3 3.1.1)
- Added an extra clean_up option value to remove all generated intermediate files in case of disk space issues
- Fixed a bug in generating full_LR.fa sequences that caused some full read sequences to miss a couple of bases
08-04-2014 - Major Update to IDP
This is a major update to IDP. It includes changes to software requirements, additional features, and bug fixes.
- 1. The IDP software is now licensed under Apache 2.0 (a very open license)
- 2. BLAT and seqmap (part of SpliceMap) aligners are no longer bundled. Paths to aligner executables must be specified in the config file if they are not installed under their default names.
- 3. GMAP can now be used instead of BLAT by setting 'aligner_choice' to either 'gmap' or 'blat' in the config file. GMAP also requires that the folder holding its index be specified in the config file.
- 4. An option to use maximum likelihood estimation (MLE) rather than maximum a posteriori (MAP) estimation is available by setting 'estimator_choice' to 'MAP' or 'MLE' in the config file. MAP is used by default, but for data sets with few long reads, where few isoforms are detected, MLE should be used.
- 5. Fixed a bug in the previous version where IDP failed to generate the file when 'detected_exp_len' was left blank.
08-01-2014 - Preparing long read outputs of LSC for use in IDP
Please concatenate the LSC outputs corrected.fa and full.fa, and use the resulting FASTA file as your long read input for IDP.
The reason is that corrected.fa loses some flanking sequences of the long reads that were not corrected by short reads, and those regions may still contain informative junctions. If you use only corrected.fa, you could lose this information. full.fa includes those flanking regions in addition to the corrections that were made. However, if you use only full.fa, the IDP algorithm is likely to throw out many of those long reads for lacking short read support for junctions in those regions. If you combine the two datasets, you will not suffer any loss of information, and any redundancies will be handled by IDP.
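The concatenation step can be done with any tool; as a minimal sketch in Python, assuming both files are plain (uncompressed) FASTA, the helper below appends the inputs in order and adds a missing final newline so that the last record of one file cannot fuse with the first header of the next. The function name `concatenate_fasta` and the output name used in the note below are illustrative, not part of LSC or IDP.

```python
def concatenate_fasta(inputs, output, chunk_size=1 << 20):
    """Write the contents of each input FASTA file to `output`, in order.

    Files are copied in binary chunks, so large read sets are never
    loaded into memory at once.
    """
    with open(output, "wb") as out:
        for path in inputs:
            last_byte = b"\n"  # an empty input needs no extra newline
            with open(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last_byte = chunk[-1:]
            # Make sure the next file's first header starts on a new line.
            if last_byte != b"\n":
                out.write(b"\n")
```

For example, `concatenate_fasta(["corrected.fa", "full.fa"], "LR_for_IDP.fa")` would produce a combined file (the name `LR_for_IDP.fa` is hypothetical) to pass to IDP as the long read input.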
04-24-2014 - IDP 0.1.2 minor update is released
This minor update fixes several bugs.
04-17-2014 - IDP 0.1.1 minor update is released
This minor update fixes several bugs and is accompanied by a convenient, small test dataset available in the tutorial.
12-01-2013: A faster and much less memory-intensive LSC 1.alpha is released
In LSC 0.3.0 and 0.3.1, we optimized the Bowtie2 and BWA settings to obtain many more short read alignments, which greatly improves the accuracy of error correction. However, the additional alignments also require much more running time (in both the alignment and the subsequent error correction steps) and memory. As a result, a few users had difficulty running LSC 0.3.0 or 0.3.1.
In LSC 1.alpha, we apply a probabilistic algorithm (the "SCD" option) to select "enough" short read alignments for error correction, which significantly reduces running time and memory usage. LSC 1.alpha does NOT sacrifice error correction performance (sensitivity and specificity); please see http://www.healthcare.uiowa.edu/labs/au/LSC/LSC_manual.html#aligner. The running time is 30-50% of that of LSC 0.3.1, and peak memory usage drops to ~10G regardless of the data size.
- Added a probabilistic algorithm ("SCD" option) to pre-select SR alignment results based on LR-SR alignment coverage depth (significant improvement in running time and memory usage)
- Removed the requirement to load the SR dataset in memory to generate the LR-SR mapping file (significant improvement in memory usage)
- Added the option "sort_max_mem" in run.cfg to control the maximum memory used by the unix sort command and avoid unexpected memory crashes
- Fixed a bug in generating the FASTQ file (it affected some of the QualityValue computation results)
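The actual SCD procedure is described in the LSC manual linked above. Purely as an illustration of the general idea of thinning alignments according to coverage depth, and not as LSC's implementation, the sketch below keeps each short read alignment on a long read with probability `target_depth / mean_depth` whenever the mean depth exceeds a target. The function name, the `target_depth` parameter and the keep-probability rule are all assumptions made for this example.

```python
import random


def select_alignments(alignments, lr_length, target_depth, rng=None):
    """Randomly thin short read alignments on one long read.

    alignments: list of (start, end) half-open intervals on the long read.
    If the mean depth is above target_depth, each alignment is kept
    independently with probability target_depth / mean_depth, so the
    expected mean depth after selection equals target_depth.
    """
    if lr_length <= 0:
        raise ValueError("lr_length must be positive")
    rng = rng or random.Random()
    aligned_bases = sum(end - start for start, end in alignments)
    mean_depth = aligned_bases / lr_length
    if mean_depth <= target_depth:
        # Coverage is already modest: keep every alignment.
        return list(alignments)
    keep_prob = target_depth / mean_depth
    return [aln for aln in alignments if rng.random() < keep_prob]
```

Passing a seeded generator, for example `rng=random.Random(0)`, makes the selection reproducible between runs.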
11-26-2013 - IDP 0.1 and the manual and a tutorial are released
IDP integrates short reads (e.g. Illumina data) and long reads (e.g. PacBio data) to identify gene isoforms (transcripts) from the transcriptome (see Figure above).
- One input of IDP is the short-read RNA-seq results: junctions (bed file) AND alignments of short reads (sam file).
Most RNA-seq tools, such as SpliceMap and Tophat can output these two files.
- The other input is the long reads: raw sequences (FASTA file) OR alignment of long reads (PSL file by BLAT or GPD file)
Error-corrected long reads from PacBio data are preferred. LSC is our default error-correction tool.
- The IDP outputs are gene isoform identifications and quantification of genes and gene isoforms. The hESC transcriptome (H1 cell line) is the first one identified by this method. For more details of this transcriptome, please see its homepage http://www.healthcare.uiowa.edu/labs/au/IDP/hESC.html and our paper Characterization of the human ESC transcriptome by hybrid sequencing [preprint].
11-26-2013: Homepage of the hESC transcriptome identified by the SpliceMap-LSC-IDP pipeline is released
The homepage of the hESC transcriptome (H1 cell line) is released. You can also find novel genes, novel isoforms of existing genes (including pluripotency markers) and novel ncRNA on this website: http://www.healthcare.uiowa.edu/labs/au/IDP/hESC.html
The details of this hESC transcriptome can be found in our publication: Characterization of the human ESC transcriptome by hybrid sequencing [preprint]
11-26-2013: IDP and hESC transcriptome paper is released
Kin Fai Au, Vittorio Sebastiano, Pegah Tootoonchi Afshar, Jens Durruthy Durruthy, Lawrence Lee, Brian A. Williams, Honoratus Van Bakel, Eric Schadt, Renee A. Reijo Pera, Jason Underwood, Wing Hung Wong
Characterization of the human ESC transcriptome by hybrid sequencing [preprint]
09-30-2013: More robust and faster LSC 0.3.1
LSC 0.3.1 no longer uses a pseudo chromosome, which reduces the alignment time to ~10% (in Bowtie2 mode). You can also now easily re-run crashed jobs.
- Removed pseudo-chr processing
- Accept compressed SR as input (should be named SR.fa.cps/SR.fa.cps.idx in any folder)
- Added "runLSC -cleanup" option to remove redundant files (per thread split, remaining _tmp files) if the run was successful at the end.
- Changed convertNav to sort reads and then generate LR_SR.map (memory optimization instead of loading all alignments in memory)
- Changed "print" to system.echo (messages were not printed out in qsub output files)
- Slightly changed the "cleanup" option to keep per-thread data (*.aa, *.ab, ..). This is useful when one thread has crashed and you want to re-run only that thread at the end
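The memory optimization noted for convertNav above (sort first, then build the mapping) follows a common pattern: once alignment records are sorted by long read name, all records for one long read are adjacent and can be processed group by group. The sketch below shows that pattern only; the tab-separated `LR_name<TAB>SR_name` line format and the function name are hypothetical and do not reflect the actual LSC file formats.

```python
from itertools import groupby


def iter_lr_groups(sorted_lines):
    """Yield (lr_name, [sr_name, ...]) from lines sorted by long read name.

    Each line is assumed to look like 'LR_name<TAB>SR_name' (a made-up
    format for this example). Because the input is sorted, only the
    alignments of one long read are held in memory at a time.
    """
    records = (
        line.rstrip("\n").split("\t") for line in sorted_lines if line.strip()
    )
    for lr_name, group in groupby(records, key=lambda rec: rec[0]):
        yield lr_name, [rec[1] for rec in group]
```

The input can be any iterable of lines, such as an open file that was sorted beforehand with an external sort, so the full alignment set never has to fit in memory.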
08-07-2013: Big changes in LSC 0.3
LSC 0.3 brings several updates: very IMPORTANT updates, new features and small fixes.
Very IMPORTANT updates:
- Support for Bowtie2 and RazerS3 as initial aligners. Now, BWA, Bowtie2, RazerS3 and Novoalign work in LSC. Please see the comparison details of aligners in the "Short read - Long read aligner#manual".
- Added SR length coverage percentage on LR (SR-covered length/full length of corrected LR) to corrected_LR output file. Here is an example, where the last number 0.82 is the SR length coverage percentage on LR:
- Added support for three modes for step-wise runs:
  - mode 0: end-to-end
  - mode 1: generating LR_SR.map file
  - mode 2: correction step
- Generating FASTQ output format based on correction probability given short read coverage. Please refer to the LSC paper and manual page for more details. You can select well-corrected reads for downstream analyses by using the quality in the FASTQ output or the SR length coverage percentage above. Please see the filtering in the "Output#manual".
- Used the python path in the cfg file instead of default user/bin path
- Added option (-clean_up) to remove intermediate files or not (Note: important/useful ones will still be there in temp folder)
- Support for input fastq format for LR (long reads) and/or SR (short reads)
- Updated default BWA and novoalign commands options
- Printing out original LR names in the output file
- Support for printing out version number using -v/-version option
Small bug fixes
- Fixed a bug by removing the XZ pattern that was printed out at the end of some uncorrected_LR sequences
- Fixed samParser bug (which was ignoring some valid alignments in BWA output)
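The SR length coverage percentage introduced in LSC 0.3 is defined in the notes above as the SR-covered length divided by the full length of the corrected LR. As a minimal sketch of that ratio, assuming short read alignments are given as 0-based, half-open `(start, end)` intervals on the long read, the helper below counts each covered position once even when alignments overlap. The function name and the interval convention are assumptions for this example, not LSC's code.

```python
def sr_coverage_fraction(sr_intervals, lr_length):
    """Return the fraction of long read positions covered by >= 1 short read.

    sr_intervals: iterable of (start, end) half-open intervals (0-based).
    Overlapping intervals are merged so no position is counted twice.
    """
    if lr_length <= 0:
        raise ValueError("lr_length must be positive")
    covered = 0
    covered_up_to = 0  # positions before this index are already counted
    for start, end in sorted(sr_intervals):
        start = max(start, covered_up_to, 0)
        end = min(end, lr_length)
        if end > start:
            covered += end - start
            covered_up_to = end
    return covered / lr_length
```

A downstream filter could then keep only long reads whose fraction is at or above a chosen cutoff; the appropriate cutoff depends on the analysis and is not specified here.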