RefLnc

Method

RNA-Seq datasets

We use two RNA-Seq datasets. For transcriptome reconstruction, we screened 7,849 RNA-Seq samples in the GTEx project (v6) using three criteria: (1) the sample is a normal human tissue or cell line (SMSTYP = "Normal"); (2) the RNA integrity number (RIN) is > 6.0; (3) the donor met the overall eligibility criteria for GTEx collection based on answers to the eligibility questions (INCEXC = "TRUE"). For the tumor analysis, we exclude FFPE (formalin-fixed, paraffin-embedded) samples from The Cancer Genome Atlas (TCGA) and retain 6,317 samples from 18 tumor types that were frozen soon after surgery to prevent degradation of RNA and DNA.
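The three GTEx screening criteria can be expressed as a single predicate over sample metadata. The following is a minimal sketch, not the pipeline actually used: SMSTYP and INCEXC are the attribute names given above, while the `Sample` record, the `rin` field name, and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    """Hypothetical metadata record for one RNA-Seq sample."""
    sample_id: str
    smstyp: str   # SMSTYP attribute, e.g. "Normal"
    rin: float    # RNA integrity number (field name is an assumption)
    incexc: str   # INCEXC donor eligibility flag, e.g. "TRUE"


def passes_gtex_filters(sample: Sample, min_rin: float = 6.0) -> bool:
    """Return True if the sample meets all three screening criteria."""
    return (
        sample.smstyp == "Normal"      # criterion (1)
        and sample.rin > min_rin       # criterion (2): strictly greater than 6.0
        and sample.incexc == "TRUE"    # criterion (3)
    )


def select_samples(samples: Iterable[Sample]) -> List[Sample]:
    """Keep only the samples that pass every criterion."""
    return [s for s in samples if passes_gtex_filters(s)]
```

Note that the RIN threshold is strict, so a sample with RIN exactly 6.0 is excluded in this sketch.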

Read mapping and transcriptome assembly

A standard RNA-Seq analysis pipeline is applied to all samples. We use HISAT2 (version 2.0.1-beta) (Kim et al. 2015) to map the sequencing reads to the human reference genome (version hg38/GRCh38) with the reference splice sites provided (--known-splicesite-infile). We use StringTie (v1.2.2) (Pertea et al. 2015) to assemble transcripts in a reference-guided manner (-G). The reference and assembled transcript models are merged with StringTie merge (-F1) to obtain the merged transcript model. Novel transcripts are identified by comparing the merged transcript model with the reference model using cuffcompare (Trapnell et al. 2010) and keeping transcripts whose class code is neither "=" nor "c" (code != "=" && code != "c"). The preliminary transcript model is obtained by directly combining the reference transcript model and the novel transcript model.
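The class-code filter used to define novel transcripts can be sketched as follows. This is an illustrative sketch only: it assumes the comparison output is available as GTF lines whose attribute column carries `transcript_id` and `class_code` fields, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Set

# Matches GTF attributes of the form: key "value";
_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse the ninth GTF column into a key -> value dictionary."""
    return dict(_ATTRIBUTE.findall(field))


def novel_transcript_ids(gtf_lines: Iterable[str],
                         known_codes=("=", "c")) -> Set[str]:
    """Collect IDs of transcripts whose class code is neither "=" nor "c"."""
    novel = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a well-formed GTF record
        attrs = parse_attributes(columns[8])
        code = attrs.get("class_code")
        transcript_id = attrs.get("transcript_id")
        if code is None or transcript_id is None:
            continue  # record carries no class code, so it cannot be judged
        if code not in known_codes:
            novel.add(transcript_id)
    return novel
```

Transcripts with code "=" (exact match to a reference transcript) or "c" (contained in a reference transcript) are treated as already known and are skipped.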

Estimating expression abundance and normalization

We estimate the expression levels (FPKM) and read coverage for the preliminary transcript model by running StringTie (v1.2.2) (Pertea et al. 2015) in its expression abundance estimation mode (StringTie -e -b). Quantile normalization is then applied to the expression values to account for library size factors.
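For readers unfamiliar with quantile normalization, the general technique can be sketched in a few lines. This is a generic textbook version, not necessarily the exact procedure used here: it assumes one list of expression values per sample, all covering the same transcripts in the same order, and it breaks ties by input order rather than averaging tied ranks. The function name is hypothetical.

```python
from typing import List, Sequence


def quantile_normalize(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Quantile-normalize samples so that all share the same distribution.

    `columns` holds one sequence of expression values per sample.
    """
    if not columns:
        return []
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of values")

    # For each sample, the indices that would sort its values ascending.
    orders = [sorted(range(n), key=col.__getitem__) for col in columns]

    # Mean of the k-th smallest value across all samples.
    rank_means = [
        sum(col[order[k]] for col, order in zip(columns, orders)) / len(columns)
        for k in range(n)
    ]

    # Replace each value by the mean associated with its rank in its sample.
    normalized = []
    for order in orders:
        out = [0.0] * n
        for k, idx in enumerate(order):
            out[idx] = rank_means[k]
        normalized.append(out)
    return normalized
```

After normalization every sample contains the same set of values, so differences between samples reflect only the rank of each transcript within its sample.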

lncRNA identification and classification

We identify novel lncRNAs using the following two filters: (1) size selection (length > 200 bp) and (2) lack of coding potential. We develop a stringent filtering pipeline to remove novel transcripts with evidence of protein-coding potential. First, we integrate Coding Potential Calculator (CPC) (Kong et al. 2007) and Coding Potential Assessment Tool (CPAT) (Wang et al. 2013): transcripts that are predicted to lack coding potential by either CPAT or CPC are regarded as preliminary noncoding RNAs. Second, we conceptually translate these preliminary noncoding RNAs in three reading frames with ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/). Finally, we search these translated sequences against three sources of coding evidence: the Pfam database (Finn et al. 2016) with three cutoffs (ga/nc/tc); the 2,201 mass spectrometry samples from the Human Proteome Map (Kim et al. 2014), using X!Tandem (Craig and Beavis 2004); and the 61 Ribo-Seq profiling samples from the SRA database (Leinonen et al. 2011), using RibORF (Ji et al. 2015). We remove transcripts with any hit in the Pfam database, the mass spectrometry data, or the Ribo-Seq samples, and obtain the final lncRNA catalog. The lncRNAs are compared to protein-coding transcripts with cuffcompare (Trapnell et al. 2010), and lncRNAs with the class code "u" are defined as "intergenic". lncRNAs that overlap the exons of protein-coding transcripts are defined as "exonic". The remaining lncRNAs are referred to as "others".
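The decision logic of the filtering and classification steps can be summarized in code. The sketch below only combines results that are assumed to have been computed beforehand by the tools named above; it does not run any of them. The `Evidence` record, its field names, and both function names are hypothetical, and the sketch assumes that the "intergenic" check takes precedence over the "exonic" check.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical per-transcript summary of precomputed tool results."""
    length: int            # transcript length
    cpc_noncoding: bool    # CPC predicts no coding potential
    cpat_noncoding: bool   # CPAT predicts no coding potential
    pfam_hit: bool         # any Pfam hit for a translated frame
    ms_hit: bool           # any mass spectrometry hit
    riboseq_hit: bool      # any Ribo-Seq hit


def is_lncrna(e: Evidence, min_length: int = 200) -> bool:
    """Apply the size filter and the coding-potential filters in order."""
    if e.length <= min_length:
        return False  # filter (1): length must exceed 200
    if not (e.cpc_noncoding or e.cpat_noncoding):
        return False  # neither CPC nor CPAT calls it noncoding
    # Any hit in Pfam, mass spectrometry, or Ribo-Seq removes the transcript.
    return not (e.pfam_hit or e.ms_hit or e.riboseq_hit)


def classify_lncrna(class_code: str, overlaps_coding_exon: bool) -> str:
    """Assign one of the three lncRNA classes."""
    if class_code == "u":
        return "intergenic"
    if overlaps_coding_exon:
        return "exonic"
    return "others"
```

For example, a 1,500-long transcript called noncoding by CPAT only, with no Pfam, mass spectrometry, or Ribo-Seq hit, passes `is_lncrna`; the same transcript with a single Pfam hit does not.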