Valcano

The pipeline to characterize the LTR-RTs family, classify and predict the burst families**

How it works:

1) Get the 5-LTR sequence

perl list_to_5ltr.pl reference.fa.pass.list reference.fa > ltr.fa 

2)cluster the 5-LTR sequence and obtain the LTR lib and list

cd-hit-est -i ltr.fa -o clust.out -c 0.8 -aL 0.8 -T 0 -M 0 -n 5 -d 200

3) obtain the lib name and members

perl obtain_lib_list.pl clust.out.clusr > cluster.list 

4) run Repeatmasker to masker the genome

RepeatMasker -e ncbi -pa 80  -q -no_is -norna -nolow -div 40 -lib LTRlib.fa -cutoff 225  genome.fa  

5) obtain the copies numbers and coverage of each repeat element family

perl fam_coverage.pl LTRlib.fa genome.out genome_size_bp > fam_coverage  

6) obtain the full-length LTR-RT sequences and assign accessions to each LTR-RTs

​ merge the accession with the LTR-family list

1. perl list2_ltr_seq.pl Ta.fa.pass.list Ta.fa full_ltr

full_ltr: prefix of LTR-RT sequences and LTR-RT accessions

2. perl merged_accession.pl full_ltr.ltr.acc  cluster.list > cluster_ltr_acc

7) obtain RT by tblastn

makeblastdb -in full_ltr.ltr.fa -dbtype nucl

tblastn -query ../../Gyp/Gyp_marker.fa -db Art.gyp.fa -out out2 -max_target_seqs 1000000000 -max_hsps 1 -evalue 10e-5 

Gyp_marker.fa: RT sequence 
Art.gyp.fa: LTR full sequence 

8) select the longest ORFs for each LTRs

extract_RT_from_blast.pl out2  |sort -k 1,1 -k 3,3nr -|perl -e 'while(<>){chomp;@a=split/\t/,$_;$hash{$a[0]}++;if($hash{$a[0]}==1){print ">$a[0]\n$a[4]\n";}}' -> gyp.RT.fa  

9) merge RT domain sequence with marker RT sequences

cat copia.RT.fa ./copia.marker.fa > copia.rt.fa
cat gyp.RT.fa ./gyp.marker.fa > gyp.rt.fa 

10) multiple alignment and construct tree and calculate distance

mafft copia.rt.fa > copia.rt.align
fastree -quote copia.rt.align > copia.rt.tree

mafft gyp.rt.fa > gyp.rt.align
fastree -quote gyp.rt.align > gyp.rt.tree

11) assign the domain-based classification to the merged the accession list with the LTR-family

perl assign_domain_based.pl full_ltr copia.rt.fa gyp.rt.fa >cluster_ltr_acc_domain

Continue from 5)

12) We try to investigate the the activate LTR-RTs in each LTR-RTs family

Obtain Lib names and numbers of activate LTR-RTs and Add the list to the coverage file

perl obtain_lib_list_num.pl clust.out.clstr >clust.out.clstr.list

perl add_lib_num_to_TE_cov.pl clust.out.clstr.list fam_coverage  >fam_coverage_with_clster.list

13) Assign treeid to the TE family

Re-judge the LTR-RT family Type based on the LTRs sequence and the conserved RT sequences

perl re_judge.pl clust.out.clstr.list cluster_ltr_acc_domain > re_judge.out

14) Merge the Coverage file, Family copy number, Treeid file

perl add_family_info.pl re_judge.out TE.cov > fam_coverage.info

15) one in all

volcano [options] ltr.list genome.fa prefix

Program: volcano
Version: 1.0
Contact: Fei Shen <shenf1028@gmail.com>

Usage:   volcano [options] ltr.list genome.fa prefix