Volcano
A pipeline to characterize LTR-RT families, classify them, and predict burst families
How it works:
1) Get the 5-LTR sequence
perl list_to_5ltr.pl reference.fa.pass.list reference.fa > ltr.fa
2) Cluster the 5-LTR sequences and obtain the LTR library and list
cd-hit-est -i ltr.fa -o clust.out -c 0.8 -aL 0.8 -T 0 -M 0 -n 5 -d 200
3) Obtain the library names and members
perl obtain_lib_list.pl clust.out.clstr > cluster.list
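To make the clustering output easier to follow, here is a minimal sketch of how a cd-hit `.clstr` file can be turned into a list of library names and members. The function name `clstr_to_list` and the output layout (representative, tab, comma-separated members) are assumptions for illustration; the actual output format of obtain_lib_list.pl may differ.

```shell
# Hypothetical helper (not part of Volcano): print one line per cluster as
#   representative<TAB>comma-separated members
clstr_to_list() {
  awk '
    /^>Cluster/ {                      # a new cluster starts: flush the previous one
      if (members != "") print rep "\t" members
      rep = ""; members = ""
      next
    }
    {
      id = $3                          # third field looks like ">seqID..."
      sub(/^>/, "", id)
      sub(/\.\.\.$/, "", id)
      if ($NF == "*") rep = id         # cd-hit marks the representative with "*"
      members = (members == "" ? id : members "," id)
    }
    END { if (members != "") print rep "\t" members }
  ' "$@"
}

# Tiny made-up example in .clstr layout
printf '>Cluster 0\n0\t1500nt, >ltrA... *\n1\t1420nt, >ltrB... at +/92.10%%\n>Cluster 1\n0\t980nt, >ltrC... *\n' |
  clstr_to_list
```

The function reads from standard input or from a file argument, so `clstr_to_list clust.out.clstr` would also work.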
4) Run RepeatMasker to mask the genome
RepeatMasker -e ncbi -pa 80 -q -no_is -norna -nolow -div 40 -lib LTRlib.fa -cutoff 225 genome.fa
5) Obtain the copy number and coverage of each repeat element family
perl fam_coverage.pl LTRlib.fa genome.out genome_size_bp > fam_coverage
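As an illustration of what copy number and coverage mean here, the following sketch tallies hits per repeat name from a RepeatMasker `.out` table and expresses the masked bases as a percentage of the genome size. The function name `fam_cov` and the four-column output are assumptions, not the behavior of fam_coverage.pl; overlapping hits are counted twice in this simplified version.

```shell
# Hypothetical helper (not part of Volcano).
# Usage: fam_cov GENOME_SIZE_BP [repeatmasker.out]
# Output: family<TAB>copies<TAB>masked_bp<TAB>percent_of_genome
fam_cov() {
  size=$1; shift
  awk -v size="$size" '
    # data lines start with a numeric Smith-Waterman score; header lines do not
    $1 ~ /^[0-9]+$/ && NF >= 11 {
      n[$10]++                   # column 10: name of the matching repeat
      bp[$10] += $7 - $6 + 1     # columns 6 and 7: begin and end in the query
    }
    END {
      for (f in n)
        printf "%s\t%d\t%d\t%.4f\n", f, n[f], bp[f], 100 * bp[f] / size
    }
  ' "$@" | sort
}

# Tiny made-up example in .out layout, genome size 1000 bp
printf '%s\n' \
  '   SW   perc perc perc  query  position in query  matching  repeat' \
  'score   div. del. ins.  sequence  begin end (left)  repeat  class/family' \
  '' \
  ' 1234 10.0 0.0 0.0 chr1 101 200 (800) + famA LTR/Gypsy 1 100 (0) 1' \
  ' 1100 12.0 0.0 0.0 chr1 301 350 (650) + famA LTR/Gypsy 1 50 (50) 2' \
  ' 2000  5.0 0.0 0.0 chr1 401 600 (400) C famB LTR/Copia (0) 200 1 3' |
  fam_cov 1000
```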
6) Obtain the full-length LTR-RT sequences and assign an accession to each LTR-RT,
then merge the accessions with the LTR-family list
1. perl list2_ltr_seq.pl Ta.fa.pass.list Ta.fa full_ltr
full_ltr: prefix for the LTR-RT sequence and LTR-RT accession output files
2. perl merged_accession.pl full_ltr.ltr.acc cluster.list > cluster_ltr_acc
7) Obtain the RT domains with tblastn
makeblastdb -in full_ltr.ltr.fa -dbtype nucl
tblastn -query ../../Gyp/Gyp_marker.fa -db Art.gyp.fa -out out2 -max_target_seqs 1000000000 -max_hsps 1 -evalue 10e-5
Gyp_marker.fa: RT marker protein sequences (query)
Art.gyp.fa: full-length LTR-RT sequences (database)
8) Select the longest ORF for each LTR-RT
extract_RT_from_blast.pl out2 | sort -k 1,1 -k 3,3nr | perl -e 'while(<>){chomp;@a=split/\t/,$_;$hash{$a[0]}++;if($hash{$a[0]}==1){print ">$a[0]\n$a[4]\n";}}' > gyp.RT.fa
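The selection logic in this step is: sort the records by ID and then by the third column in descending numeric order, and keep only the first record seen for each ID. A minimal sketch of the same idea with awk follows. The function name `best_per_id` is hypothetical, and the reading of column 3 as a length and column 5 as the sequence is inferred from the one-liner rather than from the script itself.

```shell
# Hypothetical helper (not part of Volcano): keep the top-ranked record per ID
# from tab-separated input and print it as FASTA.
best_per_id() {
  tab=$(printf '\t')
  sort -t "$tab" -k1,1 -k3,3nr |
    awk -F'\t' '!seen[$1]++ { print ">" $1; print $5 }'
}

# Tiny made-up example: two candidates for ltr1, one for ltr2
printf 'ltr1\tx\t50\ty\tAAA\nltr1\tx\t80\ty\tBBB\nltr2\tx\t10\ty\tCCC\n' |
  best_per_id
```

The `!seen[$1]++` test is true only the first time an ID appears, which mirrors the `$hash{$a[0]}==1` check in the Perl one-liner.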
9) Merge the RT domain sequences with the marker RT sequences
cat copia.RT.fa ./copia.marker.fa > copia.rt.fa
cat gyp.RT.fa ./gyp.marker.fa > gyp.rt.fa
10) Run multiple alignment, construct the trees, and calculate distances
mafft copia.rt.fa > copia.rt.align
FastTree -quote copia.rt.align > copia.rt.tree
mafft gyp.rt.fa > gyp.rt.align
FastTree -quote gyp.rt.align > gyp.rt.tree
11) Assign the domain-based classification to the merged accession and LTR-family list
perl assign_domain_based.pl full_ltr copia.rt.fa gyp.rt.fa >cluster_ltr_acc_domain
Continue from 5)
12) Investigate the active LTR-RTs in each LTR-RT family
Obtain the library names and numbers of active LTR-RTs, and add the list to the coverage file
perl obtain_lib_list_num.pl clust.out.clstr >clust.out.clstr.list
perl add_lib_num_to_TE_cov.pl clust.out.clstr.list fam_coverage >fam_coverage_with_clster.list
13) Assign a tree ID to each TE family
Re-judge the LTR-RT family type based on the LTR sequences and the conserved RT sequences
perl re_judge.pl clust.out.clstr.list cluster_ltr_acc_domain > re_judge.out
14) Merge the coverage file, family copy numbers, and tree ID file
perl add_family_info.pl re_judge.out TE.cov > fam_coverage.info
15) All in one
volcano [options] ltr.list genome.fa prefix
Program: volcano
Version: 1.0
Contact: Fei Shen <shenf1028@gmail.com>
Usage: volcano [options] ltr.list genome.fa prefix