User guide¶
Parameters¶
To view command-line parameters type miRge-build -h
:
usage: miRge-build [options]
miRge-build (Enables building small-RNA libraries for an organism of choice to use in the miRge3.0 pipeline)
optional arguments:
-h, --help show this help message and exit
--version show program`s version number and exit
Options:
-g, --genome genome file in fasta format (.fna, .fasta or .fa) (Required)
-mmf, --mature-mir mature miRNA file in fasta format (Required)
-hmf, --hpin-mir hairpin miRNA file in fasta format (Required)
-mtf, --mature-trna mature tRNA file in fasta format (Required)
-ptf, --pre-trna precursor tRNA file in fasta format (Required)
-snorf, --snorna snoRNA file in fasta format (Required)
-rrf, --rrna rRNA file in fasta format (Required)
-ncof, --ncrna-other all other non-coding RNA in fasta format (Required)
-mrf, --mrna mRNA file in fasta format (Required)
-spnf, --spike-in user defined custom spike-in file in fasta format (Optional)
-agff, --ann-gff miRNA annotation gff file (Required)
-ngrs, --gen-repeats the genome repeats file with .gtf extension (Optional: output however enables novel miRNA prediction in the miRge pipeline)
-db, --mir-DB name of the database to be used (Options: miRBase, miRGeneDB) (Required)
-on, --organism-name name of the organism [Note: name should be one word and use "_" as separator if necessary] (Required)
-cpu, --threads the number of processors to use for trimming, qc, and alignment (Default: 1)
-pbwt, --bowtie-path path to system`s directory containing bowtie binary (Required if bowtie is not in the environment path)
File format options¶
Having the right file format is important before making miRge libraries. When dealing with new species which are not available in the set of miRge3.0 libraries, it is important to prioritize what is essential. Novel miRNAs runs scipy cKDTree during library preparation and it consumes a lot of computational resources and time depending on the genome size (up to 10 hours). Making a general build without novel miRNA detection is more straight forward and faster to build libraries.
General format options¶
Example usage¶
Example command usage:
miRge-build -g genome.fasta -mmf nematode_mature_miRBase.fa -hmf hairpin_miR.fa -mtf mature_trna.fasta -ptf pre_trna.fasta -snorf snorna.fasta -rrf rrna.fasta -ncof ncrna_other.fasta -mrf mrna.fasta -agff nematode_miRBase.gff3 -db miRBase -on roundworm -cpu 10 -ngrs WBcel235_genome_repeats.GTF
Output command line:
bowtie version: 1.2.3
Library indexing in progress...
Building the kdTree of roundworm_genome_repeats.GTF....
Building the kdTree of roundworm_genome_repeats.GTFtakes: 1.4s
Transforming roundworm_genome.fa takes: 0.9s
miRge-build is complete in 108.2122 second(s)
Output directory structure:
DB = '--mir-DB'; name of the database used (miRBase or miRGeneDB)
Organism
├── annotation.Libs
│ ├── organism_DB.gff3
│ ├── organism_genome_repeats.pckl (if `-ngrs` is opted)
│ ├── organism_miRNAs_in_repetitive_element_DB.csv (if `-ngrs` is opted)
│ └── organism_merges_DB.csv
├── fasta.Libs
│ ├── organism_genome.pckl (if `-ngrs` is opted)
│ └── organism_merges_DB.fa
└── index.Libs
├── organism_genome*.ebwt
├── organism_hairpin_DB*.ebwt
├── organism_mirna_DB*.ebwt
├── organism_mature_trna*.ebwt
├── organism_pre_trna*.ebwt
├── organism_rrna*.ebwt
├── organism_snorna*.ebwt
├── organism_mrna*.ebwt
├── organism_ncrna_others*.ebwt
├── organism_mature_trna*.ebwt
└── organism_spike-in*.ebwt (Optional)
Name of the database¶
miRge uses miRBase or miRGeneDB as the reference database.
So, it is mandatory to use -db
option to either -db miRBase
or -db miRGeneDB
. Reference miRNA database -db
and annotation GFF -agff
files can be found at miRGeneDB and miRBase.
Name of the organism¶
miRge-build creates and stores all the libraries under the folder which is named after the organism. It is recommended to use a simple name and avoid any special character (use “_” if the name needs to be seperated by a space). Example: -on human
; -on horse
; -on golden_lemur
; -on my_database
etc.
Fasta format¶
Parameters with -g
, -mmf
, -hmf
, -mtf
, -ptf
, -snorf
, -mrf
, -spnf
should be in FASTA format as shown below. -spnf
or –spike-in is optional if the user is interested in adding an additional database with spike-in reads.
FASTA Format:
>Header or Identifier
NUCLEOTIDE SEQUENCE
Ex:
>hsa-let-7a-5p
TGAGGTAGTAGGTTGTATAGTT
NOTE:
The Header ID of hairpin miRNA FASTA should match the mature miRNA FASTA file. This is required for accurate isomiR annotation.
miRge-build fetches 2bp upstream to 5p and 6bp downstream to 3p mature miRNA from the hairpin miRNA based on the matching ID.
Exception: If the mature miRNA name contains XXXX-5p, XXXX-3p, XXXX-[5|3]p*, XXXX_5p or XXXX_3p where XXXX matches the hairpin miRNA ID.
Also, if this is not possible, miRge will not throw any errors, however, and it will proceed with the user provided files.
Novel miRNA options¶
Novel miRNA prediction requires the genome file (which is provided in the general format) and genome repeats file in GTF format, -ngrs
. As mentioned previously, novel miRNA analysis consumes a lot of computational resources and time.
Custom annotation options¶
This is optional, that two files under the annotation.Libs
subdirectory requires users input manually.
_merges_¶
This file structured as organism_merges_database.csv
allows users to define a miRNA family for miRNAs with similar sequences. This method is described in detail in the original miRge manuscript (Baras et al Plos One, 2015).
Below is the guide to format the file, where hsa-miR-376b-5p/376c-5p
is the name of the miRNA family seperated by /
followed by the family members such as hsa-miR-376b-5p
and hsa-miR-376c-5p
all separated by ,
. The next such miRNA family should begin in a new line. Here, four such examples are shown below.
hsa-miR-376b-5p/376c-5p,hsa-miR-376b-5p,hsa-miR-376c-5p
hsa-miR-518c-3p/518f-3p,hsa-miR-518c-3p,hsa-miR-518f-3p
hsa-miR-642a-3p/642b-3p,hsa-miR-642a-3p,hsa-miR-642b-3p
hsa-miR-3155a-3p/3155b,hsa-miR-3155a-3p,hsa-miR-3155b
hsa-miR-3689b-3p/3689c,hsa-miR-3689b-3p,hsa-miR-3689c
_miRNAs_in_repetitive_element_¶
This file structured as organism_miRNAs_in_repetitive_element_database.csv
allows users to define miRNAs that overlap with repeat elements in the genome. This eliminates miRNA reads to be identified as novel miRNAs or identifying one as A-to-I editing, both of which might be misleading.
Below is the guide to format the file, where miRNA names which overlaps with repeat elements are separated by ,
. The gene_id
and transcript_id
of a repeat element should follow the miRNA name. See the example below:
hsa-miR-28-5p,gene_id "L2c"; transcript_id "L2c_dup8856";
hsa-miR-28-3p,gene_id "L2c"; transcript_id "L2c_dup8856";
hsa-miR-95-5p,gene_id "L2c"; transcript_id "L2c_dup382";
hsa-miR-95-3p,gene_id "L2b"; transcript_id "L2b_dup437";
hsa-miR-181c-5p,gene_id "MamRTE1"; transcript_id "MamRTE1_dup11";
Resources¶
- The genome repeats can be obtained from UCSC
- The database sequences for other small RNA can be obtained from UCSC or Ensembl
- Bowtie-v1.2.3 - please pick one based on your OS.