File format specifications
File formats in bioinformatics are notoriously hard to standardize. We hope that this documentation gives the user a clear idea of what is needed as input to Swan.
GTF
In Swan, transcript models are loaded from GTFs. To work with Swan, GTFs must adhere to the following specifications:
Must contain both transcript and exon features. We would like to remove this dependency in the future, but for now it is required.
Must contain gene_id and transcript_id attributes in column 9 for both transcript and exon entries.
Recommended: including the transcript_name and gene_name attributes lets you plot genes and transcripts with their human-readable names as well.
Any non-data header lines must begin with #
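The requirements above can be checked by hand with a few lines of Python. The following is a minimal sketch and not Swan's own validator: the function name check_gtf_lines and the example lines are hypothetical and made up for illustration, and the check assumes attribute values in column 9 are double-quoted.

```python
import re

# Hypothetical helper, not part of Swan: checks the GTF requirements listed above.
def check_gtf_lines(lines):
    """Return a list of problems found in an iterable of GTF lines."""
    problems = []
    features = set()
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and '#' header lines carry no data
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(f"line {n}: expected 9 tab-separated columns, found {len(fields)}")
            continue
        feature, attributes = fields[2], fields[8]
        features.add(feature)
        if feature in ("transcript", "exon"):
            # Assumption: attributes look like  key "value"; key "value";
            keys = set(re.findall(r'(\w+)\s+"[^"]*"', attributes))
            for required in ("gene_id", "transcript_id"):
                if required not in keys:
                    problems.append(f"line {n}: {feature} entry lacks {required}")
    for required in ("transcript", "exon"):
        if required not in features:
            problems.append(f"no {required} features found")
    return problems

# Made-up lines for illustration only; not taken from a real annotation.
example = [
    "# illustrative header line\n",
    'chr1\tTEST\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tTEST\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
print(check_gtf_lines(example))
```

An empty list means none of the checked requirements was violated. Like any check of this kind, it says nothing about whether IDs match across datasets.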
Here is an example of what the first few lines of a GTF should look like:
If you are having trouble with your GTF, Swan includes a quick GTF validator which can tell you if your file seems to have an unconventional header or lacks entries needed to run Swan. It cannot tell you if your gene/transcript names/ids match across datasets, or if your exon entries are in the correct order after the corresponding transcript entry. The validator can be run as follows:
Abundance matrix
Swan can load abundance information for more meaningful analysis and visualizations. To work with Swan, abundance matrices must:
Be tab-separated
Have transcript IDs in the first column that match those loaded via GTF or TALON db
Have the remaining columns labelled by dataset name, each containing the raw counts for each transcript
Alternatively, a TALON abundance file can be used in its unaltered form
Sample abundance file:
ENST00000416931.1	0	1
ENST00000414273.1	0	2
ENST00000621981.1	0	0
ENST00000514057.1	0	1
ENST00000411249.1	0	0
ENST00000445118.6	1	0
ENST00000441765.5	0	0
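As a rough illustration of the abundance matrix requirements, the sketch below reads a tab-separated matrix and reports rows that break them. It is not Swan's loader. The helper name check_abundance, the first-column label transcript_id, and the assumption that raw counts are written as non-negative integers are all hypothetical.

```python
import csv
import io

# Hypothetical helper, not part of Swan.
def check_abundance(handle, known_transcript_ids):
    """Check a tab-separated abundance matrix; return (dataset_names, problems)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    datasets = header[1:]  # every column after the first is treated as a dataset
    problems = []
    if not datasets:
        problems.append("no dataset columns found (is the file tab-separated?)")
    for row in reader:
        if not row:
            continue  # skip blank lines
        tid, counts = row[0], row[1:]
        if tid not in known_transcript_ids:
            problems.append(f"{tid}: transcript ID not in the loaded annotation")
        if len(counts) != len(datasets):
            problems.append(f"{tid}: expected {len(datasets)} counts, found {len(counts)}")
        elif not all(c.isdigit() for c in counts):
            # Assumption: raw counts are written as non-negative integers.
            problems.append(f"{tid}: raw counts should be non-negative integers")
    return datasets, problems

text = (
    "transcript_id\tdataset1\tdataset2\n"  # first-column label is an assumption
    "ENST00000416931.1\t0\t1\n"
    "ENST00000414273.1\t0\t2\n"
)
known = {"ENST00000416931.1", "ENST00000414273.1"}
datasets, problems = check_abundance(io.StringIO(text), known)
print(datasets, problems)
```

In practice, known_transcript_ids would be the set of transcript IDs from the GTF or TALON db that was loaded first.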
AnnData
AnnData objects used to add expression data and metadata must:
Have the transcript ID from the loaded transcriptome / annotation as the index of the AnnData.var table
Have the dataset name as the index of the AnnData.obs table
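One way to sanity-check these two indices before handing an AnnData to Swan is sketched below. The helper check_adata_indices is hypothetical and relies only on the var.index and obs.index attributes. So that the snippet runs without the anndata package installed, it uses a toy stand-in object instead of a real AnnData.

```python
from types import SimpleNamespace

# Hypothetical helper, not part of Swan or anndata.
def check_adata_indices(adata, transcript_ids, dataset_names):
    """Compare the AnnData.var and AnnData.obs indices against expected names."""
    var_ids = set(adata.var.index)
    obs_ids = set(adata.obs.index)
    return {
        # transcript IDs in AnnData.var that the loaded annotation does not know
        "transcripts_not_in_annotation": sorted(var_ids - set(transcript_ids)),
        # expected dataset names that are absent from AnnData.obs
        "datasets_missing_from_obs": sorted(set(dataset_names) - obs_ids),
    }

# Toy stand-in for an AnnData object, for illustration only.
toy = SimpleNamespace(
    var=SimpleNamespace(index=["ENST00000416931.1", "ENST00000414273.1"]),
    obs=SimpleNamespace(index=["dataset1", "dataset2"]),
)
report = check_adata_indices(
    toy,
    transcript_ids=["ENST00000416931.1", "ENST00000414273.1"],
    dataset_names=["dataset1", "dataset2"],
)
print(report)
```

Two empty lists in the report mean both indices line up with the expected names.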
TALON db
Swan currently works with TALON databases created with TALON v5.0+.
Metadata file
Metadata files must:
Contain a column labeled dataset whose entries correspond to the datasets from an already-added abundance file
Be tab-separated
Sample metadata file (corresponds to above abundance file):
dataset1	M	heart
dataset2	F	liver
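The metadata requirements can be checked in the same spirit. This is a minimal sketch under stated assumptions, not Swan's own code: check_metadata is a hypothetical helper, and the column labels sex and tissue in the example are guesses added for illustration. Only the dataset column is required by the rules above.

```python
import csv
import io

# Hypothetical helper, not part of Swan.
def check_metadata(handle, abundance_datasets):
    """Return a list of problems found in a tab-separated metadata file."""
    reader = csv.DictReader(handle, delimiter="\t")
    if "dataset" not in (reader.fieldnames or []):
        return ["no column labeled 'dataset' (is the file tab-separated?)"]
    problems = []
    for row in reader:
        # Each entry must name a dataset from the abundance file added earlier.
        if row["dataset"] not in abundance_datasets:
            problems.append(f"{row['dataset']}: not among the datasets already added")
    return problems

metadata_text = (
    "dataset\tsex\ttissue\n"  # 'sex' and 'tissue' labels are assumptions
    "dataset1\tM\theart\n"
    "dataset2\tF\tliver\n"
)
print(check_metadata(io.StringIO(metadata_text), {"dataset1", "dataset2"}))
```

An empty list means every metadata row refers to a dataset that was already added.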