File format specifications
File formats in bioinformatics are notoriously hard to standardize. We hope that this documentation gives you a clear idea of what is needed as input to Swan.
In Swan, transcript models are loaded from GTFs. To work with Swan, GTFs must adhere to the following specifications:
Must contain both transcript and exon features. We would like to remove this dependency in the future, but for now it is required.
Must include gene_id and transcript_id attributes in column 9 of every transcript and exon entry.
Recommended: including the transcript_name and gene_name fields lets you plot genes and transcripts with their human-readable names as well.
Any non-data header lines must begin with #
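The requirements above can be approximated with a small standalone check. This is only a sketch and not Swan's own validator: the function name check_gtf_lines and the example GTF lines are made up for illustration, and it assumes the usual 9-column, tab-separated GTF layout with key "value"; attribute pairs.

```python
import re

# Hypothetical helper (not part of Swan): checks the GTF requirements listed above.
def check_gtf_lines(lines):
    """Return a list of human-readable problems found in an iterable of GTF lines."""
    problems = []
    features = set()
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and '#' header lines carry no data
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(f"line {n}: expected 9 tab-separated columns, found {len(fields)}")
            continue
        feature, attributes = fields[2], fields[8]
        features.add(feature)
        if feature in ("transcript", "exon"):
            # Both IDs must be present in column 9 for transcript and exon entries.
            for key in ("gene_id", "transcript_id"):
                if not re.search(rf'\b{key} "[^"]+"', attributes):
                    problems.append(f"line {n}: {feature} entry lacks {key}")
    for required in ("transcript", "exon"):
        if required not in features:
            problems.append(f"no {required} features found")
    return problems

# Made-up example lines; the IDs, names, and coordinates are illustrative only.
example = [
    "# made-up header line\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "geneA";\n',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; '
    'gene_name "geneA"; transcript_name "geneA-201";\n',
    'chr1\tHAVANA\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
print(check_gtf_lines(example))  # prints []
```

To check a file on disk, pass an open file handle, for example check_gtf_lines(open(path)). Like any quick check of this kind, it does not verify that exon entries follow their transcript entry or that IDs match across datasets.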
Here is an example of what the first few lines of a GTF should look like:
If you are having trouble with your GTF, Swan includes a quick GTF validator which can tell you if your file seems to have an unconventional header or lacks entries needed to run Swan. It cannot tell you if your gene/transcript names/ids match across datasets, or if your exon entries are in the correct order after the corresponding transcript entry. The validator can be run as follows:
Swan can load abundance information for more meaningful analysis and visualizations. To work with Swan, abundance matrices must:
Be tab-separated
Have transcript IDs in the first column that match those loaded via GTF or TALON db
Have the remaining columns labelled by their dataset names and containing raw counts for each transcript
Sample abundance file:
ENST00000416931.1    0    1
ENST00000414273.1    0    2
ENST00000621981.1    0    0
ENST00000514057.1    0    1
ENST00000411249.1    0    0
ENST00000445118.6    1    0
ENST00000441765.5    0    0
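As a rough illustration of the abundance requirements, the sketch below reads a tab-separated matrix with Python's standard csv module and reports rows that break them. It is not Swan code: the function name check_abundance is hypothetical, the header names transcript_id, dataset1, and dataset2 are assumed for the example, and "raw counts" is interpreted here as non-negative whole numbers.

```python
import csv
import io

# Hypothetical helper (not part of Swan): checks an abundance matrix against
# the requirements listed above.
def check_abundance(handle, known_transcript_ids):
    """Return (dataset_names, problems) for a tab-separated abundance matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None or len(header) < 2:
        return [], ["expected a transcript ID column plus at least one dataset column"]
    datasets = header[1:]
    problems = []
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(f"row {row_number}: expected {len(header)} columns, found {len(row)}")
            continue
        transcript_id = row[0]
        if transcript_id not in known_transcript_ids:
            problems.append(f"row {row_number}: {transcript_id} was not loaded from the annotation")
        for dataset, value in zip(datasets, row[1:]):
            try:
                count = float(value)
            except ValueError:
                problems.append(f"row {row_number}: {dataset} value {value!r} is not numeric")
                continue
            # Assumption: raw counts are non-negative whole numbers.
            if count < 0 or not count.is_integer():
                problems.append(f"row {row_number}: {dataset} value {value!r} is not a raw count")
    return datasets, problems

# First rows of the sample above; the header line is an assumption.
sample = io.StringIO(
    "transcript_id\tdataset1\tdataset2\n"
    "ENST00000416931.1\t0\t1\n"
    "ENST00000414273.1\t0\t2\n"
    "ENST00000621981.1\t0\t0\n"
)
loaded_ids = {"ENST00000416931.1", "ENST00000414273.1", "ENST00000621981.1"}
datasets, problems = check_abundance(sample, loaded_ids)
print(datasets, problems)  # prints ['dataset1', 'dataset2'] []
```

In practice the set of known transcript IDs would come from whatever annotation was loaded; here it is simply typed in by hand.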
AnnData objects used to add expression and metadata must:
Have the transcript ID from the loaded transcriptome / annotation as the index of the AnnData.var table
Have the dataset name as the index of the AnnData.obs table
Swan currently works with TALON databases created with TALON v5.0+
Metadata files must:
Contain a column labeled dataset whose entries correspond to the datasets from an already-added abundance file
Be tab-separated
Sample metadata file (corresponds to above abundance file):
dataset1    M    heart
dataset2    F    liver
Alternatively, a can be used in its unaltered form
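The metadata requirements can be sketched the same way. The snippet below is an illustration rather than Swan's implementation: check_metadata is a hypothetical name, and the column names sex and tissue are assumptions chosen to fit the sample values. Only the dataset column is required by the rules above.

```python
import csv
import io

# Hypothetical helper (not part of Swan): checks a metadata file against the
# dataset names found in an already-added abundance file.
def check_metadata(handle, abundance_datasets):
    """Return a list of problems found in a tab-separated metadata file."""
    reader = csv.DictReader(handle, delimiter="\t")
    if "dataset" not in (reader.fieldnames or []):
        return ["no column labeled 'dataset'"]
    problems = []
    for row_number, row in enumerate(reader, start=2):
        name = row["dataset"]
        if name not in abundance_datasets:
            problems.append(f"row {row_number}: dataset {name!r} is not in the abundance file")
    return problems

# Sample metadata from above; the 'sex' and 'tissue' column names are assumed.
metadata = io.StringIO(
    "dataset\tsex\ttissue\n"
    "dataset1\tM\theart\n"
    "dataset2\tF\tliver\n"
)
print(check_metadata(metadata, ["dataset1", "dataset2"]))  # prints []
```

Running a check like this before adding metadata catches the most common mismatch, a dataset name that is spelled differently in the two files.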