🦢
Swan
  • Swan
  • Tutorials
    • Data preprocessing with TALON
    • Getting started
    • Analysis
    • Visualization
    • Scanpy compatibility
  • FAQs
    • Understanding Swan visualizations
    • Additional utilities
    • SwanGraph data structure
    • File format specifications
  • Code documentation
    • swan_vis.SwanGraph()
    • swan_vis.utils
Powered by GitBook
On this page
  • Table of contents
  • GTF
  • Abundance matrix
  • AnnData
  • TALON db
  • Metadata file

Was this helpful?

  1. FAQs

File format specifications

File formats in bioinformatics are notoriously hard to standardize. We hope that this documentation provides the user with a clear idea of what is need as input into Swan.

Table of contents

  • GTF

  • Abundance matrix

  • AnnData

  • TALON db

GTF

In Swan, transcript models are loaded from GTFs. To work with Swan, GTFs must adhere to the following specifications:

  • Must contain both transcript and exon features - this is a dependency we would like to remove in the future but for now this is the way it works

  • gene_id and transcript_id attributes (for transcripts and exons) in column 9.

  • Recommended: including the transcript_name and gene_name field will enable you to plot genes and transcript with their human-readable names as well

  • Any non-data header lines must begin with #

Here is an example of what the first few lines of a GTF should look like:

##description: evidence-based annotation of the human genome (GRCh38), version 29 (Ensembl 94)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2018-08-30
chr1    HAVANA    gene    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA    exon    11869    12227    .    +    .    gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

If you are having trouble with your GTF, Swan includes a quick GTF validator which can tell you if your file seems to have an unconventional header or lacks entries needed to run Swan. It cannot tell you if your gene/transcript names/ids match across datasets, or if your exon entries are in the correct order after the corresponding transcript entry. The validator can be run as follows:

import swan_vis as swan
swan.validate_gtf('test.gtf')

Abundance matrix

Swan can load abundance information for more meaningful analysis and visualizations. To work with Swan, abundance matrices must:

  • Be tab-separated

  • First column are transcript IDs that are the same as those loaded via GTF or TALON db

  • Columns labelled by their dataset names containing raw counts for each transcript

  • Alternatively, a TALON abundance file can be used in its unaltered form

Sample abundance file:

transcript_id
dataset1
dataset2

ENST00000416931.1

0

1

ENST00000414273.1

0

2

ENST00000621981.1

0

0

ENST00000514057.1

0

1

ENST00000411249.1

0

0

ENST00000445118.6

1

0

ENST00000441765.5

0

0

AnnData

AnnDatas used to add expression and metadata must:

  • Have the transcript ID from the loaded transcriptome / annotation as the index of the AnnData.var table

  • Have the dataset name as the index of the AnnData.obs table

TALON db

Swan currently works with TALON databases created with TALON v5.0+

Metadata file

Metadata files must:

  • Contain a column labeled dataset whose entries correspond to the datasets from an already-added abundance file

  • Be tab-separated

Sample metadata file (corresponds to above abundance file):

dataset
sex
tissue

dataset1

M

heart

dataset2

F

liver

PreviousSwanGraph data structureNextswan_vis.SwanGraph()

Last updated 1 year ago

Was this helpful?