On-Disk Data Layout

PangyPlot preprocesses a pangenome graph into a mix of SQLite databases, memory-mapped numpy arrays, and compressed binary path files. The S, L, P, and W lines of a GFA file are parsed into the SQLite tables and path files described below; bubble and chain superstructures are enumerated by BubbleGun; and 2D coordinates come from an ODGI layout TSV.

Directory Layout

By default, the database lives in datastore/graphs/_default_/. Inside that directory are chromosome-specific subdirectories (e.g. datastore/graphs/_default_/chr1/), each holding:

segments.db
SQLite — S line information from GFA and layout coordinates.
links.db
SQLite — L line information from GFA.
bubbles.db
SQLite — identified bubbles and their content.
step_index.db
SQLite — reference-path step information per genome.
*.mmapindex/
Memory-mapped numpy array indexes for fast startup.
paths/
Compressed per-haplotype step sequences (.binpath) and a JSON index.
skeleton/
Chromosome-scale polylines and spines (generated at server startup).
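
As a quick sanity check of the layout listed above, the following sketch reports which of the documented entries are present in one chromosome directory. The helper name describe_chromosome_dir is hypothetical and not part of PangyPlot; the demo builds a throwaway directory rather than touching a real datastore.

```python
from pathlib import Path
import tempfile

# Entries the list above names for each chromosome directory.
EXPECTED_FILES = ("segments.db", "links.db", "bubbles.db", "step_index.db")
EXPECTED_DIRS = ("paths", "skeleton")


def describe_chromosome_dir(chrom_dir: Path) -> dict:
    """Report which documented entries exist in one chromosome directory."""
    report = {name: (chrom_dir / name).is_file() for name in EXPECTED_FILES}
    report.update({name + "/": (chrom_dir / name).is_dir() for name in EXPECTED_DIRS})
    # *.mmapindex/ is a pattern, so collect whichever index directories exist.
    report["mmapindex"] = sorted(
        p.name for p in chrom_dir.glob("*.mmapindex") if p.is_dir()
    )
    return report


# Demo against a temporary directory that mimics part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    chr1 = Path(tmp) / "datastore" / "graphs" / "_default_" / "chr1"
    (chr1 / "paths").mkdir(parents=True)
    (chr1 / "segments.mmapindex").mkdir()
    (chr1 / "segments.db").touch()
    demo_report = describe_chromosome_dir(chr1)

print(demo_report)
```

A missing skeleton/ entry is not necessarily an error, since that directory is generated at server startup rather than during preprocessing.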

SQLite Databases

Four SQLite databases hold the authoritative data for each chromosome. Everything else in the chromosome directory is derived from these files plus the BubbleGun output.

segments.db

Property  Type     Description
id        integer  Primary key. Segment identifier.
gc_count  integer  Number of G or C bases in the DNA sequence.
n_count   integer  Number of ambiguous bases (N) in the DNA sequence.
length    integer  Length of the DNA sequence.
x1        real     Layout x-coordinate for the start position.
y1        real     Layout y-coordinate for the start position.
x2        real     Layout x-coordinate for the end position.
y2        real     Layout y-coordinate for the end position.
seq       text     DNA sequence (empty string if node represents a deletion).
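
The columns above are enough to derive per-segment statistics without touching the sequence itself. The sketch below computes a GC fraction from gc_count, n_count, and length. It runs against an in-memory database; the table name segments and the helper gc_fraction are assumptions made for illustration, and only the column names come from the schema above.

```python
import sqlite3

# Assumption: the table is called "segments". Only the column names are
# taken from the schema above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE segments (
           id INTEGER PRIMARY KEY, gc_count INTEGER, n_count INTEGER,
           length INTEGER, x1 REAL, y1 REAL, x2 REAL, y2 REAL, seq TEXT)"""
)
conn.executemany(
    "INSERT INTO segments VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 2, 0, 4, 0.0, 0.0, 4.0, 0.0, "ACGT"),
        (2, 0, 0, 0, 4.0, 0.0, 4.0, 1.0, ""),  # deletion node: empty seq
        (3, 3, 1, 5, 4.0, 1.0, 9.0, 1.0, "GCNGA"),
    ],
)


def gc_fraction(conn, seg_id):
    """GC fraction over unambiguous bases, or None if there are none."""
    row = conn.execute(
        "SELECT gc_count, n_count, length FROM segments WHERE id = ?",
        (seg_id,),
    ).fetchone()
    if row is None:
        return None
    gc, n, length = row
    called = length - n  # bases that are not N
    return gc / called if called > 0 else None


print(gc_fraction(conn, 1), gc_fraction(conn, 2), gc_fraction(conn, 3))
```

Excluding N bases from the denominator is a design choice of this sketch; dividing by length instead is equally easy if ambiguous bases should count.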

bubbles.db

Property         Type         Description
id               integer      Primary key. Bubble identifier.
chain            integer      Identifier of the bubble chain this bubble belongs to.
chain_step       integer      Position/order of this bubble within its chain.
subtype          text         Bubble subtype.
parent           integer      ID of parent bubble (NULL if root).
children         text (JSON)  List of child bubble IDs.
siblings         text (JSON)  List of sibling bubble IDs.
source           text (JSON)  List of source segment IDs.
sink             text (JSON)  List of sink segment IDs.
inside           text (JSON)  List of internal segment IDs.
range_exclusive  text (JSON)  Exclusive range of segment IDs between source and sink.
range_inclusive  text (JSON)  Inclusive range of segment IDs from source to sink.
length           integer      Cumulative length of bubble in bases.
gc_count         integer      Number of G or C bases inside the bubble.
n_count          integer      Number of ambiguous bases (N) inside the bubble.
x1               float        Layout x-coordinate (start).
x2               float        Layout x-coordinate (end).
y1               float        Layout y-coordinate (start).
y2               float        Layout y-coordinate (end).
link_data        text (JSON)  Links connecting to the bubble directly.
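
Because several columns hold JSON text, a reader has to decode them after each query. The sketch below shows one way to do that and to walk the parent column up to the root bubble. It uses an in-memory database with a subset of the columns; the table name bubbles and the helpers load_bubble and ancestors are hypothetical, and the sketch assumes the JSON columns are never NULL.

```python
import json
import sqlite3

# Columns documented above as "text (JSON)" that hold lists of IDs.
JSON_COLUMNS = ("children", "siblings", "source", "sink", "inside")

# Assumption: the table is called "bubbles"; only a subset of columns is used.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """CREATE TABLE bubbles (
           id INTEGER PRIMARY KEY, parent INTEGER,
           children TEXT, siblings TEXT, source TEXT, sink TEXT, inside TEXT)"""
)
conn.executemany(
    "INSERT INTO bubbles VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (10, None, "[11]", "[]", "[100]", "[104]", "[101, 102, 103]"),
        (11, 10, "[]", "[]", "[101]", "[103]", "[102]"),
    ],
)


def load_bubble(conn, bubble_id):
    """Fetch one bubble and decode its JSON list columns."""
    row = conn.execute(
        "SELECT * FROM bubbles WHERE id = ?", (bubble_id,)
    ).fetchone()
    if row is None:
        return None
    bubble = dict(row)
    for column in JSON_COLUMNS:
        bubble[column] = json.loads(bubble[column])
    return bubble


def ancestors(conn, bubble_id):
    """Follow parent links until a root bubble (parent is NULL) is reached."""
    lineage = []
    bubble = load_bubble(conn, bubble_id)
    while bubble is not None and bubble["parent"] is not None:
        lineage.append(bubble["parent"])
        bubble = load_bubble(conn, bubble["parent"])
    return lineage


print(load_bubble(conn, 10)["inside"], ancestors(conn, 11))
```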

step_index.db

Property  Type     Description
step      integer  Step index along the genome path (0-based).
genome    text     Genome name. Together with step forms the primary key.
seg_id    integer  Segment ID associated with this step.
start     integer  Start coordinate of the segment on this genome path (1-based).
end       integer  End coordinate of the segment on this genome path.
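
A typical use of this table is mapping a reference position to the step and segment that cover it. The sketch below does this with a range query on an in-memory database. The table name step_index, the helper segment_at, and the genome name in the demo rows are illustrative, and the query assumes end is inclusive, which the schema above does not state.

```python
import sqlite3

# Assumption: the table is called "step_index". "start" and "end" are quoted
# because END is an SQL keyword.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE step_index (
           step INTEGER, genome TEXT, seg_id INTEGER,
           "start" INTEGER, "end" INTEGER,
           PRIMARY KEY (step, genome))"""
)
conn.executemany(
    "INSERT INTO step_index VALUES (?, ?, ?, ?, ?)",
    [
        (0, "GRCh38", 7, 1, 100),
        (1, "GRCh38", 9, 101, 150),
        (2, "GRCh38", 12, 151, 400),
    ],
)


def segment_at(conn, genome, position):
    """Return (step, seg_id) covering a 1-based position, or None."""
    return conn.execute(
        'SELECT step, seg_id FROM step_index '
        'WHERE genome = ? AND "start" <= ? AND "end" >= ? '
        'ORDER BY step LIMIT 1',
        (genome, position, position),
    ).fetchone()


print(segment_at(conn, "GRCh38", 101))
```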

Memory-Mapped Indexes (*.mmapindex/)

The hot subset of each SQLite table is replicated into a directory of numpy .npy files alongside a meta.json describing the dataset version and row counts. These arrays are memory-mapped at startup, so reading a segment, link, bubble, or step property is an O(1) array access that bypasses SQLite.
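
To illustrate the mechanism, here is a minimal sketch of writing and then memory-mapping such an index directory with numpy. It assumes each array is stored as <name>.npy inside the directory and that rows are addressed by position; the meta.json keys shown are hypothetical placeholders, not the real format.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "segments.mmapindex"
    index_dir.mkdir()

    # Write two toy property arrays, one value per row.
    np.save(index_dir / "length.npy", np.array([4, 0, 5], dtype=np.int64))
    np.save(index_dir / "gc_count.npy", np.array([2, 0, 3], dtype=np.int64))
    # Hypothetical meta.json contents; the real keys may differ.
    (index_dir / "meta.json").write_text(
        json.dumps({"version": "0.0.0", "rows": 3})
    )

    # mmap_mode="r" maps the file read-only instead of loading it into memory,
    # so opening the index costs about the same regardless of array size.
    length = np.load(index_dir / "length.npy", mmap_mode="r")
    gc_count = np.load(index_dir / "gc_count.npy", mmap_mode="r")

    row = 2
    seg_length = int(length[row])  # constant-time read by position
    seg_gc = int(gc_count[row])
    is_memmap = isinstance(length, np.memmap)

    # Release the mappings before the temporary directory is removed.
    del length, gc_count

print(seg_length, seg_gc, is_memmap)
```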

segments.mmapindex/

length, gc_count, x1, y1, x2, y2, valid

links.mmapindex/

from_ids, to_ids, from_strands, to_strands, plus a CSR-style adjacency built from seg_index_flat, seg_index_offsets, seg_index_counts for fast neighbor lookups.

bubbles.mmapindex/

ids, start_steps, end_steps, bubble_to_parent, segment_to_bubble (reverse lookup), and a compact layout representation in layout_ids, layout_x1, layout_x2.

steps.mmapindex/

starts, ends, segments — sorted arrays used for basepair-to-segment lookup on the reference path.
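
Two of the lookups described above can be sketched with plain sequences. The first reads neighbors from a CSR-style index, the second finds the segment covering a base-pair position by binary search. The interpretation of the arrays is an assumption: the sketch treats seg_index_offsets and seg_index_counts as indexed directly by segment ID, treats seg_index_flat as row numbers into from_ids and to_ids, and treats ends as inclusive. The function names are hypothetical.

```python
from bisect import bisect_right

# Toy links.mmapindex contents for three links: 1->2, 1->3, 2->3.
from_ids = [1, 1, 2]
to_ids = [2, 3, 3]
# Assumed CSR layout: the link rows touching segment s are
# seg_index_flat[offset : offset + count] for that segment.
seg_index_flat = [0, 1, 0, 2, 1, 2]
seg_index_offsets = [0, 0, 2, 4]  # position = segment ID; ID 0 is unused
seg_index_counts = [0, 2, 2, 2]


def neighbors(seg_id):
    """Segments sharing a link with seg_id, via the CSR-style index."""
    offset = seg_index_offsets[seg_id]
    count = seg_index_counts[seg_id]
    found = set()
    for row in seg_index_flat[offset:offset + count]:
        # Pick whichever end of the link is not seg_id.
        other = to_ids[row] if from_ids[row] == seg_id else from_ids[row]
        found.add(other)
    return sorted(found)


# Toy steps.mmapindex contents, sorted by start coordinate.
starts = [1, 101, 151]
ends = [100, 150, 400]
segments = [7, 9, 12]


def segment_at(position):
    """Segment whose [start, end] interval contains position, or None."""
    i = bisect_right(starts, position) - 1  # last start <= position
    if i < 0 or position > ends[i]:
        return None
    return segments[i]


print(neighbors(1), segment_at(120))
```

The same indexing works on memory-mapped numpy arrays; there, np.searchsorted(starts, position, side="right") plays the role of bisect_right.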

Each meta.json also records the PangyPlot version that wrote the index. Indexes stamped with versions listed in pangyplot.version.COMPATIBLE_VERSIONS are accepted on load; otherwise the index is regenerated.
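
The accept-or-regenerate decision can be pictured as a small predicate over meta.json. In the sketch below the "version" key, the version strings, and the function name index_is_usable are all assumptions; the local COMPATIBLE_VERSIONS set is only a stand-in for pangyplot.version.COMPATIBLE_VERSIONS, whose actual type and contents are not documented here. Treating an unreadable meta.json as stale is also a choice of the sketch.

```python
import json

# Stand-in for pangyplot.version.COMPATIBLE_VERSIONS; values are illustrative.
COMPATIBLE_VERSIONS = {"1.2.0", "1.2.1"}


def index_is_usable(meta_text, compatible=COMPATIBLE_VERSIONS):
    """True if meta.json names a compatible version, False otherwise."""
    try:
        meta = json.loads(meta_text)
    except json.JSONDecodeError:
        return False  # unreadable metadata: treat the index as stale
    if not isinstance(meta, dict):
        return False
    version = meta.get("version")  # hypothetical key name
    return isinstance(version, str) and version in compatible


print(index_is_usable('{"version": "1.2.1"}'))
```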

Paths (paths/)

Each chromosome directory has a paths/ subdirectory containing one .binpath file per P/W line from the GFA file. Each .binpath is a gzipped delta-zigzag-varint payload of the segment steps along that haplotype — typically ~20× smaller than the JSON representation used in earlier versions. See pangyplot/db/path_codec.py for the codec.
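
For readers unfamiliar with the encoding, the sketch below shows the general delta, zigzag, varint, gzip pipeline on a list of integers. It is an illustration of the technique only and is not byte-compatible with pangyplot/db/path_codec.py; in particular, how the real codec represents step orientation is not covered, and the function names are hypothetical.

```python
import gzip


def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) if z % 2 == 0 else -(z >> 1) - 1


def write_varint(value, out):
    """Append value as 7-bit groups, high bit set on all but the last byte."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return


def read_varints(data):
    """Yield every varint stored in data."""
    value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            yield value
            value = shift = 0
    if shift:
        raise ValueError("truncated varint")


def encode_path(steps):
    """Delta-encode, zigzag, varint-pack, then gzip a list of integers."""
    out = bytearray()
    previous = 0
    for step in steps:
        write_varint(zigzag(step - previous), out)
        previous = step
    return gzip.compress(bytes(out))


def decode_path(blob):
    """Reverse encode_path."""
    steps = []
    previous = 0
    for z in read_varints(gzip.decompress(blob)):
        previous += unzigzag(z)
        steps.append(previous)
    return steps


path = [100, 101, 102, 250, 249, 7]
assert decode_path(encode_path(path)) == path
```

Delta encoding pays off because consecutive steps along a haplotype often visit nearby segment IDs, so most deltas fit in a single byte before gzip is applied.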

Two JSON files sit alongside the .binpath files:

paths/index.json

Metadata for all paths (file name, full ID, contig, start coordinate, reference flag) keyed by sample name, plus the PangyPlot version that wrote the index.

paths/sample_idx.json

Compact sample-name-to-integer mapping used by the frontend for color assignment.

Legacy JSON path files and old .binpath files with embedded headers are auto-migrated on server startup by pangyplot/preprocess/ensure_paths.py.

Skeleton and Polychain (skeleton/ and polychain.mmapindex/)

Unlike the files above, which are produced by pangyplot add, the skeleton artifacts are generated automatically the first time pangyplot run starts after a dataset is added or the PangyPlot version changes. They support the chromosome-scale Chromosome View.

skeleton/polylines.bin.gz

Gzipped binary encoding of chain polylines at multiple simplification levels.

skeleton/meta.json.gz

Metadata describing what is inside polylines.bin.gz, including the PangyPlot version used to generate it.

skeleton/spine-{ref}.json.gz

Per-reference spine data — a linearized backbone through the graph used to anchor chain polylines on the reference genome.

polychain-data.json.gz

Decomposition of chains into polychains (runs of bubble-free segments), used by the detail-tier force simulation.

polychain.mmapindex/

Memory-mapped companion to polychain-data.json.gz for fast lookups.

See Rendering Architecture for how these artifacts feed into the skeleton and detail rendering tiers.

Annotations

Annotations/genomic features (e.g., genes, transcripts, exons) are stored in genome-specific folders under datastore/annotations/{ref}/{name}/. Inside each folder is a SQLite database that roughly follows the GFF3 specification.

annotations.db

Property           Type     Description
id                 text     Primary key. Unique identifier for the annotation feature.
type               text     Feature type (e.g., gene, transcript, exon).
chrom              text     Chromosome or contig name.
start              integer  Genomic start coordinate (1-based, inclusive).
end                integer  Genomic end coordinate (1-based, inclusive).
strand             text     Feature strand: + or -.
source             text     Origin of the annotation (e.g., GENCODE, RefSeq).
gene_name          text     Associated gene symbol/name.
exon_number        integer  Exon number (if feature is an exon).
parent             text     Parent feature ID (e.g., transcript for an exon).
tag                text     Free-form tag or attribute from source annotation.
ensembl_canonical  boolean  Flag indicating Ensembl canonical transcript (default 0 = false).
mane_select        boolean  Flag indicating MANE Select transcript (default 0 = false).
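
Since start and end are both 1-based and inclusive, two intervals overlap exactly when each one starts no later than the other ends. The sketch below applies that test to find features in a region, using an in-memory database with a subset of the columns. The table name annotations, the helper features_overlapping, and the demo rows are assumptions made for illustration.

```python
import sqlite3

# Assumption: the table is called "annotations". "start" and "end" are quoted
# because END is an SQL keyword.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE annotations (
           id TEXT PRIMARY KEY, type TEXT, chrom TEXT,
           "start" INTEGER, "end" INTEGER, strand TEXT,
           gene_name TEXT, parent TEXT, mane_select INTEGER DEFAULT 0)"""
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("gene1", "gene", "chr1", 1000, 5000, "+", "GENEA", None, 0),
        ("tx1", "transcript", "chr1", 1000, 5000, "+", "GENEA", "gene1", 1),
        ("gene2", "gene", "chr1", 8000, 9000, "-", "GENEB", None, 0),
    ],
)


def features_overlapping(conn, chrom, start, end, feature_type="gene"):
    """IDs of features of one type overlapping [start, end], by position."""
    rows = conn.execute(
        'SELECT id FROM annotations '
        'WHERE chrom = ? AND type = ? AND "start" <= ? AND "end" >= ? '
        'ORDER BY "start", id',
        (chrom, feature_type, end, start),
    ).fetchall()
    return [row[0] for row in rows]


print(features_overlapping(conn, "chr1", 4500, 8000))
```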