On-Disk Data Layout

PangyPlot preprocesses a pangenome graph into a mix of SQLite databases, memory-mapped numpy arrays, and compressed binary path files. The S, L, P, and W lines from a GFA file are parsed into the SQLite tables described below, Bubble and Chain superstructures are enumerated by BubbleGun, and 2D coordinates come from an ODGI layout TSV.

Directory Layout

By default, the database lives in datastore/graphs/_default_/. Inside that directory are chromosome-specific subdirectories (e.g. datastore/graphs/_default_/chr1/), each holding:

segments.db

SQLite — S line information from GFA and layout coordinates.

links.db

SQLite — L line information from GFA.

bubbles.db

SQLite — identified bubbles and their content.

step_index.db

SQLite — reference-path step information per genome.

*.mmapindex/

Memory-mapped numpy array indexes for fast startup.

paths/

Compressed per-haplotype step sequences (.binpath) and a JSON index.

skeleton/

Chromosome-scale polylines and spines (generated at server startup).

SQLite Databases

Four SQLite databases hold the authoritative data for each chromosome. Everything else in the chromosome directory is derived from these files plus the BubbleGun output.

segments.db

Property	Type	Description
id	integer	Primary key. Segment identifier.
gc_count	integer	Number of `G` or `C` bases in the DNA sequence.
n_count	integer	Number of ambiguous bases (`N`) in the DNA sequence.
length	integer	Length of the DNA sequence.
x1	real	Layout x-coordinate for the start position.
y1	real	Layout y-coordinate for the start position.
x2	real	Layout x-coordinate for the end position.
y2	real	Layout y-coordinate for the end position.
seq	text	DNA sequence (empty string if node represents a deletion).

links.db

Property	Type	Description
id	text	Primary key. Unique identifier constructed from `from_id + from_strand + to_id + to_strand`.
from_id	integer	Source segment ID.
from_strand	text	Orientation of the source segment (`+` or `-`).
to_id	integer	Target segment ID.
to_strand	text	Orientation of the target segment (`+` or `-`).
haplotype	text	Set of paths that include this link.
reverse	text	Complementary to haplotype, whether link is traversed in reverse.
frequency	real	Fraction of samples that include this link.

bubbles.db

Property	Type	Description
id	integer	Primary key. Bubble identifier.
chain	integer	Identifier of the bubble chain this bubble belongs to.
chain_step	integer	Position/order of this bubble within its chain.
subtype	text	Bubble subtype.
parent	integer	ID of parent bubble (`NULL` if root).
children	text (JSON)	List of child bubble IDs.
siblings	text (JSON)	List of sibling bubble IDs.
source	text (JSON)	List of source segment IDs.
sink	text (JSON)	List of sink segment IDs.
inside	text (JSON)	List of internal segment IDs.
range_exclusive	text (JSON)	Exclusive range of segment IDs between source and sink.
range_inclusive	text (JSON)	Inclusive range of segment IDs from source to sink.
length	integer	Cumulative length of bubble in bases.
gc_count	integer	Number of `G` or `C` bases inside the bubble.
n_count	integer	Number of ambiguous bases (`N`) inside the bubble.
x1	float	Layout x-coordinate (start).
x2	float	Layout x-coordinate (end).
y1	float	Layout y-coordinate (start).
y2	float	Layout y-coordinate (end).
link_data	text (JSON)	Links connecting to the bubble directly.

step_index.db

Property	Type	Description
step	integer	Step index along the genome path (0-based).
genome	text	Genome name. Together with `step` forms the primary key.
seg_id	integer	Segment ID associated with this step.
start	integer	Start coordinate of the segment on this genome path (1-based).
end	integer	End coordinate of the segment on this genome path.

Memory-Mapped Indexes (`*.mmapindex/`)

The hot subset of each SQLite table is replicated into a directory of numpy .npy files alongside a meta.json describing the dataset version and row counts. These arrays are memory-mapped at startup, so querying segment/link/bubble/step properties is O(1) without going through SQLite.

segments.mmapindex/: length, gc_count, x1, y1, x2, y2, valid
links.mmapindex/: from_ids, to_ids, from_strands, to_strands, plus a CSR-style adjacency built from seg_index_flat, seg_index_offsets, seg_index_counts for fast neighbor lookups.
bubbles.mmapindex/: ids, start_steps, end_steps, bubble_to_parent, segment_to_bubble (reverse lookup), and a compact layout representation in layout_ids, layout_x1, layout_x2.
steps.mmapindex/: starts, ends, segments — sorted arrays used for basepair-to-segment lookup on the reference path.

Each meta.json also records the PangyPlot version that wrote the index. Indexes stamped with versions listed in pangyplot.version.COMPATIBLE_VERSIONS are accepted on load; otherwise the index is regenerated.

Paths (`paths/`)

Each chromosome directory has a paths/ subdirectory containing one .binpath file per P/W line from the GFA file. Each .binpath is a gzipped delta-zigzag-varint payload of the segment steps along that haplotype — typically ~20× smaller than the JSON representation used in earlier versions. See pangyplot/db/path_codec.py for the codec.

Two JSON files sit alongside the .binpath files:

paths/index.json: Metadata for all paths (file name, full ID, contig, start coordinate, reference flag) keyed by sample name, plus the PangyPlot version that wrote the index.
paths/sample_idx.json: Compact sample-name-to-integer mapping used by the frontend for color assignment.

Legacy JSON path files and old .binpath files with embedded headers are auto-migrated on server startup by pangyplot/preprocess/ensure_paths.py.

Skeleton and Polychain (`skeleton/` and `polychain.mmapindex/`)

Unlike the files above — which are produced by pangyplot add — the skeleton pipeline runs automatically on the first pangyplot run startup after a dataset is added or the PangyPlot version changes. The outputs support the chromosome-scale Chromosome View.

skeleton/polylines.bin.gz: Gzipped binary encoding of chain polylines at multiple simplification levels.
skeleton/meta.json.gz: Metadata describing what is inside polylines.bin.gz, including the PangyPlot version used to generate it.
skeleton/spine-{ref}.json.gz: Per-reference spine data — a linearized backbone through the graph used to anchor chain polylines on the reference genome.
polychain-data.json.gz: Decomposition of chains into polychains (runs of bubble-free segments), used by the detail-tier force simulation.
polychain.mmapindex/: Memory-mapped companion to polychain-data.json.gz for fast lookups.

See Rendering Architecture for how these artifacts feed into the skeleton and detail rendering tiers.

Annotations

Annotations/genomic features (e.g., genes, transcripts, exons) are stored in genome-specific folders under datastore/annotations/{ref}/{name}/. Inside each folder is a SQLite database that roughly follows the GFF3 specification.

annotations.db

Property	Type	Description
id	text	Primary key. Unique identifier for the annotation feature.
type	text	Feature type (e.g., gene, transcript, exon).
chrom	text	Chromosome or contig name.
start	integer	Genomic start coordinate (1-based, inclusive).
end	integer	Genomic end coordinate (1-based, inclusive).
strand	text	Feature strand: `+` or `-`.
source	text	Origin of the annotation (e.g., GENCODE, RefSeq).
gene_name	text	Associated gene symbol/name.
exon_number	integer	Exon number (if feature is an exon).
parent	text	Parent feature ID (e.g., transcript for an exon).
tag	text	Free-form tag or attribute from source annotation.
ensembl_canonical	boolean	Flag indicating Ensembl canonical transcript (default 0 = false).
mane_select	boolean	Flag indicating MANE Select transcript (default 0 = false).