On-Disk Data Layout
PangyPlot preprocesses a pangenome graph into a mix of SQLite databases, memory-mapped numpy arrays, and compressed binary path files. The S, L, P, and W lines from a GFA file are parsed into the SQLite tables described below, Bubble and Chain superstructures are enumerated by BubbleGun, and 2D coordinates come from an ODGI layout TSV.
Directory Layout
By default, the database lives in datastore/graphs/_default_/. Inside that directory are chromosome-specific subdirectories (e.g. datastore/graphs/_default_/chr1/), each holding:
- segments.db: S line information from GFA and layout coordinates.
- links.db: L line information from GFA.
- paths/: path files (.binpath) and a JSON index.

SQLite Databases
Four SQLite databases hold the authoritative data for each chromosome. Everything else in the chromosome directory is derived from these files plus the BubbleGun output.
segments.db
| Property | Type | Description |
|---|---|---|
| id | integer | Primary key. Segment identifier. |
| gc_count | integer | Number of |
| n_count | integer | Number of ambiguous bases ( |
| length | integer | Length of the DNA sequence. |
| x1 | real | Layout x-coordinate for the start position. |
| y1 | real | Layout y-coordinate for the start position. |
| x2 | real | Layout x-coordinate for the end position. |
| y2 | real | Layout y-coordinate for the end position. |
| seq | text | DNA sequence (empty string if node represents a deletion). |
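As a rough illustration of how the precomputed counts can be used, the sketch below builds a throwaway in-memory database and derives a GC fraction per segment. The table name `segments` and the helper `gc_fraction` are hypothetical, and the sketch assumes that `gc_count` counts G and C bases and that `n_count` bases should be excluded from the denominator; neither assumption is confirmed by the schema above.

```python
import sqlite3

# Hypothetical table name "segments"; column names follow the schema table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments (id INTEGER PRIMARY KEY, gc_count INTEGER, "
    "n_count INTEGER, length INTEGER, x1 REAL, y1 REAL, x2 REAL, y2 REAL, seq TEXT)"
)
conn.executemany(
    "INSERT INTO segments VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 2, 0, 4, 0.0, 0.0, 4.0, 0.0, "ACGT"),
        (2, 0, 0, 0, 4.0, 0.0, 4.0, 1.0, ""),  # empty seq, zero length
        (3, 3, 1, 5, 4.0, 1.0, 9.0, 1.0, "GCNCA"),
    ],
)


def gc_fraction(conn, seg_id):
    """Return GC / (length - N) for one segment, or None if undefined."""
    row = conn.execute(
        "SELECT gc_count, n_count, length FROM segments WHERE id = ?",
        (seg_id,),
    ).fetchone()
    if row is None:
        return None  # unknown segment
    gc, n, length = row
    called = length - n  # bases that are not ambiguous (assumption)
    return gc / called if called > 0 else None
```

Because the counts are stored per segment, a query like this never has to scan the `seq` column.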
links.db
| Property | Type | Description |
|---|---|---|
| id | text | Primary key. Unique identifier constructed from |
| from_id | integer | Source segment ID. |
| from_strand | text | Orientation of the source segment ( |
| to_id | integer | Target segment ID. |
| to_strand | text | Orientation of the target segment ( |
| haplotype | text | Set of paths that include this link. |
| reverse | text | Complementary to haplotype, whether link is traversed in reverse. |
| frequency | real | Fraction of samples that include this link. |
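The four endpoint columns correspond to the first fields of a GFA `L` line, which is tab-separated as `L`, from-segment, from-orientation, to-segment, to-orientation, overlap. A minimal parsing sketch follows; the function name `parse_link_line` is hypothetical, and the integer conversion assumes numeric segment names, as the `integer` column types suggest. It is not PangyPlot's actual parser.

```python
def parse_link_line(line):
    """Split one GFA L line into the endpoint fields of a link."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5 or fields[0] != "L":
        raise ValueError(f"not a GFA L line: {line!r}")
    _, from_id, from_strand, to_id, to_strand = fields[:5]
    for strand in (from_strand, to_strand):
        if strand not in ("+", "-"):
            raise ValueError(f"invalid orientation: {strand!r}")
    return {
        "from_id": int(from_id),  # assumes numeric segment names
        "from_strand": from_strand,
        "to_id": int(to_id),
        "to_strand": to_strand,
    }
```

The `haplotype`, `reverse`, and `frequency` columns cannot be derived from the `L` line alone, since they depend on which paths traverse the link.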
bubbles.db
| Property | Type | Description |
|---|---|---|
| id | integer | Primary key. Bubble identifier. |
| chain | integer | Identifier of the bubble chain this bubble belongs to. |
| chain_step | integer | Position/order of this bubble within its chain. |
| subtype | text | Bubble subtype. |
| parent | integer | ID of parent bubble ( |
| children | text (JSON) | List of child bubble IDs. |
| siblings | text (JSON) | List of sibling bubble IDs. |
| source | text (JSON) | List of source segment IDs. |
| sink | text (JSON) | List of sink segment IDs. |
| inside | text (JSON) | List of internal segment IDs. |
| range_exclusive | text (JSON) | Exclusive range of segment IDs between source and sink. |
| range_inclusive | text (JSON) | Inclusive range of segment IDs from source to sink. |
| length | integer | Cumulative length of bubble in bases. |
| gc_count | integer | Number of |
| n_count | integer | Number of ambiguous bases ( |
| x1 | float | Layout x-coordinate (start). |
| x2 | float | Layout x-coordinate (end). |
| y1 | float | Layout y-coordinate (start). |
| y2 | float | Layout y-coordinate (end). |
| link_data | text (JSON) | Links connecting to the bubble directly. |
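Several bubble columns hold JSON text rather than native SQLite types, so a reader has to decode them after fetching the row. The sketch below shows one way to do that and to walk the `children` lists to collect nested bubbles. The table name `bubbles` and the helpers `load_bubble` and `descendants` are hypothetical, and only a subset of the columns is created.

```python
import json
import sqlite3

# Hypothetical table name "bubbles" with a subset of the documented columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bubbles (id INTEGER PRIMARY KEY, children TEXT, inside TEXT)"
)
conn.executemany(
    "INSERT INTO bubbles VALUES (?, ?, ?)",
    [
        (1, json.dumps([2, 3]), json.dumps([10, 11, 12, 13])),
        (2, json.dumps([]), json.dumps([11])),
        (3, json.dumps([]), json.dumps([12, 13])),
    ],
)


def load_bubble(conn, bubble_id):
    """Fetch one bubble and decode its JSON list columns."""
    row = conn.execute(
        "SELECT id, children, inside FROM bubbles WHERE id = ?", (bubble_id,)
    ).fetchone()
    if row is None:
        return None
    return {
        "id": row[0],
        "children": json.loads(row[1]),
        "inside": json.loads(row[2]),
    }


def descendants(conn, bubble_id):
    """Collect IDs of all bubbles nested below bubble_id."""
    found = []
    stack = [bubble_id]
    while stack:
        bubble = load_bubble(conn, stack.pop())
        if bubble is None:
            continue  # dangling child reference; skip it
        for child in bubble["children"]:
            found.append(child)
            stack.append(child)
    return found
```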
step_index.db
| Property | Type | Description |
|---|---|---|
| step | integer | Step index along the genome path (0-based). |
| genome | text | Genome name. Together with |
| seg_id | integer | Segment ID associated with this step. |
| start | integer | Start coordinate of the segment on this genome path (1-based). |
| end | integer | End coordinate of the segment on this genome path. |
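With steps sorted by `start`, finding the segment that covers a base-pair position is a binary search. The sketch below uses the standard-library `bisect` module on plain lists. The sample rows and the function `segment_at` are hypothetical, and the sketch assumes `end` is inclusive, which the table above does not state.

```python
from bisect import bisect_right

# (step, seg_id, start, end) rows for a single genome, sorted by start.
# Coordinates are 1-based; end is assumed to be inclusive.
steps = [
    (0, 10, 1, 4),
    (1, 11, 5, 5),
    (2, 13, 6, 12),
]
starts = [row[2] for row in steps]


def segment_at(position):
    """Return the seg_id whose [start, end] interval covers position."""
    i = bisect_right(starts, position) - 1  # last step starting at or before position
    if i < 0:
        return None  # position lies before the first step
    _step, seg_id, start, end = steps[i]
    return seg_id if start <= position <= end else None
```

The lookup costs O(log n) in the number of steps, regardless of path length in bases.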
Memory-Mapped Indexes (*.mmapindex/)
The hot subset of each SQLite table is replicated into a directory of
numpy .npy files alongside a meta.json describing the dataset
version and row counts. These arrays are memory-mapped at startup, so
querying segment/link/bubble/step properties is O(1) without going
through SQLite.
- segments.mmapindex/: length, gc_count, x1, y1, x2, y2, valid
- links.mmapindex/: from_ids, to_ids, from_strands, to_strands, plus a CSR-style adjacency built from seg_index_flat, seg_index_offsets, seg_index_counts for fast neighbor lookups.
- bubbles.mmapindex/: ids, start_steps, end_steps, bubble_to_parent, segment_to_bubble (reverse lookup), and a compact layout representation in layout_ids, layout_x1, layout_x2.
- steps.mmapindex/: starts, ends, segments; sorted arrays used for basepair-to-segment lookup on the reference path.
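To illustrate what a CSR-style adjacency looks like, the sketch below builds flat, offsets, and counts arrays from parallel `from_ids` and `to_ids` lists and uses them for neighbor lookups. It uses plain Python lists rather than numpy arrays, and the exact meaning PangyPlot gives to `seg_index_flat`, `seg_index_offsets`, and `seg_index_counts` is an assumption here: `flat` holds link row indices grouped by segment, `offsets[s]` is where segment `s` begins in `flat`, and `counts[s]` is how many entries it has.

```python
from_ids = [1, 1, 2, 3]
to_ids = [2, 3, 4, 4]


def build_csr(from_ids, to_ids, n_segments):
    """Group link indices by the segments they touch (CSR layout)."""
    size = n_segments + 1  # index arrays directly by segment ID
    counts = [0] * size
    for a, b in zip(from_ids, to_ids):
        counts[a] += 1
        if b != a:
            counts[b] += 1
    offsets = [0] * size
    running = 0
    for s in range(size):
        offsets[s] = running
        running += counts[s]
    flat = [0] * running
    cursor = list(offsets)  # next free slot per segment
    for link_idx, (a, b) in enumerate(zip(from_ids, to_ids)):
        for s in ((a,) if a == b else (a, b)):
            flat[cursor[s]] = link_idx
            cursor[s] += 1
    return flat, offsets, counts


flat, offsets, counts = build_csr(from_ids, to_ids, n_segments=4)


def neighbors(seg):
    """Return the segments connected to seg by any link."""
    result = []
    for link_idx in flat[offsets[seg]:offsets[seg] + counts[seg]]:
        a, b = from_ids[link_idx], to_ids[link_idx]
        result.append(b if a == seg else a)
    return result
```

A neighbor lookup reads one contiguous slice of `flat`, which is why this layout suits memory-mapped arrays.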
Each meta.json also records the PangyPlot version that wrote
the index. Indexes stamped with versions listed in
pangyplot.version.COMPATIBLE_VERSIONS are accepted on load;
otherwise the index is regenerated.
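The accept-or-regenerate decision can be sketched as a simple membership test. In the example below, the `"version"` key, the version strings, and the function `index_is_usable` are all hypothetical; only the idea of checking the recorded version against a list of compatible versions comes from the text above.

```python
import json

# Hypothetical stand-in for pangyplot.version.COMPATIBLE_VERSIONS.
COMPATIBLE_VERSIONS = ("1.0.0", "1.0.1")


def index_is_usable(meta_text, compatible=COMPATIBLE_VERSIONS):
    """Return True if meta.json content names a compatible version."""
    try:
        meta = json.loads(meta_text)
    except json.JSONDecodeError:
        return False  # unreadable metadata: treat the index as stale
    if not isinstance(meta, dict):
        return False
    return meta.get("version") in compatible  # "version" key is an assumption
```

A caller would regenerate the index whenever this returns `False`.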
Paths (paths/)
Each chromosome directory has a paths/ subdirectory containing one
.binpath file per P/W line from the GFA file. Each
.binpath is a gzipped delta-zigzag-varint payload of the segment
steps along that haplotype — typically ~20× smaller than the JSON
representation used in earlier versions. See
pangyplot/db/path_codec.py for the codec.
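The general technique behind a delta-zigzag-varint payload can be sketched in a few lines: store differences between consecutive segment IDs, map signed differences to unsigned integers (zigzag), write each as a variable-length byte sequence (varint), then gzip the result. The functions below are an illustrative sketch only. They ignore step orientation and do not reproduce the actual byte layout in `pangyplot/db/path_codec.py`.

```python
import gzip


def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def encode_varint(value, out):
    """Append value to out, 7 bits per byte, high bit = more bytes follow."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return


def encode_steps(steps):
    """Delta + zigzag + varint encode a list of segment IDs, then gzip."""
    out = bytearray()
    previous = 0
    for step in steps:
        encode_varint(zigzag(step - previous), out)
        previous = step
    return gzip.compress(bytes(out))


def decode_steps(payload):
    """Reverse encode_steps."""
    steps, previous, value, shift = [], 0, 0, 0
    for byte in gzip.decompress(payload):
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: keep accumulating
        else:
            previous += unzigzag(value)
            steps.append(previous)
            value, shift = 0, 0
    return steps
```

Paths through a pangenome graph tend to visit nearby segment IDs in sequence, so most deltas fit in a single byte, which is where the size reduction comes from.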
Two JSON files sit alongside the .binpath files:
- paths/index.json: Metadata for all paths (file name, full ID, contig, start coordinate, reference flag) keyed by sample name, plus the PangyPlot version that wrote the index.
- paths/sample_idx.json: Compact sample-name-to-integer mapping used by the frontend for color assignment.
Legacy JSON path files and old .binpath files with embedded headers
are auto-migrated on server startup by
pangyplot/preprocess/ensure_paths.py.
Skeleton and Polychain (skeleton/ and polychain.mmapindex/)
Unlike the files above — which are produced by pangyplot add — the
skeleton pipeline runs automatically on the first pangyplot run
startup after a dataset is added or the PangyPlot version changes. The
outputs support the chromosome-scale Chromosome View.
- skeleton/polylines.bin.gz: Gzipped binary encoding of chain polylines at multiple simplification levels.
- skeleton/meta.json.gz: Metadata describing what is inside polylines.bin.gz, including the PangyPlot version used to generate it.
- skeleton/spine-{ref}.json.gz: Per-reference spine data, a linearized backbone through the graph used to anchor chain polylines on the reference genome.
- polychain-data.json.gz: Decomposition of chains into polychains (runs of bubble-free segments), used by the detail-tier force simulation.
- polychain.mmapindex/: Memory-mapped companion to polychain-data.json.gz for fast lookups.
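The `.json.gz` artifacts can be inspected with the standard library alone. The sketch below writes and reads a gzipped JSON file in a temporary directory. The file name and the keys in `payload` are placeholders, since the real schema of these files is not described here.

```python
import gzip
import json
import os
import tempfile


def write_json_gz(path, obj):
    """Serialize obj as gzip-compressed UTF-8 JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(obj, handle)


def read_json_gz(path):
    """Load a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Placeholder content; the real files use their own (undocumented here) keys.
payload = {"version": "x.y.z", "items": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.json.gz")
    write_json_gz(example_path, payload)
    loaded = read_json_gz(example_path)
```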
See Rendering Architecture for how these artifacts feed into the skeleton and detail rendering tiers.
Annotations
Annotations/genomic features (e.g., genes, transcripts, exons) are
stored in genome-specific folders under
datastore/annotations/{ref}/{name}/. Inside each folder is a
SQLite database that roughly follows the GFF3 specification.
annotations.db
| Property | Type | Description |
|---|---|---|
| id | text | Primary key. Unique identifier for the annotation feature. |
| type | text | Feature type (e.g., gene, transcript, exon). |
| chrom | text | Chromosome or contig name. |
| start | integer | Genomic start coordinate (1-based, inclusive). |
| end | integer | Genomic end coordinate (1-based, inclusive). |
| strand | text | Feature strand: |
| source | text | Origin of the annotation (e.g., GENCODE, RefSeq). |
| gene_name | text | Associated gene symbol/name. |
| exon_number | integer | Exon number (if feature is an exon). |
| parent | text | Parent feature ID (e.g., transcript for an exon). |
| tag | text | Free-form tag or attribute from source annotation. |
| ensembl_canonical | boolean | Flag indicating Ensembl canonical transcript (default 0 = false). |
| mane_select | boolean | Flag indicating MANE Select transcript (default 0 = false). |
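Because `start` and `end` are both 1-based and inclusive, a feature overlaps a query window `[qs, qe]` exactly when `start <= qe` and `end >= qs`. The sketch below runs that test against a throwaway in-memory table. The table name `annotations`, the sample rows, and the helper `features_overlapping` are hypothetical, and only a subset of the columns is created. The `start` and `end` column names are double-quoted because `END` is an SQL keyword.

```python
import sqlite3

# Hypothetical table name "annotations" with a subset of the documented columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE annotations (id TEXT PRIMARY KEY, type TEXT, chrom TEXT, '
    '"start" INTEGER, "end" INTEGER, strand TEXT, gene_name TEXT)'
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("geneA", "gene", "chr1", 100, 200, "+", "A"),
        ("exonA1", "exon", "chr1", 100, 150, "+", "A"),
        ("geneB", "gene", "chr1", 300, 400, "-", "B"),
        ("geneC", "gene", "chr2", 100, 200, "+", "C"),
    ],
)


def features_overlapping(conn, chrom, qs, qe, feature_type="gene"):
    """Return IDs of features overlapping the inclusive window [qs, qe]."""
    rows = conn.execute(
        'SELECT id FROM annotations '
        'WHERE chrom = ? AND type = ? AND "start" <= ? AND "end" >= ? '
        'ORDER BY "start"',
        (chrom, feature_type, qe, qs),
    ).fetchall()
    return [row[0] for row in rows]
```

Note that a feature ending at exactly `qs`, or starting at exactly `qe`, still counts as overlapping under inclusive coordinates.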