.. _schema:

On-Disk Data Layout
==============================

PangyPlot preprocesses a pangenome graph into a mix of SQLite databases, memory-mapped numpy arrays, and compressed binary path files. The ``S``, ``L``, ``P``, and ``W`` lines from a GFA file are parsed into the SQLite tables described below, ``Bubble`` and ``Chain`` superstructures are enumerated by BubbleGun, and 2D coordinates come from an ODGI layout TSV.

Directory Layout
~~~~~~~~~~~~~~~~

By default, the database lives in ``datastore/graphs/_default_/``. Inside that directory are chromosome-specific subdirectories (e.g. ``datastore/graphs/_default_/chr1/``), each holding:

``segments.db``
    SQLite: ``S`` line information from GFA and layout coordinates.

``links.db``
    SQLite: ``L`` line information from GFA.

``bubbles.db``
    SQLite: identified bubbles and their content.

``step_index.db``
    SQLite: reference-path step information per genome.

``*.mmapindex/``
    Memory-mapped numpy array indexes for fast startup.

``paths/``
    Compressed per-haplotype step sequences (``.binpath``) and a JSON index.

``skeleton/``
    Chromosome-scale polylines and spines (generated at server startup).
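
The layout above can be checked with a few lines of Python. This is a minimal sketch, not part of PangyPlot: the helper name ``inspect_chromosome_dir`` is hypothetical, and only the file and directory names are taken from the list above. Because ``skeleton/`` is generated at server startup, its absence right after preprocessing is not necessarily an error.

```python
from pathlib import Path

# File and directory names come from the directory layout described above.
EXPECTED_FILES = ("segments.db", "links.db", "bubbles.db", "step_index.db")
EXPECTED_DIRS = ("paths", "skeleton")


def inspect_chromosome_dir(chrom_dir):
    """Report which expected artifacts exist in one chromosome directory.

    Hypothetical helper for illustration; returns a dict mapping each
    artifact name to True (present) or False (missing).
    """
    chrom_dir = Path(chrom_dir)
    report = {name: (chrom_dir / name).is_file() for name in EXPECTED_FILES}
    report.update({name + "/": (chrom_dir / name).is_dir() for name in EXPECTED_DIRS})
    # Any directory whose name ends in ".mmapindex" counts as a memory-mapped index.
    report["*.mmapindex/"] = any(p.is_dir() for p in chrom_dir.glob("*.mmapindex"))
    return report
```

For example, ``inspect_chromosome_dir("datastore/graphs/_default_/chr1")`` returns one boolean per artifact for the ``chr1`` directory.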
SQLite Databases
~~~~~~~~~~~~~~~~

Four SQLite databases hold the authoritative data for each chromosome. Everything else in the chromosome directory is derived from these files plus the BubbleGun output.

segments.db
-----------

.. list-table::
   :header-rows: 1
   :widths: 20 10 70

   * - Property
     - Type
     - Description
   * - **id**
     - integer
     - Primary key. Segment identifier.
   * - **gc_count**
     - integer
     - Number of ``G`` or ``C`` bases in the DNA sequence.
   * - **n_count**
     - integer
     - Number of ambiguous bases (``N``) in the DNA sequence.
   * - **length**
     - integer
     - Length of the DNA sequence.
   * - **x1**
     - real
     - Layout x-coordinate for the start position.
   * - **y1**
     - real
     - Layout y-coordinate for the start position.
   * - **x2**
     - real
     - Layout x-coordinate for the end position.
   * - **y2**
     - real
     - Layout y-coordinate for the end position.
   * - **seq**
     - text
     - DNA sequence (empty string if the node represents a deletion).

links.db
-----------

.. list-table::
   :header-rows: 1
   :widths: 20 10 70

   * - Property
     - Type
     - Description
   * - **id**
     - text
     - Primary key. Unique identifier constructed from ``from_id + from_strand + to_id + to_strand``.
   * - **from_id**
     - integer
     - Source segment ID.
   * - **from_strand**
     - text
     - Orientation of the source segment (``+`` or ``-``).
   * - **to_id**
     - integer
     - Target segment ID.
   * - **to_strand**
     - text
     - Orientation of the target segment (``+`` or ``-``).
   * - **haplotype**
     - text
     - Set of paths that include this link.
   * - **reverse**
     - text
     - Companion to ``haplotype``: whether the link is traversed in reverse.
   * - **frequency**
     - real
     - Fraction of samples that include this link.

bubbles.db
-----------

.. list-table::
   :header-rows: 1
   :widths: 20 10 70

   * - Property
     - Type
     - Description
   * - **id**
     - integer
     - Primary key. Bubble identifier.
   * - **chain**
     - integer
     - Identifier of the bubble chain this bubble belongs to.
   * - **chain_step**
     - integer
     - Position of this bubble within its chain.
   * - **subtype**
     - text
     - Bubble subtype.
   * - **parent**
     - integer
     - ID of the parent bubble (``NULL`` if root).
   * - **children**
     - text (JSON)
     - List of child bubble IDs.
   * - **siblings**
     - text (JSON)
     - List of sibling bubble IDs.
   * - **source**
     - text (JSON)
     - List of source segment IDs.
   * - **sink**
     - text (JSON)
     - List of sink segment IDs.
   * - **inside**
     - text (JSON)
     - List of internal segment IDs.
   * - **range_exclusive**
     - text (JSON)
     - Exclusive range of segment IDs between source and sink.
   * - **range_inclusive**
     - text (JSON)
     - Inclusive range of segment IDs from source to sink.
   * - **length**
     - integer
     - Cumulative length of the bubble in bases.
   * - **gc_count**
     - integer
     - Number of ``G`` or ``C`` bases inside the bubble.
   * - **n_count**
     - integer
     - Number of ambiguous bases (``N``) inside the bubble.
   * - **x1**
     - float
     - Layout x-coordinate (start).
   * - **x2**
     - float
     - Layout x-coordinate (end).
   * - **y1**
     - float
     - Layout y-coordinate (start).
   * - **y2**
     - float
     - Layout y-coordinate (end).
   * - **link_data**
     - text (JSON)
     - Links that connect directly to the bubble.

step_index.db
---------------

.. list-table::
   :header-rows: 1
   :widths: 20 10 70

   * - Property
     - Type
     - Description
   * - **step**
     - integer
     - Step index along the genome path (0-based).
   * - **genome**
     - text
     - Genome name. Together with ``step`` forms the primary key.
   * - **seg_id**
     - integer
     - Segment ID associated with this step.
   * - **start**
     - integer
     - Start coordinate of the segment on this genome path (1-based).
   * - **end**
     - integer
     - End coordinate of the segment on this genome path.

Memory-Mapped Indexes (``*.mmapindex/``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The hot subset of each SQLite table is replicated into a directory of numpy ``.npy`` files alongside a ``meta.json`` describing the dataset version and row counts. These arrays are memory-mapped at startup, so querying segment, link, bubble, or step properties is an O(1) array lookup that does not go through SQLite.
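
The memory-mapping approach described above can be illustrated with plain numpy. The sketch below is not PangyPlot's loader: the function names, the one-``.npy``-file-per-property naming, and the ``meta.json`` keys (``version``, ``rows``) are assumptions made for illustration. Only ``numpy.save`` and ``numpy.load(..., mmap_mode="r")`` are real numpy calls.

```python
import json
from pathlib import Path

import numpy as np


def write_index(index_dir, arrays, version):
    """Write each array to <name>.npy plus a meta.json (assumed layout)."""
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    for name, arr in arrays.items():
        np.save(index_dir / f"{name}.npy", np.asarray(arr))
    # The key names "version" and "rows" are illustrative assumptions.
    meta = {"version": version, "rows": {k: len(v) for k, v in arrays.items()}}
    (index_dir / "meta.json").write_text(json.dumps(meta))


def load_index(index_dir):
    """Memory-map every .npy file in the directory without reading it fully."""
    index_dir = Path(index_dir)
    meta = json.loads((index_dir / "meta.json").read_text())
    # mmap_mode="r" returns a read-only np.memmap backed by the file on disk,
    # so pages are only read when an element is actually accessed.
    arrays = {p.stem: np.load(p, mmap_mode="r") for p in sorted(index_dir.glob("*.npy"))}
    return meta, arrays
```

With arrays indexed by segment ID, a property lookup is then a single array access such as ``arrays["length"][seg_id]``, which is where the constant-time behaviour comes from.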
``segments.mmapindex/``
    ``length``, ``gc_count``, ``x1``, ``y1``, ``x2``, ``y2``, ``valid``

``links.mmapindex/``
    ``from_ids``, ``to_ids``, ``from_strands``, ``to_strands``, plus a CSR-style adjacency built from ``seg_index_flat``, ``seg_index_offsets``, ``seg_index_counts`` for fast neighbor lookups.

``bubbles.mmapindex/``
    ``ids``, ``start_steps``, ``end_steps``, ``bubble_to_parent``, ``segment_to_bubble`` (reverse lookup), and a compact layout representation in ``layout_ids``, ``layout_x1``, ``layout_x2``.

``steps.mmapindex/``
    ``starts``, ``ends``, ``segments``: sorted arrays used for basepair-to-segment lookup on the reference path.

Each ``meta.json`` also records the PangyPlot ``version`` that wrote the index. Indexes stamped with versions listed in ``pangyplot.version.COMPATIBLE_VERSIONS`` are accepted on load; otherwise the index is regenerated.

Paths (``paths/``)
~~~~~~~~~~~~~~~~~~

Each chromosome directory has a ``paths/`` subdirectory containing one ``.binpath`` file per ``P``/``W`` line from the GFA file. Each ``.binpath`` is a gzipped delta-zigzag-varint payload of the segment steps along that haplotype, typically ~20× smaller than the JSON representation used in earlier versions. See ``pangyplot/db/path_codec.py`` for the codec.

Two JSON files sit alongside the ``.binpath`` files:

``paths/index.json``
    Metadata for all paths (file name, full ID, contig, start coordinate, reference flag) keyed by sample name, plus the PangyPlot version that wrote the index.

``paths/sample_idx.json``
    Compact sample-name-to-integer mapping used by the frontend for color assignment.

Legacy JSON path files and old ``.binpath`` files with embedded headers are auto-migrated on server startup by ``pangyplot/preprocess/ensure_paths.py``.
Skeleton and Polychain (``skeleton/`` and ``polychain.mmapindex/``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unlike the files above, which are produced by ``pangyplot add``, the skeleton pipeline runs automatically on the first ``pangyplot run`` startup after a dataset is added or the PangyPlot version changes. The outputs support the chromosome-scale :ref:`chromosome-view`.

``skeleton/polylines.bin.gz``
    Gzipped binary encoding of chain polylines at multiple simplification levels.

``skeleton/meta.json.gz``
    Metadata describing what is inside ``polylines.bin.gz``, including the PangyPlot version used to generate it.

``skeleton/spine-{ref}.json.gz``
    Per-reference spine data: a linearized backbone through the graph used to anchor chain polylines on the reference genome.

``polychain-data.json.gz``
    Decomposition of chains into polychains (runs of bubble-free segments), used by the detail-tier force simulation.

``polychain.mmapindex/``
    Memory-mapped companion to ``polychain-data.json.gz`` for fast lookups.

See :ref:`rendering` for how these artifacts feed into the skeleton and detail rendering tiers.

Annotations
~~~~~~~~~~~

Annotations (genomic features such as genes, transcripts, and exons) are stored in genome-specific folders under ``datastore/annotations/{ref}/{name}/``. Inside each folder is a SQLite database that roughly follows the GFF3 specification.

annotations.db
---------------

.. list-table::
   :header-rows: 1
   :widths: 20 10 70

   * - Property
     - Type
     - Description
   * - **id**
     - text
     - Primary key. Unique identifier for the annotation feature.
   * - **type**
     - text
     - Feature type (e.g., gene, transcript, exon).
   * - **chrom**
     - text
     - Chromosome or contig name.
   * - **start**
     - integer
     - Genomic start coordinate (1-based, inclusive).
   * - **end**
     - integer
     - Genomic end coordinate (1-based, inclusive).
   * - **strand**
     - text
     - Feature strand: ``+`` or ``-``.
   * - **source**
     - text
     - Origin of the annotation (e.g., GENCODE, RefSeq).
   * - **gene_name**
     - text
     - Associated gene symbol/name.
   * - **exon_number**
     - integer
     - Exon number (if the feature is an exon).
   * - **parent**
     - text
     - Parent feature ID (e.g., transcript for an exon).
   * - **tag**
     - text
     - Free-form tag or attribute from the source annotation.
   * - **ensembl_canonical**
     - boolean
     - Flag indicating an Ensembl canonical transcript (default 0 = false).
   * - **mane_select**
     - boolean
     - Flag indicating a MANE Select transcript (default 0 = false).
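
As an illustration of how the columns above might be queried, the following sketch builds a small in-memory database and looks up features that overlap a region. The table name ``annotations``, the helper names, and the demo rows are assumptions for illustration; only the column names come from the table above. Because ``start`` and ``end`` are 1-based and inclusive, two intervals overlap when each one starts no later than the other ends.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a subset of the documented columns."""
    conn = sqlite3.connect(":memory:")
    # The table name "annotations" is an assumption; "start"/"end" are quoted
    # because END is an SQL keyword.
    conn.execute(
        'CREATE TABLE annotations (id TEXT PRIMARY KEY, type TEXT, chrom TEXT, '
        '"start" INTEGER, "end" INTEGER, strand TEXT, gene_name TEXT)'
    )
    rows = [  # made-up demo features, not real annotation data
        ("g1", "gene", "chr1", 100, 500, "+", "DEMO1"),
        ("e1", "exon", "chr1", 100, 200, "+", "DEMO1"),
        ("g2", "gene", "chr1", 900, 1200, "-", "DEMO2"),
    ]
    conn.executemany("INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def features_overlapping(conn, chrom, start, end, feature_type=None):
    """Return (id, type, start, end) for features overlapping [start, end]."""
    sql = (
        'SELECT id, type, "start", "end" FROM annotations '
        'WHERE chrom = ? AND "start" <= ? AND "end" >= ?'
    )
    params = [chrom, end, start]
    if feature_type is not None:
        sql += " AND type = ?"
        params.append(feature_type)
    sql += ' ORDER BY "start", id'
    return conn.execute(sql, params).fetchall()
```

Passing ``feature_type="gene"`` restricts the result to gene-level features, which mirrors the ``type`` column described above.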