PyGVAMP Development Plan

This document tracks the current development priorities and progress for cleaning up and completing the PyGVAMP codebase.

Overview

Goals:
1. Consolidate duplicate code
2. Complete missing features
3. Reduce technical debt
4. Improve maintainability

Phase 1: Critical Fixes (Blocking Issues)

These issues break the package at import time (1.1) or at runtime on machines without CUDA (1.2).

1.1 Fix Broken Config Imports

Status: ✅ Complete

Problem: pygv/config/__init__.py imports classes that don't exist:
- Line 4: Imports MetaConfig, ML3Config (commented out in base_config.py)
- Lines 6-7: Imports from presets/medium.py and presets/large.py (files don't exist)

Solution:
- [x] Uncomment MetaConfig and ML3Config in base_config.py
- [x] Create presets/medium.py with MediumSchNetConfig and MediumMetaConfig
- [x] Create presets/large.py with LargeSchNetConfig and LargeMetaConfig

1.2 Fix Hardcoded CUDA

Status: ✅ Complete

Problem: training.py:268 uses model.to('cuda') instead of the device variable, so the call fails on machines without CUDA

Solution:
- [x] Change model.to('cuda') to model.to(device) after device determination
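The device determination step can be sketched as a small helper. This is a minimal illustration, not the code in training.py: the function name pick_device and the option names "auto", "cpu", and "cuda" are assumptions, and the CUDA check is passed in as a plain boolean so the sketch runs without PyTorch installed.

```python
def pick_device(cuda_available: bool, requested: str = "auto") -> str:
    """Return the device string to pass to model.to(...).

    `requested` is one of "auto", "cpu", or "cuda" (hypothetical option names).
    """
    if requested == "cpu":
        return "cpu"
    if requested == "cuda":
        # Fail with a clear message instead of a late error inside model.to().
        if not cuda_available:
            raise RuntimeError("CUDA was requested but is not available")
        return "cuda"
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    raise ValueError(f"unknown device option: {requested!r}")


# Example: a CPU-only machine with the default "auto" setting.
device = pick_device(cuda_available=False)
print(device)  # cpu
```

In PyTorch code, cuda_available would typically come from torch.cuda.is_available(), and the returned string would be used as model.to(device).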

Phase 2: Code Consolidation

2.1 Merge Duplicate Dataset Files

Status: ✅ Complete

Original files:
- pygv/dataset/vampnet_dataset.py (base) → moved to legacy/
- pygv/dataset/vampnet_dataset_with_AA.py (amino acid variant) → moved to legacy/
- pygv/dataset/vampnet_dataset_new.py (unified version) → renamed to vampnet_dataset.py

Solution:
- [x] Review vampnet_dataset_new.py and determine if it should be the canonical version
- [x] Add use_amino_acid_encoding flag to main dataset
- [x] Add get_AA_frames() method to main dataset
- [x] Load topology lazily when AA encoding is used
- [x] Move old dataset files to legacy/ folder
- [x] Rename unified dataset to vampnet_dataset.py
- [x] Update imports in area51/area52 test files
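The lazy topology loading mentioned above could look roughly like the following. This is a sketch under stated assumptions rather than the actual dataset code: LazyTopologyDataset, residue_labels(), and the injected loader callable are hypothetical names, and the loader stands in for whatever MD library call reads the topology file.

```python
from functools import cached_property


class LazyTopologyDataset:
    """Hypothetical stand-in; not the real VAMPNetDataset."""

    def __init__(self, topology_file, use_amino_acid_encoding=False, loader=None):
        self.topology_file = topology_file
        self.use_amino_acid_encoding = use_amino_acid_encoding
        # `loader` is an injected callable (assumption) so the sketch runs
        # without an MD library; the default returns a placeholder list.
        self._loader = loader or (lambda path: [])
        self.load_count = 0  # bookkeeping for the demonstration only

    @cached_property
    def topology(self):
        # Executed on first access only; the result is cached on the instance.
        self.load_count += 1
        return self._loader(self.topology_file)

    def residue_labels(self):
        # Only the amino acid path touches the topology, so datasets that
        # do not use AA encoding never pay the loading cost.
        if not self.use_amino_acid_encoding:
            raise RuntimeError("amino acid encoding is disabled")
        return list(self.topology)


ds = LazyTopologyDataset(
    "protein.pdb",
    use_amino_acid_encoding=True,
    loader=lambda path: ["ALA", "GLY"],  # fake topology content
)
print(ds.load_count)        # 0: nothing loaded at construction time
print(ds.residue_labels())  # ['ALA', 'GLY']
print(ds.load_count)        # 1: loaded once on first use
```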

2.2 Remove Deleted Modules

Status: ✅ Complete

Files staged for deletion (already in git):
- psevo/ directory (entire module)
- viz/ directory (empty)

Solution:
- [x] Directories have been deleted

Phase 3: Feature Completion

3.1 Integrate ML3 Encoder

Status: Pending

Problem: training.py:194 sets encoder = None for the ML3 encoder type

Working code exists: pygv/encoder/ml3.py has GNNML3 class

Solution:
- [ ] Import GNNML3 in training.py
- [ ] Instantiate ML3 encoder with proper config parameters
- [ ] Add ML3-specific arguments to args_train.py
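One possible shape for the encoder selection is a small registry that raises on unknown types instead of silently producing None. The sketch below is illustrative only: StubEncoder stands in for GNNML3 (whose constructor signature is not shown in this plan), and build_encoder, ENCODER_REGISTRY, and the hidden_dim parameter are hypothetical names.

```python
class StubEncoder:
    """Stand-in for a real encoder class such as GNNML3 (assumption)."""

    def __init__(self, **params):
        # The real class would consume specific config parameters.
        self.params = params


# In training.py the "ml3" entry would point at GNNML3 from pygv/encoder/ml3.py.
ENCODER_REGISTRY = {"ml3": StubEncoder}


def build_encoder(encoder_type: str, **params):
    """Instantiate the encoder registered under `encoder_type`."""
    try:
        encoder_cls = ENCODER_REGISTRY[encoder_type.lower()]
    except KeyError:
        known = ", ".join(sorted(ENCODER_REGISTRY))
        raise ValueError(
            f"unknown encoder type {encoder_type!r}; expected one of: {known}"
        ) from None
    return encoder_cls(**params)


encoder = build_encoder("ML3", hidden_dim=32)  # hidden_dim is a placeholder
print(type(encoder).__name__)  # StubEncoder
```

Raising early keeps a missing encoder from surfacing later as an attribute error on None deep inside the training loop.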

3.2 Complete Config Presets

Status: ✅ Complete

Needed files:
- pygv/config/presets/medium.py - standard training configs
- pygv/config/presets/large.py - production training configs

Solution:
- [x] Create medium presets with balanced hyperparameters
- [x] Create large presets with higher capacity settings

3.3 Non-Continuous Trajectory Support

Status: ✅ Complete

Problem: All trajectory files were concatenated as one continuous trajectory, causing time-lagged pairs to incorrectly span across trajectory boundaries (e.g., from the end of one simulation to the start of another).

Solution implemented in vampnet_dataset_new.py (since renamed to vampnet_dataset.py, see 2.1):
- [x] Added continuous parameter to __init__() (default True for backward compatibility)
- [x] Track trajectory boundaries in _process_trajectories() via self.trajectory_boundaries
- [x] Filter cross-boundary pairs in _create_time_lagged_pairs() when continuous=False
- [x] Updated cache filename to include cont/noncont suffix
- [x] Updated cache save/load to include trajectory_boundaries and continuous config
- [x] Added continuous: bool = True to BaseConfig in base_config.py

Usage:

# Independent simulations - pairs won't cross trajectory boundaries
dataset = VAMPNetDataset(
    trajectory_files=[...],
    topology_file="protein.pdb",
    lag_time=20.0,
    continuous=False
)
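The boundary filtering can be illustrated with plain frame indices. This is a minimal sketch, not the body of _create_time_lagged_pairs(): it assumes boundaries are stored as sorted, contiguous (start, end) half-open index ranges and that the lag is already expressed in frames, neither of which is confirmed by this plan.

```python
def time_lagged_pairs(boundaries, lag_frames, continuous=True):
    """Return (t, t + lag_frames) index pairs.

    boundaries: list of (start, end) half-open frame ranges, one per
    trajectory file (assumed layout; the real trajectory_boundaries
    attribute may be stored differently).
    """
    if lag_frames < 1:
        raise ValueError("lag_frames must be at least 1")
    if not boundaries:
        return []

    if continuous:
        # Treat all files as one long trajectory.
        spans = [(boundaries[0][0], boundaries[-1][1])]
    else:
        # Build pairs inside each trajectory separately.
        spans = boundaries

    pairs = []
    for start, end in spans:
        # Both t and t + lag_frames must lie inside [start, end).
        for t in range(start, end - lag_frames):
            pairs.append((t, t + lag_frames))
    return pairs


two_runs = [(0, 5), (5, 10)]  # two independent simulations of 5 frames each
print(len(time_lagged_pairs(two_runs, 2, continuous=True)))   # 8
print(len(time_lagged_pairs(two_runs, 2, continuous=False)))  # 6
```

With continuous=False the pairs (3, 5) and (4, 6) are dropped, because they would link the end of the first simulation to the start of the second.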

Phase 4: Code Quality

4.1 Remove Unused Imports

Status: ✅ Complete

Known issues:
- training.py:10: from pymol.querying import distance is unused

Solution:
- [x] Removed unused import

4.2 Fix NaN Handling

Status: Deferred

Problem: vampnet.py replaces NaN outputs with zeros, which masks the underlying problem instead of fixing it

Solution: Investigate root cause of NaN generation
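While the root cause is being investigated, one option is to fail loudly at the point where non-finite values first appear instead of zeroing them. The following is a stdlib-only sketch of that idea; assert_finite is a hypothetical helper, not existing PyGVAMP code, and it operates on a flat sequence of floats rather than tensors.

```python
import math


def assert_finite(values, name="output"):
    """Raise if any value is NaN or infinite; otherwise return the values."""
    bad = [i for i, v in enumerate(values) if math.isnan(v) or math.isinf(v)]
    if bad:
        raise FloatingPointError(
            f"{name}: {len(bad)} non-finite value(s), first at index {bad[0]}"
        )
    return values


print(assert_finite([0.1, 0.2, 0.7]))  # [0.1, 0.2, 0.7]

try:
    assert_finite([0.1, float("nan"), 0.3], name="softmax output")
except FloatingPointError as err:
    print(err)  # softmax output: 1 non-finite value(s), first at index 1
```

In PyTorch code the same check would more naturally be written with torch.isfinite on the tensor; the point of the sketch is only that the error reports where the bad values are rather than hiding them.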

Task Priority

Priority	Task	Effort	Impact	Status
~~1~~	~~Fix config imports (blocking)~~	~~Low~~	~~Critical~~	✅ Done
~~2~~	~~Remove deleted modules~~	~~Trivial~~	~~Low~~	✅ Done
~~3~~	~~Fix hardcoded CUDA~~	~~Trivial~~	~~Medium~~	✅ Done
~~4~~	~~Remove unused imports~~	~~Trivial~~	~~Low~~	✅ Done
~~5~~	~~Create preset files~~	~~Medium~~	~~Medium~~	✅ Done
~~6~~	~~Complete MetaConfig/ML3Config~~	~~Medium~~	~~Medium~~	✅ Done
~~7~~	~~Non-continuous trajectory support~~	~~Medium~~	~~High~~	✅ Done
~~8~~	~~Merge dataset files~~	~~Medium~~	~~Medium~~	✅ Done
9	Integrate ML3 encoder	Medium	High	Pending

Progress Tracker

Completed

[x] Read and analyzed codebase
[x] Created CODEBASE_SUMMARY.md
[x] Created DEVELOPMENT_PLAN.md (this file)
[x] Phase 1: Critical Fixes (config imports, CUDA hardcoding)
[x] Phase 2.1: Merge dataset files (old files → legacy/, unified → vampnet_dataset.py)
[x] Phase 2.2: Remove deleted modules (psevo/, viz/)
[x] Phase 3.2: Complete config presets (medium.py, large.py)
[x] Phase 3.3: Non-continuous trajectory support (vampnet_dataset.py, base_config.py)
[x] Phase 4.1: Remove unused imports (pymol)

In Progress

[ ] Phase 3.1: Integrate ML3 encoder

Pending

[ ] Phase 4.2: Fix NaN handling (deferred)

Last updated: 2026-02-04