
PyGVAMP Development Plan

This document tracks the current development priorities and progress for cleaning up and completing the PyGVAMP codebase.


Overview

Goals:

  1. Consolidate duplicate code
  2. Complete missing features
  3. Reduce technical debt
  4. Improve maintainability


Phase 1: Critical Fixes (Blocking Issues)

These issues block basic use of the package: the first causes import errors when the package is loaded, and the second fails at runtime on machines without CUDA.

1.1 Fix Broken Config Imports

Status: ✅ Complete

Problem: pygv/config/__init__.py imports classes that don't exist:

  • Line 4: imports MetaConfig and ML3Config, which are commented out in base_config.py
  • Lines 6-7: import from presets/medium.py and presets/large.py, which don't exist

Solution:

  • [x] Uncomment MetaConfig and ML3Config in base_config.py
  • [x] Create presets/medium.py with MediumSchNetConfig and MediumMetaConfig
  • [x] Create presets/large.py with LargeSchNetConfig and LargeMetaConfig
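A small smoke test can catch this class of error before it reaches users. The sketch below is an assumption about how such a check might look, not code from the repository: missing_exports is a hypothetical helper that imports a module and reports which expected names it lacks. It is demonstrated on the standard-library json module so that it runs anywhere; for PyGVAMP the call would presumably target pygv.config with the class names listed above.

```python
import importlib


def missing_exports(module_name, expected_names):
    """Return the names in expected_names that the module does not define.

    Hypothetical helper: importing the module also surfaces any broken
    import statement inside it as an ImportError.
    """
    module = importlib.import_module(module_name)
    return [name for name in expected_names if not hasattr(module, name)]


# Demonstrated on a standard-library module; "not_a_real_name" is deliberately absent.
demo_missing = missing_exports("json", ["dumps", "loads", "not_a_real_name"])
print(demo_missing)
```

An empty list means every expected name resolved; any entry points at an export that still needs to be defined or uncommented.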

1.2 Fix Hardcoded CUDA

Status: ✅ Complete

Problem: training.py:268 uses model.to('cuda') instead of the device variable, so training fails on machines without CUDA

Solution:

  • [x] Change model.to('cuda') to model.to(device) after device determination
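The pattern behind this fix can be sketched as follows. This is a minimal illustration rather than the code in training.py: the Linear layer is a stand-in model, and the way PyGVAMP actually determines its device (for example from a command-line flag) is not shown in this document.

```python
import torch

# Pick the device once, then reuse the same variable everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model for illustration only; the real model is built in training.py.
model = torch.nn.Linear(4, 2).to(device)

# Inputs must live on the same device as the model.
batch = torch.randn(8, 4, device=device)
output = model(batch)
print(output.shape, output.device.type)
```

Because both the model and the batch follow the device variable, the same script runs unchanged on CPU-only and GPU machines.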


Phase 2: Code Consolidation

2.1 Merge Duplicate Dataset Files

Status: ✅ Complete

Original files:

  • pygv/dataset/vampnet_dataset.py (base) → moved to legacy/
  • pygv/dataset/vampnet_dataset_with_AA.py (amino acid variant) → moved to legacy/
  • pygv/dataset/vampnet_dataset_new.py (unified version) → renamed to vampnet_dataset.py

Solution:

  • [x] Review vampnet_dataset_new.py and determine if it should be the canonical version
  • [x] Add use_amino_acid_encoding flag to main dataset
  • [x] Add get_AA_frames() method to main dataset
  • [x] Load topology lazily when AA encoding is used
  • [x] Move old dataset files to legacy/ folder
  • [x] Rename unified dataset to vampnet_dataset.py
  • [x] Update imports in area51/area52 test files
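The lazy-loading step above can be illustrated with a short sketch. Everything here is hypothetical: LazyTopologyDataset, encode_amino_acids, and the injected loader are stand-ins chosen so the example runs without a trajectory library, and they do not reproduce the real VAMPNetDataset or the signature of get_AA_frames().

```python
from functools import cached_property


class LazyTopologyDataset:
    """Hypothetical stand-in, not the real VAMPNetDataset."""

    def __init__(self, topology_file, use_amino_acid_encoding=False, loader=None):
        self.topology_file = topology_file
        self.use_amino_acid_encoding = use_amino_acid_encoding
        # The loader is injected so the sketch needs no trajectory library.
        self._loader = loader or (lambda path: {"path": path})
        self.load_count = 0  # counts how often the topology was actually read

    @cached_property
    def topology(self):
        # Runs on first access only; the result is then cached on the instance.
        self.load_count += 1
        return self._loader(self.topology_file)

    def encode_amino_acids(self):
        if not self.use_amino_acid_encoding:
            raise RuntimeError("use_amino_acid_encoding is False")
        # Real code would build per-residue features; this only touches the topology.
        return self.topology


dataset = LazyTopologyDataset("protein.pdb", use_amino_acid_encoding=True)
dataset.encode_amino_acids()
dataset.encode_amino_acids()
print(dataset.load_count)
```

With this layout, datasets that never request amino acid encoding never pay the cost of parsing the topology.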

2.2 Remove Deleted Modules

Status: ✅ Complete

Files staged for deletion (already in git):

  • psevo/ directory (entire module)
  • viz/ directory (empty)

Solution:

  • [x] Directories have been deleted


Phase 3: Feature Completion

3.1 Integrate ML3 Encoder

Status: Pending

Problem: training.py:194 returns encoder = None for ML3 type

Working code exists: pygv/encoder/ml3.py has GNNML3 class

Solution:

  • [ ] Import GNNML3 in training.py
  • [ ] Instantiate ML3 encoder with proper config parameters
  • [ ] Add ML3-specific arguments to args_train.py
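One way to avoid silently returning encoder = None is to dispatch through a registry that fails loudly on unknown types. The sketch below is only an illustration of that idea: the stand-in classes, the registry keys, and the hidden_dim argument are assumptions, and the real GNNML3 constructor arguments are not shown in this document.

```python
class SchNetStandIn:
    """Placeholder encoder; the real classes live in pygv/encoder/."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class ML3StandIn:
    """Placeholder for GNNML3; its real constructor signature is not assumed here."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


# Hypothetical mapping from a config string to an encoder class.
ENCODER_REGISTRY = {"schnet": SchNetStandIn, "ml3": ML3StandIn}


def build_encoder(encoder_type, **encoder_kwargs):
    """Instantiate the requested encoder, raising instead of returning None."""
    try:
        encoder_cls = ENCODER_REGISTRY[encoder_type.lower()]
    except KeyError:
        known = ", ".join(sorted(ENCODER_REGISTRY))
        raise ValueError(
            f"Unknown encoder type {encoder_type!r}; expected one of: {known}"
        ) from None
    return encoder_cls(**encoder_kwargs)


encoder = build_encoder("ML3", hidden_dim=32)
print(type(encoder).__name__, encoder.kwargs)
```

Raising a ValueError at construction time surfaces a missing encoder immediately, rather than as an attribute error on None later in training.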

3.2 Complete Config Presets

Status: ✅ Complete

Needed files:

  • pygv/config/presets/medium.py: standard training configs
  • pygv/config/presets/large.py: production training configs

Solution:

  • [x] Create medium presets with balanced hyperparameters
  • [x] Create large presets with higher capacity settings
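A common way to structure presets like these is dataclass inheritance, where each preset overrides only the defaults it changes. The sketch below assumes that approach. Apart from continuous, which this plan describes as a BaseConfig field, every class name, field name, and value is made up for illustration and does not reflect the actual presets.

```python
from dataclasses import asdict, dataclass


@dataclass
class SketchBaseConfig:
    # continuous mirrors the BaseConfig field described in this plan;
    # the other fields and all values are hypothetical.
    continuous: bool = True
    hidden_dim: int = 32
    n_layers: int = 2


@dataclass
class SketchMediumConfig(SketchBaseConfig):
    # Override only the defaults that differ from the base.
    hidden_dim: int = 64
    n_layers: int = 3


@dataclass
class SketchLargeConfig(SketchBaseConfig):
    hidden_dim: int = 128
    n_layers: int = 4


# Individual fields can still be overridden per run.
config = SketchLargeConfig(continuous=False)
print(asdict(config))
```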

3.3 Non-Continuous Trajectory Support

Status: ✅ Complete

Problem: All trajectory files were concatenated as one continuous trajectory, causing time-lagged pairs to incorrectly span across trajectory boundaries (e.g., from the end of one simulation to the start of another).

Solution implemented in vampnet_dataset_new.py (since renamed to vampnet_dataset.py):

  • [x] Added continuous parameter to __init__() (default True for backward compatibility)
  • [x] Tracked trajectory boundaries in _process_trajectories() via self.trajectory_boundaries
  • [x] Filtered cross-boundary pairs in _create_time_lagged_pairs() when continuous=False
  • [x] Updated cache filename to include cont/noncont suffix
  • [x] Updated cache save/load to include trajectory_boundaries and continuous config
  • [x] Added continuous: bool = True to BaseConfig in base_config.py

Usage:

# Independent simulations - pairs won't cross trajectory boundaries
dataset = VAMPNetDataset(
    trajectory_files=[...],
    topology_file="protein.pdb",
    lag_time=20.0,
    continuous=False
)
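To make the boundary filtering concrete, here is a self-contained sketch of the idea. It is not the code in _create_time_lagged_pairs(): lagged_pairs is a hypothetical function, it derives boundaries from per-trajectory frame counts rather than from self.trajectory_boundaries, and it assumes the lag has already been converted from time units to a number of frames.

```python
def lagged_pairs(trajectory_lengths, lag_frames, continuous=True):
    """Return (t0, t1) global frame-index pairs separated by lag_frames.

    trajectory_lengths: frames per trajectory, in concatenation order.
    continuous=False drops pairs whose two frames lie in different trajectories.
    """
    total = sum(trajectory_lengths)
    if continuous:
        # Treat the concatenation as one long trajectory.
        return [(i, i + lag_frames) for i in range(total - lag_frames)]

    pairs = []
    start = 0
    for length in trajectory_lengths:
        end = start + length  # exclusive upper bound of this trajectory
        # Only keep pairs whose second frame is still inside this trajectory.
        pairs.extend((i, i + lag_frames) for i in range(start, end - lag_frames))
        start = end
    return pairs


# Two trajectories of 5 and 4 frames, lag of 2 frames.
print(len(lagged_pairs([5, 4], 2)))
print(lagged_pairs([5, 4], 2, continuous=False))
```

In this example the continuous setting yields 7 pairs, while continuous=False yields 5: the pairs (3, 5) and (4, 6), which would link the end of the first trajectory to the start of the second, are dropped.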


Phase 4: Code Quality

4.1 Remove Unused Imports

Status: ✅ Complete

Known issues:

  • training.py:10: from pymol.querying import distance is unused

Solution:

  • [x] Removed unused import

4.2 Fix NaN Handling

Status: Deferred

Problem: vampnet.py replaces NaN outputs with zeros (masking the problem)

Solution: Investigate root cause of NaN generation
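As a starting point for that investigation, a check that raises on non-finite values, instead of replacing them with zeros, makes the first failing tensor visible. The sketch below is an assumption about how such a check could look; check_finite is a hypothetical helper and is not part of vampnet.py.

```python
import torch


def check_finite(name, tensor):
    """Raise instead of silently zeroing when a tensor has NaN or Inf values."""
    if not torch.isfinite(tensor).all():
        n_nan = int(torch.isnan(tensor).sum())
        n_inf = int(torch.isinf(tensor).sum())
        raise FloatingPointError(
            f"{name}: {n_nan} NaN and {n_inf} Inf values out of {tensor.numel()}"
        )
    return tensor


# A clean tensor passes through unchanged.
clean = check_finite("clean", torch.tensor([0.1, 0.9]))

# A tensor containing NaN raises with a count of the offending values.
try:
    check_finite("model output", torch.tensor([0.5, float("nan")]))
except FloatingPointError as err:
    print(err)
```

Placing such checks after individual layers narrows down where the NaN first appears. PyTorch's torch.autograd.set_detect_anomaly(True) can complement this by reporting the backward operation that produced a NaN gradient.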


Task Priority

| Priority | Task | Effort | Impact | Status |
|----------|------|--------|--------|--------|
| ~~1~~ | ~~Fix config imports (blocking)~~ | ~~Low~~ | ~~Critical~~ | ✅ Done |
| ~~2~~ | ~~Remove deleted modules~~ | ~~Trivial~~ | ~~Low~~ | ✅ Done |
| ~~3~~ | ~~Fix hardcoded CUDA~~ | ~~Trivial~~ | ~~Medium~~ | ✅ Done |
| ~~4~~ | ~~Remove unused imports~~ | ~~Trivial~~ | ~~Low~~ | ✅ Done |
| ~~5~~ | ~~Create preset files~~ | ~~Medium~~ | ~~Medium~~ | ✅ Done |
| ~~6~~ | ~~Complete MetaConfig/ML3Config~~ | ~~Medium~~ | ~~Medium~~ | ✅ Done |
| ~~7~~ | ~~Non-continuous trajectory support~~ | ~~Medium~~ | ~~High~~ | ✅ Done |
| ~~8~~ | ~~Merge dataset files~~ | ~~Medium~~ | ~~Medium~~ | ✅ Done |
| 9 | Integrate ML3 encoder | Medium | High | Pending |

Progress Tracker

Completed

  • [x] Read and analyze codebase
  • [x] Created CODEBASE_SUMMARY.md
  • [x] Created DEVELOPMENT_PLAN.md (this file)
  • [x] Phase 1: Critical Fixes (config imports, CUDA hardcoding)
  • [x] Phase 2.1: Merge dataset files (old files → legacy/, unified → vampnet_dataset.py)
  • [x] Phase 2.2: Remove deleted modules (psevo/, viz/)
  • [x] Phase 3.2: Complete config presets (medium.py, large.py)
  • [x] Phase 3.3: Non-continuous trajectory support (vampnet_dataset.py, base_config.py)
  • [x] Phase 4.1: Remove unused imports (pymol)

In Progress

  • [ ] Phase 3.1: Integrate ML3 encoder

Pending

  • [ ] Phase 4.2: Fix NaN handling (deferred)

Last updated: 2026-02-04