Skip to content

Data Package¤

The top-level artifex.data package is intentionally narrow.

Artifex does not ship a second generic data-processing framework with top-level helpers such as load_dataset, ImageDataset, DataPipeline, or streaming loader classes. The general data-loading story lives in datarax-backed guides and in modality-local dataset helpers.

Current Public Surface¤

Today the retained top-level data surface is:

  • artifex.data.protein

That subpackage exports the protein-specific dataset helpers that are still part of the checked-in runtime:

  • ProteinDataset
  • ProteinDatasetConfig
  • ProteinStructure
  • protein_collate_fn
  • create_synthetic_protein_dataset
  • pdb_to_protein_example

Protein Dataset Example¤

from artifex.data import protein
from artifex.data.protein import ProteinDataset, ProteinDatasetConfig

config = ProteinDatasetConfig(max_seq_length=128)
dataset = ProteinDataset(config, data_dir="./protein-pickles")
batch = dataset.get_batch(4)

assert protein.ProteinDataset is ProteinDataset
print(batch["atom_positions"].shape)

ProteinDataset is backed by datarax's DataSourceModule, so it keeps the standard Datarax indexing, iteration, batching, and Pipeline(...) integration story.

Where The Broader Data Story Lives¤

For general data-loading guidance, use the datarax-backed docs that describe the surviving runtime directly:

For modality-specific synthetic dataset helpers, use the owners under artifex.generative_models.modalities.*.datasets rather than expecting extra subpackages under artifex.data.