Iteratively train an ML model on a dataset

In the previous tutorial, we loaded an entire dataset into memory to perform a simple analysis.

Here, we iterate over the files within the dataset to train an ML model one file at a time.

import lamindb as ln
import anndata as ad
import numpy as np

💡 loaded instance: testuser1/test-scrna (lamindb 0.56a1)

ln.track()

💡 notebook imports: anndata==0.9.2 lamindb==0.56a1 numpy==1.25.2 scvi-tools==1.0.4

💡 Transform(id=5, uid='Qr1kIHvK506rz8', name='Iteratively train an ML model on a dataset', short_name='scrna5', version='0', type=notebook, updated_at=2023-10-16 21:48:20, created_by_id=1)

💡 Run(id=5, uid='W2pBST5UpRtEsWPgpTEn', run_at=2023-10-16 21:48:20, transform_id=5, created_by_id=1)

Setup

dataset_v2 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="2").one()

dataset_v2

Dataset(id=2, uid='PUwAqZ7C9c7EBQ7f0IgN', name='My versioned scRNA-seq dataset', version='2', hash='JNjc88f22TLVPJHdgo7X', updated_at=2023-10-16 21:47:46, transform_id=2, run_id=2, initial_version_id=1, created_by_id=1)

We import scvi-tools.

import scvi

Similar to what we did in the previous tutorial, we could load the entire dataset into memory and train a model in 4 lines of code.

Let us instead load all file records:

file1, file2 = dataset_v2.files.list()

We’d like some context on what the first file contains and where it’s from:

file1.describe()
file1.view_flow()


File(id=1, uid='MpmFBPjOegAuuyRzbGAq', suffix='.h5ad', accessor='AnnData', description='Conde22', size=57615999, hash='6Hu1BywwK6bfIU2Dpku2xZ', hash_type='sha1-fl', updated_at=2023-10-16 21:47:05)

Provenance:
  🗃️ storage: Storage(id=1, uid='bEvhinvc', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-10-16 21:45:55, created_by_id=1)
  📔 transform: Transform(id=1, uid='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-10-16 21:46:01, created_by_id=1)
  👣 run: Run(id=1, uid='Tcf2JasApz9P8RE7gqna', run_at=2023-10-16 21:46:01, transform_id=1, created_by_id=1)
  👤 created_by: User(id=1, uid='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-16 21:45:55)
  ⬇️ input_of (core.Run): ['2023-10-16 21:47:14', '2023-10-16 21:47:54']
Features:
  var: FeatureSet(id=1, uid='ISCdUu2vePwTzsv99UmG', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-10-16 21:46:57, modality_id=1, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(id=2, uid='AeMRCgTDsUFRNwXFlqD9', n=4, registry='core.Feature', hash='xPTyeKYm-_4RH5MEI97t', updated_at=2023-10-16 21:46:58, modality_id=2, created_by_id=1)
    🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...

https://d33wubrfki0l68.cloudfront.net/f2c32262ca694fe81b07de1a123901233c165a93/bcbeb/_images/e1da1f824ad6381942b9f43c039a5cbee9b8ce0e9fefff1272578665e501f920.svg

We need to decide which features to use for training the model.

Because every file is validated, all of them are indexed by ensembl_gene_id in the var slot of AnnData. We can therefore restrict training to the genes that both files share.

shared_genes = file1.features["var"] & file2.features["var"]
shared_genes_ensembl = shared_genes.list("ensembl_gene_id")
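Conceptually, the two lines above keep only the genes that both files measure, so that every training matrix has identical columns. The following plain-Python sketch illustrates that idea. It does not use the lamindb API, and the gene identifiers and counts are made-up placeholders rather than real Ensembl IDs.

```python
# Hypothetical gene identifiers per file (placeholders, not real Ensembl IDs)
genes_file1 = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]
genes_file2 = ["ENSG_C", "ENSG_A", "ENSG_E"]

# Keep the genes present in both files, preserving the order of the first file
file2_lookup = set(genes_file2)
shared = [gene for gene in genes_file1 if gene in file2_lookup]
print(shared)  # ['ENSG_A', 'ENSG_C']

# Subset a toy count matrix (rows: cells, columns: genes_file1) to the shared genes
column_index = {gene: i for i, gene in enumerate(genes_file1)}
counts_file1 = [[5, 0, 2, 1], [0, 3, 4, 0]]
subset = [[row[column_index[gene]] for gene in shared] for row in counts_file1]
print(subset)  # [[5, 2], [0, 4]]
```

Applying the same ordered list of shared genes to every file is what lets a model trained on one file continue training on the next.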

Train the model

Let us load the first file into memory:

data_train1 = file1.load().raw[:, shared_genes_ensembl].to_adata()
data_train1

AnnData object with n_obs × n_vars = 1648 × 749
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'cell_type_ontology_term_id_colors', 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'

Train the model on this first file:

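scvi.model.SCVI.setup_anndata(data_train1)  # register the AnnData with scvi-tools
vae = scvi.model.SCVI(data_train1)
vae.train(max_epochs=1)  # max_epochs=1 keeps the CI run short; use more epochs for real training
vae.save("saved_models/scvi1")  # write a checkpoint that we can resume from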

Load the second file and resume training the model:

data_train2 = file2.load().raw[:, shared_genes_ensembl].to_adata()
vae = scvi.model.SCVI.load("saved_models/scvi1", data_train2)  # restore the saved model and attach the second file
vae.train(max_epochs=1)
vae.save("saved_models/scvi1", overwrite=True)  # replace the previous checkpoint
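The pattern used above is: train on one chunk, save a checkpoint, load it, and continue on the next chunk. Here is a minimal sketch of that pattern with a toy one-dimensional linear model trained by stochastic gradient descent. It is not scvi-tools code: the model, the data, and the helper names (`sgd_epoch`, `save_checkpoint`, `load_checkpoint`) are hypothetical and serve only to illustrate the control flow.

```python
import json
import tempfile
from pathlib import Path


def sgd_epoch(weight, bias, samples, lr=0.05):
    """Run one SGD pass over (x, y) pairs for the model y ~ weight * x + bias."""
    for x, y in samples:
        error = (weight * x + bias) - y
        weight -= lr * error * x
        bias -= lr * error
    return weight, bias


def save_checkpoint(path, weight, bias):
    """Persist the model state as JSON."""
    Path(path).write_text(json.dumps({"weight": weight, "bias": bias}))


def load_checkpoint(path):
    """Restore the model state written by save_checkpoint."""
    state = json.loads(Path(path).read_text())
    return state["weight"], state["bias"]


# Two toy chunks standing in for the two files of the dataset
chunk1 = [(0.0, 1.0), (1.0, 3.0)]
chunk2 = [(2.0, 5.0), (3.0, 7.0)]

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "model.json"

    # Train on the first chunk and save
    w, b = sgd_epoch(0.0, 0.0, chunk1)
    save_checkpoint(checkpoint, w, b)

    # Load the checkpoint, resume on the second chunk, and overwrite it
    w, b = load_checkpoint(checkpoint)
    w, b = sgd_epoch(w, b, chunk2)
    save_checkpoint(checkpoint, w, b)
    resumed = load_checkpoint(checkpoint)

# For this toy model, resuming from the checkpoint matches one uninterrupted pass
one_pass = sgd_epoch(0.0, 0.0, chunk1 + chunk2)
print(resumed == one_pass)
```

In this toy case the checkpoint holds the complete training state, so the resumed run is identical to an uninterrupted one. For a real model, whether that holds depends on what the checkpoint stores (for example, optimizer state), so treat the equivalence as a property of the sketch only.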

Save the model

weights = ln.File("saved_models/scvi1/model.pt", description="My trained model")
weights.save()

Save latent representation as a new dataset

latent1 = vae.get_latent_representation(data_train1)
latent2 = vae.get_latent_representation(data_train2)

adata_latent1 = ad.AnnData(X=latent1, obs=data_train1.obs)
adata_latent2 = ad.AnnData(X=latent2, obs=data_train2.obs)

INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

Because the latent representation is low-dimensional, we can typically fit a very large number of observations into memory.

Hence, let’s store it as a single concatenated AnnData object.

adata_latent = ad.concat([adata_latent1, adata_latent2])
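To put the memory argument in numbers, the following back-of-the-envelope sketch compares a dense expression matrix with a latent representation. The number of cells is hypothetical, and the 4-byte floats, dense storage, and latent dimensionality of 10 are assumptions, not values reported by this notebook. The gene count is taken from the var feature set shown by file1.describe().

```python
def dense_bytes(n_obs, n_vars, bytes_per_value=4):
    """Size in bytes of a dense n_obs x n_vars matrix."""
    return n_obs * n_vars * bytes_per_value


n_obs = 1_000_000  # hypothetical number of cells
n_genes = 36_503   # size of the var feature set reported for file1
n_latent = 10      # assumed latent dimensionality

raw_gb = dense_bytes(n_obs, n_genes) / 1e9
latent_mb = dense_bytes(n_obs, n_latent) / 1e6

print(f"dense expression matrix: {raw_gb:.0f} GB")  # 146 GB
print(f"latent representation:   {latent_mb:.0f} MB")  # 40 MB
```

Expression matrices are usually stored in sparse form, so the dense figure is an upper bound. Even so, the latent representation is several orders of magnitude smaller.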

dataset_v2_latent = ln.Dataset(
    adata_latent,
    name="Latent representation of scRNA-seq dataset v2",
    description="For the original data, see dataset T5x0SkRJNviE0jYGbJKt",
)
dataset_v2_latent.save()

Let us look at the data flow:

dataset_v2_latent.view_flow()

https://d33wubrfki0l68.cloudfront.net/50568ead66a3826366f3352f966042d6816edcd1/74f28/_images/a5c3a198904ef86cd318193eafc75f23475c45a303c0ec81f62cb2767bd248fa.svg

Compare this with the data flow of the saved model weights:

weights.view_flow()

https://d33wubrfki0l68.cloudfront.net/c85e1f85e89e829f2a5ac48c1d53034637234a58/c9708/_images/01f39fa45ec6c202e35793a712c9e8e7b85225305b4cea874a572f18e5308b55.svg

Annotate the latent dataset with the labels of the original dataset:

dataset_v2_latent.labels.add_from(dataset_v2)

dataset_v2_latent.describe()

Dataset(id=3, uid='jYMFlK9mNGmFDeil1JMv', name='Latent representation of scRNA-seq dataset v2', description='For the original data, see dataset T5x0SkRJNviE0jYGbJKt', hash='1iz3mtUQx29dqxHtnJnY1A', updated_at=2023-10-16 21:48:28)

Provenance:
  💫 transform: Transform(id=5, uid='Qr1kIHvK506rz8', name='Iteratively train an ML model on a dataset', short_name='scrna5', version='0', type=notebook, updated_at=2023-10-16 21:48:20, created_by_id=1)
  👣 run: Run(id=5, uid='W2pBST5UpRtEsWPgpTEn', run_at=2023-10-16 21:48:20, transform_id=5, created_by_id=1)
  📄 file: File(id=5, uid='jYMFlK9mNGmFDeil1JMv', suffix='.h5ad', accessor='AnnData', description='See dataset jYMFlK9mNGmFDeil1JMv', size=220226, hash='1iz3mtUQx29dqxHtnJnY1A', hash_type='md5', updated_at=2023-10-16 21:48:28, storage_id=1, transform_id=5, run_id=5, created_by_id=1)
  👤 created_by: User(id=1, uid='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-10-16 21:45:55)
Features:
  external: FeatureSet(id=10, uid='7QNG1GjLqcCsXb5Z10rf', n=5, registry='core.Feature', hash='Rc0k4cM-byrLP221ZkvF', updated_at=2023-10-16 21:48:29, modality_id=2, created_by_id=1)
    🔗 cell_type (39, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 species (1, bionty.Species): 'human'
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (39, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...

# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna

💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env

✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna