Raw sample metadata were obtained from the Gene Expression Omnibus (GEO). Each sample was annotated in three categories of interest: biopsy site, cell type, and disease state. Annotation was performed by matching substrings in the raw metadata against a manually curated list of target values defined for each category.

Substring Matching and Category Assignment

The annotation process relied on substring matching as a rule-based approach. If the presence of a particular substring in the metadata was determined to consistently and unambiguously indicate membership in a given category, then all samples containing that substring were assigned to that category. For example, if a specific substring was associated exclusively with one biopsy site, every sample metadata entry containing that substring was annotated accordingly. This approach ensured consistent mapping across large datasets while maintaining transparency of the decision rules.
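The rule described above can be illustrated with a minimal Python sketch. The rule table, the category keys, and the `annotate` function are hypothetical and chosen only for illustration; they are not the curated list used in Mavatar Discovery. The sketch also assumes case-insensitive matching and leaves a category unassigned when substrings pointing to different target values are found, which is one possible way to enforce the "unambiguous" requirement.

```python
from typing import Dict, Optional

# Hypothetical curated rules: for each category, substring -> target value.
RULES: Dict[str, Dict[str, str]] = {
    "biopsy_site": {"pbmc": "blood", "whole blood": "blood", "colon": "colon"},
    "cell_type": {"cd4": "CD4+ T cell", "monocyte": "monocyte"},
    "disease_state": {"crohn": "Crohn's disease", "healthy": "healthy control"},
}


def annotate(metadata: str,
             rules: Dict[str, Dict[str, str]] = RULES) -> Dict[str, Optional[str]]:
    """Assign one target value per category by substring matching.

    A category is assigned only if every matching substring points to the
    same target value; otherwise it is left as None (assumption).
    """
    text = metadata.lower()  # assumption: matching is case-insensitive
    result: Dict[str, Optional[str]] = {}
    for category, substrings in rules.items():
        matches = {value for sub, value in substrings.items() if sub in text}
        result[category] = next(iter(matches)) if len(matches) == 1 else None
    return result


example = annotate("Tissue: colon biopsy; disease: Crohn's disease")
```

Here `example` assigns a biopsy site and a disease state and leaves the cell type unassigned, because none of the cell type substrings occur in the entry.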

To evaluate the accuracy of these semi-manual annotations, field professionals manually curated the biopsy site, cell type, and disease annotations available for a total of 28,066 samples, which account for 1.4% of all samples included in Mavatar Discovery as of October 2025. Categorization was accurate for 94.5% (23,748 of 25,128) of biopsy site, 97.6% (4,396 of 4,503) of cell type, and 95.6% (6,111 of 6,393) of disease annotations, indicating high overall performance of category assignment. To further improve these categorizations, an AI-based approach was implemented to retrieve the corresponding information from GEO. Every case in which the outputs of the two approaches differed was manually curated by field professionals.
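As a worked check, the reported percentages follow directly from the counts given above. The Python sketch below recomputes them and shows one possible way to flag samples on which two annotation approaches disagree. The function names and the dictionary layout are assumptions for illustration and do not describe the actual Mavatar Discovery pipeline.

```python
from typing import Dict, List, Optional

# Counts reported in the text: (correct annotations, annotations reviewed).
REPORTED = {
    "biopsy site": (23_748, 25_128),
    "cell type": (4_396, 4_503),
    "disease": (6_111, 6_393),
}


def accuracy_percent(correct: int, total: int) -> float:
    """Share of correct annotations, as a percentage with one decimal."""
    return round(100 * correct / total, 1)


def disagreements(rule_based: Dict[str, Optional[str]],
                  ai_based: Dict[str, Optional[str]]) -> List[str]:
    """Sample IDs whose annotation differs between the two approaches.

    A sample annotated by only one approach also counts as a disagreement
    (assumption), so that it is routed to manual curation.
    """
    sample_ids = rule_based.keys() | ai_based.keys()
    return sorted(s for s in sample_ids if rule_based.get(s) != ai_based.get(s))


accuracies = {name: accuracy_percent(c, t) for name, (c, t) in REPORTED.items()}

# Hypothetical sample IDs and annotations, for illustration only.
to_review = disagreements(
    {"S1": "blood", "S2": "colon", "S3": None},
    {"S1": "blood", "S2": "ileum", "S3": "skin"},
)
```

In this toy input, the samples with differing annotations are the ones collected in `to_review` for manual curation.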

To retrieve additional metadata for the data included in patient stratification analyses, a semi-automated approach was implemented that harmonized the metadata fields across datasets, followed by manual approval or modification by field professionals.
