What happens when datasets in GEO have mislabeled samples or incorrect metadata?

Mislabeled samples and incorrect metadata are a known reality in public repositories. We address this at multiple levels.

During curation, we review and standardize metadata, which helps catch inconsistencies. Examples include a sample labeled as a healthy control but annotated with disease-associated experimental conditions, or a study description that contradicts the sample-level metadata.
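As a rough illustration of the kind of inconsistency check described above, the following sketch flags samples whose label says "healthy control" while their condition text contains a disease-associated term. The record fields (`id`, `label`, `condition`), the term list, and the function name are hypothetical and chosen only for this example; they do not reflect the GEO schema or the actual curation pipeline.

```python
# Hypothetical term list and record layout, for illustration only.
DISEASE_TERMS = {"tumor", "carcinoma", "infected", "lesional"}

def find_label_conflicts(samples):
    """Return IDs of samples labeled as healthy controls whose
    condition text contains a disease-associated term."""
    conflicts = []
    for sample in samples:
        label = sample["label"].lower()
        # Split the free-text condition into lowercase words, ignoring commas.
        words = set(sample["condition"].lower().replace(",", " ").split())
        if label == "healthy control" and words & DISEASE_TERMS:
            conflicts.append(sample["id"])
    return conflicts

samples = [
    {"id": "S1", "label": "healthy control", "condition": "normal skin biopsy"},
    {"id": "S2", "label": "healthy control", "condition": "lesional skin, untreated"},
    {"id": "S3", "label": "disease", "condition": "lesional skin, untreated"},
]
flagged = find_label_conflicts(samples)
print(flagged)
```

A simple keyword match like this would only surface candidates for manual review; it cannot decide which of the two conflicting annotations is correct.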

At the network level, the same principle that protects against batch effects applies here. A mislabeled sample introduces noise, but it is one sample among thousands. Our networks are built on correlations that must be statistically significant across large integrated datasets, so a handful of mislabeled samples cannot drive a gene-gene interaction into the final network on their own. The genuine biological signal, replicated across correctly labeled samples from many independent studies, will dominate.
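The dilution argument above can be illustrated with a small simulation. This is a toy sketch, not the network construction method: it assumes two genes with a true linear relationship across 2,000 samples, and models a mislabeled sample as one whose second-gene value is unrelated to the first. All names and numbers are made up for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_mislabeled = 2000, 5

# Two simulated genes with a genuine positive relationship.
gene_a = [rng.gauss(0, 1) for _ in range(n_samples)]
gene_b = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in gene_a]
r_clean = pearson(gene_a, gene_b)

# Assumption: a mislabeled sample contributes a gene_b value unrelated to gene_a.
noisy_b = list(gene_b)
for i in rng.sample(range(n_samples), n_mislabeled):
    noisy_b[i] = rng.gauss(0, 1)
r_noisy = pearson(gene_a, noisy_b)

print(f"clean r = {r_clean:.3f}, with {n_mislabeled} mislabeled: r = {r_noisy:.3f}")
```

With only 5 of 2,000 samples altered, the two coefficients should differ only slightly, which is the intuition behind a handful of mislabeled samples being unable to create or erase an interaction on their own.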

That said, metadata quality in GEO is an ongoing challenge across the entire field, and no curation pipeline can guarantee that every label is correct. This is why traceability matters. Through the Functional Annotation card, you can see which datasets and cohorts support each interaction, so you can evaluate data provenance and flag anything that looks inconsistent with your biological interpretation.

How can I get information on the data contributing to the graph?

Mavatar Discovery gives you full transparency into what's behind your network through two main tools.

DINA Network card: This provides a high-level overview of the data foundation underlying your network, including the number of samples and studies contributing to it. This is your starting point for understanding the scale and scope of the data behind what you're seeing on screen.

Functional Annotation card: Using the drop-down menu, you can group the data by several categories: Data Source (which shows the raw GEO Series IDs), Biopsy Sites, Cell Types, Diseases, or Other Tags. Each view plots the average weighted correlation coefficient (AWCC) for datasets within that grouping, giving you a picture of how different data slices contribute to the overall network. When you select an individual edge, its AWCC is highlighted within the plot, so you can assess how that specific interaction is supported across different data categories. Clicking on any group category provides more detailed information on the top contributing datasets, letting you trace an interaction all the way back to the studies and sample cohorts that produced it. 
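To make the grouped view more concrete, here is a minimal sketch of how a per-group weighted average correlation could be computed for a single edge. The exact weighting behind AWCC is not defined in this section, so the sketch assumes, purely for illustration, that each dataset's correlation is weighted by its sample count. The record fields, series names, and values are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical per-dataset records for one edge (placeholder IDs and values).
datasets = [
    {"series": "GSE_A", "disease": "Disease A", "n_samples": 100, "r": 0.60},
    {"series": "GSE_B", "disease": "Disease A", "n_samples": 300, "r": 0.80},
    {"series": "GSE_C", "disease": "Disease B", "n_samples": 50, "r": 0.40},
]

def weighted_mean_r_by_group(records, group_key):
    """Average the correlation within each group, weighting each dataset
    by its sample count (an assumed weighting, not a documented one)."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of r * n, sum of n]
    for rec in records:
        acc = totals[rec[group_key]]
        acc[0] += rec["r"] * rec["n_samples"]
        acc[1] += rec["n_samples"]
    return {group: weighted / n for group, (weighted, n) in totals.items()}

by_disease = weighted_mean_r_by_group(datasets, "disease")
by_source = weighted_mean_r_by_group(datasets, "series")
print(by_disease)
```

Changing `group_key` mirrors switching the drop-down between categories: grouping by `series` gives one value per dataset, while grouping by `disease` pools the datasets that share a tag.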

