Metadata
What happens when datasets in GEO have mislabeled samples or incorrect metadata?

Mislabeled samples and incorrect metadata are a known reality in public repositories. We address this at multiple levels.

During curation, we review and standardize metadata, which helps catch inconsistencies: for instance, a sample labeled as a healthy control but annotated with disease-associated experimental conditions, or contradictory information between the study description and the sample-level metadata.

At the network level, the same principle that protects against batch effects applies here. A mislabeled sample introduces noise, but it is one sample among thousands. Our networks are built on correlations that must be statistically significant across large integrated datasets, so a handful of mislabeled samples cannot drive a gene-gene interaction into the final network on their own. The genuine biological signal, replicated across correctly labeled samples from many independent studies, will dominate.

That said, metadata quality in GEO is an ongoing challenge across the entire field, and no curation pipeline can guarantee that every label is correct. This is why traceability matters: through the Functional Annotation Card, you can see which datasets and cohorts support each interaction, giving you the ability to evaluate data provenance and flag anything that looks inconsistent with your biological interpretations.

How can I get information on the data contributing to the graph?
Mavatar Discovery gives you full transparency into what's behind your network through two main tools.

DINA Network Card
This provides a high-level overview of the data foundation underlying your network, including the number of samples and studies contributing to it. This is your starting point for understanding the scale and scope of the data behind what you're seeing on screen.

Functional Annotation Card
Using the drop-down menu, you can group the data by several categories: data source (which shows the raw GEO series IDs), biopsy sites, cell types, diseases, or other tags. Each view plots the average weighted correlation coefficient (AWCC) for datasets within that grouping, giving you a picture of how different data slices contribute to the overall network. When you select an individual edge, its AWCC is highlighted within the plot, so you can assess how that specific interaction is supported across different data categories. Clicking on any group category provides more detailed information on the top contributing datasets, letting you trace an interaction all the way back to the studies and sample cohorts that produced it.
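
To make the grouped view more concrete, here is a minimal Python sketch of how a per-group weighted average correlation for a single edge could be computed. The exact AWCC formula is not stated here, so the sketch assumes a sample-size-weighted mean of per-dataset correlation coefficients. The records, field names, placeholder series labels, and function name are all hypothetical and are not part of Mavatar Discovery.

```python
from collections import defaultdict

# Hypothetical per-dataset records for one edge (one gene pair).
# "r" is the correlation observed in that dataset; the series labels
# are placeholders, not real GEO series IDs.
datasets = [
    {"series": "GSE_A", "disease": "asthma", "n_samples": 100, "r": 0.60},
    {"series": "GSE_B", "disease": "asthma", "n_samples": 300, "r": 0.40},
    {"series": "GSE_C", "disease": "COPD", "n_samples": 50, "r": 0.20},
]


def weighted_mean_correlation_by_group(records, group_key):
    """Sample-size-weighted mean correlation per group.

    Assumption: weights are sample counts. The real AWCC weighting
    may differ.
    """
    # group -> [sum of r * n_samples, sum of n_samples]
    sums = defaultdict(lambda: [0.0, 0])
    for rec in records:
        acc = sums[rec[group_key]]
        acc[0] += rec["r"] * rec["n_samples"]
        acc[1] += rec["n_samples"]
    return {group: weighted / total for group, (weighted, total) in sums.items()}


by_disease = weighted_mean_correlation_by_group(datasets, "disease")
by_series = weighted_mean_correlation_by_group(datasets, "series")
```

Under this assumption, larger datasets pull the group average toward their own correlation. Grouping by "series" instead of "disease" returns one value per dataset, which mirrors tracing an interaction back to the individual studies behind it.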