Data quality

How do you handle data quality? There is a lot of published data that is not good, wrong, or fabricated. Do you do any quality control (QC) to remove this noise? How can I trust the results? / GEO is full of poorly designed experiments: no controls, no replicates, underpowered studies. How do you filter these out?
When researchers publish transcriptomic data, they are required to deposit it in repositories like the Gene Expression Omnibus (GEO). As of June 2025, GEO contains over 7.2 million transcriptomes. This is an extraordinary resource, but one that comes with well-known challenges. Public data is often messy, inconsistently annotated, and variable in quality. Some studies lack proper controls or sufficient replicates, others are underpowered, and in rare cases data may even be fabricated. Simply downloading everything from GEO and building networks on top of it would produce unreliable results. Raw data volume alone is therefore not what makes the platform valuable; the curation and quality control process is.

Our approach to curation and QC
Before any dataset contributes to a Mavatar Discovery network, it passes through quality control analyses designed to filter out noise, technical artifacts, and unreliable signal. This includes evaluating data integrity, consistency, and statistical robustness at multiple stages of the pipeline. The goal is to ensure that the final network models represent only statistically significant, reproducible gene-gene correlations, not artifacts from a single poorly designed study or a batch effect from one laboratory.

Integration itself is a form of quality control
Beyond explicit QC steps, the large-scale integration approach provides an inherent layer of protection against low-quality or fabricated data. Our networks are not built from any single dataset. They are derived from the aggregated signal across many independent studies, laboratories, and patient cohorts. A fabricated or low-quality dataset represents one data point among potentially hundreds contributing to a given tissue network. For a spurious signal from one bad study to meaningfully affect the network topology, it would need to be consistent with patterns observed across many other independent datasets, which, by definition, fabricated or artifactual signal is not. Genuine biological correlations are reinforced by replication across studies; noise and fabrication are diluted by it. This means that even in the rare event that a problematic dataset passes our QC filters, its impact on the final network is negligible, because its signal will not be corroborated by the broader data.

Traceability lets you verify what you see
Unlike tools built on literature curation or other black-box scoring systems, Mavatar Discovery lets you trace interactions back to the contributing data sources through the functional annotation card. You can see which datasets and patient cohorts support a given edge in your network, which lets you evaluate the provenance of any interaction yourself. If an edge is supported by multiple independent studies across different laboratories and patient groups, you can have high confidence in it. If it rests on a narrow data foundation, you can flag it for further validation.

Summary
We don't assume public data is clean. We assume it isn't, and we build accordingly. Rigorous curation, statistical quality control, and the inherent noise-dampening effect of large-scale multi-study integration work together to ensure that the biology you see in Mavatar Discovery reflects real, reproducible signal rather than the artifacts and inconsistencies that inevitably exist in public repositories.

How do you handle batch effects across different studies? / You are mixing datasets from different platforms. How do you handle platform-specific biases?
Batch effects and platform-specific biases are among the biggest challenges in integrating transcriptomic data from multiple sources. If technical biases are not accounted for, they can introduce false correlations into gene networks: two genes might appear co-expressed not because they are biologically co-regulated, but because they are similarly affected by a platform-specific artifact.

We address platform-specific biases through normalization and standardization at the data level, and through consensus-based signal detection at the network level.

Before datasets enter our network-building pipeline, we apply standardization and normalization approaches designed to bring expression values onto a comparable scale across studies and platforms. This addresses systematic differences in expression magnitude and distribution that arise from different sequencing depths, library preparation methods, or quantification approaches.

Beyond normalization, our correlation analyses are designed to capture gene-to-gene relationships that are consistent across multiple independent datasets, not patterns that are driven by a single study or platform. A batch effect is specific to one dataset or one laboratory, and it will not replicate across studies performed by different groups using different protocols. A real biological correlation, on the other hand, will appear repeatedly across independent datasets. By requiring signal to be statistically significant across the integrated data, we prioritize reproducible biology over study-specific technical artifacts.

What journals have accepted papers using your platform?
Mavatar Discovery launched at the end of 2025, so we are still in the early stages of building our publication track record. That said, the platform is already being used in active research collaborations with Sobi (https://www.mavatar.com/posts/mavatar-enters-research-collaboration-with-sobi), Pfizer (within the Vinnova project, https://www.mavatar.com/posts/mavatar-delivers-new-data-driven-resources-to-advance-precision-medicine-for-ibd), and Diamyd, among others. Several of these collaborations are targeting high-impact journals, but as you know, academic publishing timelines can be slow. We will also be presenting abstracts at several conferences before summer. In the meantime, we're happy to share case studies that demonstrate the platform's capabilities, and we can provide abstracts and publications as they become available.

How do you normalize the data?
Our normalization procedures are designed to ensure comparability across datasets from different studies, laboratories, and sequencing platforms. That said, specific implementation details, including the exact pipeline, parameter configurations, and proprietary optimizations, are part of our protected intellectual property. This is standard practice for commercial bioinformatics platforms. What we can tell you is that every normalization decision has been validated by our R&D (research and development) team to ensure that the resulting networks reflect genuine biological signal rather than technical artifacts. If you have specific methodological questions for a publication or review context, our team is available to discuss what can be disclosed.

Do you check for sample swaps or contamination?
Our quality control pipeline includes checks designed to identify anomalous samples that behave inconsistently with their annotated metadata, which can result from sample swaps, contamination, or mislabeling. Samples that fail these checks are excluded from network construction and from other downstream analyses, including conditions expression charts and patient stratification.

Additionally, the same principle that protects against other data quality issues applies here. Even if a swapped or contaminated sample passes QC, it represents one sample among thousands. Its aberrant signal will not replicate across independent datasets and therefore cannot meaningfully influence the final network topology. The genuine biological correlations, supported by correctly processed samples across many studies, will dominate.

Has your platform been peer reviewed or validated?
The platform is new to market, so we don't yet have a dedicated platform methods paper in a peer-reviewed journal. However, the underlying science, including co-expression network biology, the statistical frameworks we use, and the analytical approaches, is grounded in well-established, peer-reviewed methodologies.

Internally, every component of the pipeline has been validated by our R&D team against known biological benchmarks, established gene interaction datasets, and curated pathway databases to ensure that the networks produce biologically meaningful and reproducible results.

On the external validation front, we have active research collaborations with Sobi (https://www.mavatar.com/posts/mavatar-enters-research-collaboration-with-sobi), Pfizer (within the Vinnova project, https://www.mavatar.com/posts/mavatar-delivers-new-data-driven-resources-to-advance-precision-medicine-for-ibd), and Diamyd, several of which are targeting high-impact publications. As these collaborations produce results, peer-reviewed publications using the platform will follow. We are also presenting abstracts at several conferences, which will provide additional external scrutiny of the platform's outputs.

We recognize that peer-reviewed validation is important, and it is actively underway. In the meantime, we welcome direct scientific discussion about our methodology. Our team is available to walk through the validation evidence with you.

Does Mavatar Discovery account for multiple testing? What is your correction strategy?
For the functional enrichment card, p-values are calculated using the hypergeometric distribution and adjusted for multiple testing using the Benjamini-Hochberg (BH) correction to control the false discovery rate.

For edge-level p-values in the network, the current version (version 1) does not apply a multiple testing correction. This is something we are implementing in future versions. In the meantime, users can apply more stringent p-value and t-value filters when exploring the network to focus on the highest-confidence interactions. The fact that edges must be supported by consistent signal across multiple independent datasets provides an additional layer of robustness beyond the p-value alone.

How do you handle missing values?
The specific approach to handling missing values is part of our proprietary pipeline. What we can say is that missing data is accounted for during preprocessing to ensure that it does not introduce false correlations or bias into the network models. Our R&D team has validated this process to confirm that the treatment of missing values does not compromise the statistical integrity of the resulting networks. If you have specific concerns about how missing data might affect a particular analysis or tissue network, feel free to reach out to our team for a more detailed discussion.

How do you test for false positives? What is your precision recall? Do you have any "gold" standard networks?
In the field of gene interaction networks there is no universally agreed-upon gold standard. Existing databases like STRING, KEGG, or BioGRID capture known interactions, but they are inherently biased toward well-studied genes and pathways. Using them as a gold standard would therefore penalize any platform that discovers novel interactions, which is precisely what a data-driven approach is designed to do.

That said, we do validate our networks. We benchmark against established biological knowledge, confirming that known pathway genes cluster together, that well-characterized gene interactions are captured, and that functional enrichment on network modules returns biologically coherent results. These serve as sanity checks that the networks are producing meaningful biology rather than noise.

Formal precision-recall metrics in the traditional sense are difficult to report for a platform like this, because the "true negative" set, that is, the gene pairs that definitively do not interact in each tissue, essentially does not exist. What appears to be a false positive today may be a novel interaction that hasn't been validated yet.

Our practical approach to controlling false positives relies on four layers: statistical thresholding, quality control, multi-study replication, and the requirement that signal is consistent across independent datasets. As a user, you can further increase stringency by filtering on t-value and p-value to focus on the highest-confidence edges.

This is an area of ongoing development. As the platform matures and more collaborations produce experimentally validated results, our ability to report formal performance metrics will grow with it. If you'd like to discuss validation in more detail, our R&D team is happy to walk through our benchmarking approach.

Are your networks directional?
Our networks are not directional in the sense of showing regulatory directionality (i.e., gene A activates gene B). They report positive correlations, that is, genes whose expression levels increase together across the datasets contributing to the network.

This means the edges represent co-expression relationships rather than a causal or regulatory hierarchy. Two connected genes are consistently co-upregulated across samples, but the network does not infer which gene is driving the other. For regulatory directionality, you can complement your Mavatar Discovery findings with external tools or experimental validation designed to establish causal relationships.
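
To illustrate why a positive-correlation edge carries no direction, here is a minimal sketch in Python. It is not the Mavatar Discovery pipeline, which is proprietary: the use of Pearson correlation, the 0.8 threshold, the gene names, and the expression values are all assumptions made for the example. The point is only that correlation is symmetric, so an edge between two genes is the same whichever gene is listed first.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def positive_coexpression_edges(expression, min_r=0.8):
    """Return undirected edges between genes with correlation >= min_r.

    Each edge key is a frozenset of two gene names, so (A, B) and (B, A)
    are the same edge. The threshold is an illustrative assumption.
    """
    edges = {}
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if r >= min_r:  # keep positive correlations only
            edges[frozenset((gene_a, gene_b))] = r
    return edges


# Hypothetical expression values for three genes across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises together with GENE_A
    "GENE_C": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as GENE_A rises
}

edges = positive_coexpression_edges(expression)
```

In this toy data only GENE_A and GENE_B end up connected. The negatively correlated GENE_C gets no edge, and nothing in the result says whether GENE_A drives GENE_B or the reverse.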