Metadata Annotation
Raw sample metadata were obtained from the Gene Expression Omnibus (GEO). Each sample was annotated into three categories of interest: biopsy site, cell type, and disease state. Annotation was performed by matching substrings in the raw metadata to a manually curated list of target values defined for each category.

Substring matching and category assignment

The annotation process relied on substring matching as a rule-based approach. If the presence of a particular substring in the metadata was determined to consistently and unambiguously indicate membership in a given category, then all samples containing that substring were assigned to that category. For example, if a specific substring was associated exclusively with one biopsy site, every sample metadata entry containing that substring was annotated accordingly. This approach ensured consistent mapping across large datasets while maintaining transparency of the decision rules.

To evaluate the accuracy of these semi-manual curations, field professionals manually curated the available biopsy site, cell type, and disease categorization of a total of 28,066 samples, accounting for 1.4% of all samples included in Mavatar Discovery as of October 2025. The results showed accurate categorization of 94.5% (23,748 of 25,128) biopsy site, 97.6% (4,396 of 4,503) cell type, and 95.6% (6,111 of 6,393) disease annotations, indicating overall high performance of category assignment. For further improvements on these categorizations, an AI-based approach was implemented to retrieve the corresponding information from GEO. Every case where the output from the two approaches differed was manually curated by field professionals.

Semi-automated metadata harmonization

To retrieve further metadata information for data included in patient stratification analyses, a semi-automated approach was implemented, harmonizing the metadata fields across datasets, followed by manual approval or modification by field professionals.
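
The rule-based substring matching and the accuracy figures described above can be illustrated with a minimal Python sketch. It is not the pipeline actually used: the rule table `RULES`, its substrings and target values, and the function names `annotate` and `accuracy_percent` are all hypothetical, and leaving a category unassigned when several target values match is an assumption made here for illustration.

```python
from typing import Dict, Optional

# Hypothetical rule table: category -> {substring: target value}.
# The entries are illustrative only, not the manually curated list.
RULES: Dict[str, Dict[str, str]] = {
    "biopsy_site": {"colon": "colon", "synovi": "synovium"},
    "cell_type": {"cd4": "CD4+ T cell", "monocyte": "monocyte"},
    "disease_state": {
        "ulcerative colitis": "ulcerative colitis",
        "rheumatoid arthritis": "rheumatoid arthritis",
    },
}


def annotate(
    raw_metadata: str, rules: Dict[str, Dict[str, str]] = RULES
) -> Dict[str, Optional[str]]:
    """Assign one target value per category by case-insensitive substring matching."""
    text = raw_metadata.lower()
    result: Dict[str, Optional[str]] = {}
    for category, substrings in rules.items():
        # Collect every target value whose substring occurs in the metadata.
        matches = {value for sub, value in substrings.items() if sub in text}
        # Assumption: assign only when exactly one target value matches;
        # no match or conflicting matches leave the category unassigned.
        result[category] = next(iter(matches)) if len(matches) == 1 else None
    return result


def accuracy_percent(correct: int, total: int) -> float:
    """Share of annotations confirmed by manual curation, in percent (one decimal)."""
    return round(100 * correct / total, 1)


if __name__ == "__main__":
    print(annotate("tissue: colon biopsy; disease: Ulcerative Colitis"))
    print(accuracy_percent(23748, 25128))  # biopsy site counts from the text
```

Keeping the rules in a plain table keeps the decision rules transparent: each assignment can be traced to one substring. The helper `accuracy_percent` reproduces the reported percentages from the counts given in the text.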