Data Collection and Pre-Processing
Metadata Annotation

Raw sample metadata were obtained from the Gene Expression Omnibus (GEO). Each sample was annotated in three categories of interest: biopsy site, cell type, and disease state. Annotation was performed by matching substrings in the raw metadata against a manually curated list of target values defined for each category.

Substring Matching and Category Assignment

The annotation process relied on substring matching as a rule-based approach. If the presence of a particular substring in the metadata was determined to consistently and unambiguously indicate membership in a given category, all samples containing that substring were assigned to that category. For example, if a specific substring was associated exclusively with one biopsy site, every sample metadata entry containing that substring was annotated accordingly. This approach ensured consistent mapping across large datasets while keeping the decision rules transparent.

To evaluate the accuracy of these semi-manual curations, field professionals manually curated the available biopsy site, cell type, and disease categorization of a total of 28,066 samples, accounting for 1.4% of all samples included in Mavatar Discovery as of October 2025. The results showed accurate categorization of 94.5% (23,748 of 25,128) of biopsy site, 97.6% (4,396 of 4,503) of cell type, and 95.6% (6,111 of 6,393) of disease annotations, indicating overall high performance of category assignment.

Sample Selection for Network Analyses

Deep Integrated Network Analysis (DINA) is a framework that integrates comprehensive data sources, allowing researchers to gain deeper insights into complex biological systems in a fully data-driven manner. To this end, DINA networks representing specific biological contexts were constructed by integrating all samples representative of the given context, i.e., by curating substrings that were queried against the full set of raw metadata (see Metadata Annotation for details). All samples containing the queried substrings were automatically included in the corresponding analysis. This approach allowed efficient retrieval of relevant samples and ensured that inclusion criteria were directly traceable to metadata-based rules.

Expression Data Collection

Expression data were obtained from the Gene Expression Omnibus (GEO). Microarray datasets were extracted from GEO family SOFT files using in-house tools, and probe identifiers were mapped using platform annotation tables from the corresponding platform SOFT files. RNA-seq data were obtained (where made available by NCBI) as raw count matrices mapped by NCBI's pipeline [1]. Genes were translated to Ensembl IDs using information from NCBI (as of 24 April 2024) and Ensembl (v113) [2]. Datasets from single-cell studies were identified and downloaded manually. Gene identifier translation was performed as explained in the Data Pre-Processing section.

Data Pre-Processing

A key strength of Mavatar Discovery is its systematic integration of diverse datasets and samples, enabling comprehensive network construction across multiple biological contexts. To ensure methodological rigor and reproducibility, extensive efforts were made to incorporate publicly available data whenever applicable, thereby maximizing coverage while minimizing potential biases. However, to enhance robustness and reliability, poor-quality datasets containing fewer than 15 samples or fewer than 100 genes were excluded. To further increase the quality of downstream analyses, samples and genes with extremely low expression variation or a high degree of technical dropouts were filtered out. Each dataset was normalized and scaled accordingly.

Pseudobulk analysis refers to the construction of aggregated expression profiles by averaging single-cell transcript counts within a defined sample or cell type group. This strategy enables downstream analyses to be conducted using frameworks developed for bulk RNA-seq, while preserving single-cell resolution at the level of the chosen grouping. The pseudobulk data included in the networks are based on publicly available single-cell transcriptomics data, pre-processed and re-annotated using our in-house methods.

For each patient or healthy sample in each dataset, the retrieved expression matrix was pre-processed using Seurat, version 4.1.3 [4], removing low-quality cells with fewer than 200 expressed genes or a mitochondrial gene detection higher than 25%. Genes detected in fewer than three cells were also excluded. The data were then normalized and scaled for dimensionality reduction, followed by clustering analysis.

Cell type re-annotation was performed systematically. Each cell was annotated against two references using SingleR, version 2.4.1 [3]. A consensus major (e.g., T lymphocytes) and minor (e.g., Th1 cells, Th2 cells) cell type was assigned to each cell based on overlapping results between the two references. Cells with inconclusive annotation were labeled "unknown". Our annotation method for major cell types was compared with annotated public datasets, showing an 83% median concordance across five whole-dataset comparisons.

The pseudobulk data were then constructed by calculating the average expression for each minor-degree cell type within each sample. A data-specific cut-off for the minimum number of cells required for stable pseudobulk construction was applied, based on extensive internal tests of reproducibility and stability. In cases where too few cell types passed these criteria, leading to too few pseudobulk samples for downstream correlation analysis, pseudobulk production was based on the major-degree cell type annotation instead. Pseudobulk expression matrices were created for both patient-specific (averaged by cell type) and cell type-specific (averaged by patient) expression profiles.

To enable consistent downstream analysis, gene identifiers were standardized. Specifically, all gene names were systematically mapped to Ensembl IDs, providing a unified and stable reference framework that facilitates cross-platform integration. For microarray data, the platform-specific annotation was first used to translate probe names to gene names, followed by translation of gene names to Ensembl IDs using the species-specific GTF file downloaded from Ensembl, release 113 [2, 5].

The total number of datasets and samples included in each network construction can be found under Models inside Mavatar Discovery.

References

1. NCBI-generated RNA-seq count data, Beta. GEO, NCBI (2024) (available at https://www.ncbi.nlm.nih.gov/geo/info/rnaseqcounts.html).

2. S. C. Dyer, O. Austine-Orimoloye, A. G. Azov, M. Barba, I. Barnes, V. P. Barrera-Enriquez, A. Becker, R. Bennett, M. Beracochea, A. Berry, J. Bhai, S. K. Bhurji, S. Boddu, P. R. Branco Lins, L. Brooks, S. B. Ramaraju, L. I. Campbell, M. C. Martinez, M. Charkhchi, L. A. Cortes, C. Davidson, S. Denni, K. Dodiya, S. Donaldson, B. El Houdaigui, T. El Naboulsi, O. Falola, R. Fatima, T. Genez, J. G. Martinez, T. Gurbich, M. Hardy, Z. Hollis, T. Hunt, M. Kay, V. Kaykala, D. Lemos, D. Lodha, N. Mathlouthi, G. A. Merino, R. Merritt, L. P. Mirabueno, A. Mushtaq, S. N. Hossain, J. G. Pérez-Silva, M. Perry, I. Piližota, D. Poppleton, I. Prosovetskaia, S. Raj, A. I. A. Salam, S. Saraf, N. Saraiva-Agostinho, S. Sinha, B. Sipos, V. Sitnik, E. Steed, M. M. Suner, L. Surapaneni, K. Sutinen, F. F. Tricomi, I. Tsang, D. Urbina-Gómez, A. Veidenberg, T. A. Walsh, N. L. Willhoft, J. Allen, J. Alvarez-Jarreta, M. Chakiachvili, J. Cheema, J. B. da Rocha, N. H. De Silva, S. Giorgetti, L. Haggerty, G. R. Ilsley, J. Keatley, J. E. Loveland, B. Moore, J. M. Mudge, G. Naamati, J. Tate, S. J. Trevanion, A. Winterbottom, B. Flint, A. Frankish, S. E. Hunt, R. D. Finn, M. A. Freeberg, P. W. Harrison, F. J. Martin, A. D. Yates, Ensembl 2025. Nucleic Acids Res. 53, D948–D957 (2025) (available at https://academic.oup.com/nar/article/53/D1/D948/7916352).

3. D. Aran, A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, R. P. Naikawadi, P. J. Wolters, A. R. Abate, A. J. Butte, M. Bhattacharya, Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019) (available at https://www.nature.com/articles/s41590-018-0276-y).

4. Y. Hao, S. Hao, E. Andersen-Nissen, W. M. Mauck, S. Zheng, A. Butler, M. J. Lee, A. J. Wilk, C. Darby, M. Zager, P. Hoffman, M. Stoeckius, E. Papalexi, E. P. Mimitou, J. Jain, A. Srivastava, T. Stuart, L. M. Fleming, B. Yeung, A. J. Rogers, J. M. McElrath, C. A. Blish, R. Gottardo, P. Smibert, R. Satija, Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021) (available at https://doi.org/10.1016/j.cell.2021.04.048).

5. Ensembl genome browser 115 (2024) (available at https://mart.ensembl.org).
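
Illustrative Sketch: Rule-Based Substring Annotation

The rule-based substring matching described under metadata annotation can be sketched in a few lines of Python. This is a minimal illustration, not the in-house implementation: the rule table, the example substrings, the function name `annotate`, and the choice to leave a category unassigned when substrings point to conflicting values are all assumptions made for this example.

```python
from typing import Dict, Optional

# Hypothetical curated rules (NOT the actual curated list):
# category -> {substring: target value}
RULES: Dict[str, Dict[str, str]] = {
    "biopsy_site": {"pbmc": "blood", "whole blood": "blood", "synovi": "synovium"},
    "cell_type": {"cd4": "T lymphocytes", "monocyte": "monocytes"},
    "disease_state": {"rheumatoid arthritis": "rheumatoid arthritis",
                      "healthy": "healthy"},
}


def annotate(metadata: str,
             rules: Dict[str, Dict[str, str]] = RULES) -> Dict[str, Optional[str]]:
    """Assign one value per category based on substrings found in raw metadata."""
    text = metadata.lower()  # case-insensitive matching (an assumption)
    annotation: Dict[str, Optional[str]] = {}
    for category, substrings in rules.items():
        # Collect every target value whose substring occurs in the metadata.
        hits = {value for sub, value in substrings.items() if sub in text}
        # Assign only when the match is unambiguous; otherwise leave unassigned.
        annotation[category] = next(iter(hits)) if len(hits) == 1 else None
    return annotation


if __name__ == "__main__":
    print(annotate("PBMC from rheumatoid arthritis patient, CD4+ T cells"))
    print(annotate("synovial tissue, healthy control"))
```

Because every assignment traces back to one entry in the rule table, the inclusion decision for any sample can be audited by inspecting which substring matched.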