Data Pre-Processing

A key strength of Mavatar Discovery is its systematic integration of diverse datasets and samples, enabling comprehensive network construction across multiple biological contexts. To ensure methodological rigor and reproducibility, extensive efforts were made to incorporate publicly available data whenever applicable, thereby maximizing coverage while minimizing potential biases. However, to enhance robustness and reliability, poor-quality datasets containing fewer than 15 samples or fewer than 100 genes were excluded. To further increase the quality of downstream analyses, samples and genes with extremely low expression variation or a high degree of technical dropouts were filtered out. Each dataset was normalized and scaled accordingly.

Pseudobulk analysis refers to the construction of aggregated expression profiles by averaging single-cell transcript counts within a defined sample or cell type group. This strategy enables downstream analyses to be conducted using frameworks developed for bulk RNA-seq, while preserving single-cell resolution at the level of the chosen grouping. The pseudobulk data included in the networks are based on publicly available single-cell transcriptomics data, pre-processed and re-annotated using our in-house methods.

For each patient or healthy sample in each dataset, the retrieved expression matrix was pre-processed using Seurat, version 4.1.3 (1), removing low-quality cells with fewer than 200 expressed genes or a mitochondrial gene detection higher than 25%. Genes detected in fewer than three cells were also excluded. The data were then normalized and scaled for subsequent dimensionality reduction, followed by clustering analysis.

Cell type re-annotation was performed systematically. Each cell was annotated against two references using SingleR, version 2.4.1 (2). A consensus major (e.g., T lymphocytes) and minor (e.g., Th1 cells, Th2 cells) cell type was assigned to each cell based on overlapping results between the two references. Cells with inconclusive annotation were labeled as “unknown”. Our annotation method for major cell types was compared to annotated public datasets, showing an 83% median concordance across five whole-dataset comparisons.

The pseudobulk data were then constructed by calculating the average expression for each minor-degree cell type within each sample. A data-specific cut-off for the minimum number of cells required for stable pseudobulk construction was applied, based on extensive internal tests of reproducibility and stability. In cases where too few cell types passed these criteria, leading to too few pseudobulk samples for downstream correlation analysis, the pseudobulk construction was based on the major-degree cell type annotation instead. Pseudobulk expression matrices were created for both patient-specific (averaged by cell type) and cell type-specific (averaged by patient) expression profiles.

To enable consistent downstream analysis, gene identifiers were standardized. Specifically, all gene names were systematically mapped to Ensembl IDs, providing a unified and stable reference framework that facilitates cross-platform integration. For microarray data, the platform-specific annotation was first used to translate probe names to gene names, followed by translation of gene names to Ensembl IDs using the species-specific GTF file downloaded from Ensembl, release 113 (3, 4).

The total number of datasets and samples included in each network construction can be found under Models inside Mavatar Discovery: https://discovery.mavatar.com

References

1. Y. Hao, S. Hao, E. Andersen-Nissen, W. M. Mauck, S. Zheng, A. Butler, M. J. Lee, A. J. Wilk, C. Darby, M. Zager, P. Hoffman, M. Stoeckius, E. Papalexi, E. P. Mimitou, J. Jain, A. Srivastava, T. Stuart, L. M. Fleming, B. Yeung, A. J. Rogers, J. M. McElrath, C. A. Blish, R. Gottardo, P. Smibert, R. Satija, Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021) (available at https://doi.org/10.1016/j.cell.2021.04.048).

2. D. Aran, A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, R. P. Naikawadi, P. J. Wolters, A. R. Abate, A. J. Butte, M. Bhattacharya, Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019) (available at https://www.nature.com/articles/s41590-018-0276-y).

3. Ensembl genome browser 113 (2024) (available at https://mart.ensembl.org).

4. S. C. Dyer, O. Austine-Orimoloye, A. G. Azov, M. Barba, I. Barnes, V. P. Barrera-Enriquez, A. Becker, R. Bennett, M. Beracochea, A. Berry, J. Bhai, S. K. Bhurji, S. Boddu, P. R. Branco Lins, L. Brooks, S. B. Ramaraju, L. I. Campbell, M. C. Martinez, M. Charkhchi, L. A. Cortes, C. Davidson, S. Denni, K. Dodiya, S. Donaldson, B. El Houdaigui, T. El Naboulsi, O. Falola, R. Fatima, T. Genez, J. G. Martinez, T. Gurbich, M. Hardy, Z. Hollis, T. Hunt, M. Kay, V. Kaykala, D. Lemos, D. Lodha, N. Mathlouthi, G. A. Merino, R. Merritt, L. P. Mirabueno, A. Mushtaq, S. N. Hossain, J. G. Pérez-Silva, M. Perry, I. Piližota, D. Poppleton, I. Prosovetskaia, S. Raj, A. I. A. Salam, S. Saraf, N. Saraiva-Agostinho, S. Sinha, B. Sipos, V. Sitnik, E. Steed, M. M. Suner, L. Surapaneni, K. Sutinen, F. F. Tricomi, I. Tsang, D. Urbina-Gómez, A. Veidenberg, T. A. Walsh, N. L. Willhoft, J. Allen, J. Alvarez-Jarreta, M. Chakiachvili, J. Cheema, J. B. da Rocha, N. H. De Silva, S. Giorgetti, L. Haggerty, G. R. Ilsley, J. Keatley, J. E. Loveland, B. Moore, J. M. Mudge, G. Naamati, J. Tate, S. J. Trevanion, A. Winterbottom, B. Flint, A. Frankish, S. E. Hunt, R. D. Finn, M. A. Freeberg, P. W. Harrison, F. J. Martin, A. D. Yates, Ensembl 2025. Nucleic Acids Res. 53, D948–D957 (2025) (available at https://academic.oup.com/nar/article/53/D1/D948/7916352).
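Illustrative example: the cell-level quality filter and the pseudobulk averaging described in the pre-processing text can be sketched in a few lines of plain Python. This is only a minimal sketch and not the Seurat-based pipeline used for Mavatar Discovery. The function names, the dictionary-based data layout, and the `min_cells` value in the usage example are hypothetical, and "mitochondrial gene detection" is assumed here to mean the fraction of a cell's counts that come from mitochondrial genes.

```python
from collections import defaultdict


def passes_cell_qc(counts, mito_genes, min_genes=200, max_mito_fraction=0.25):
    """Keep a cell if it has at least `min_genes` expressed genes and at most
    `max_mito_fraction` of its counts from mitochondrial genes.

    `counts` maps gene name -> raw count for one cell (hypothetical layout).
    The defaults mirror the thresholds stated in the text (200 genes, 25%).
    """
    expressed = sum(1 for c in counts.values() if c > 0)
    total = sum(counts.values())
    mito = sum(c for gene, c in counts.items() if gene in mito_genes)
    # A cell with no counts at all is treated as failing the filter.
    mito_fraction = mito / total if total else 1.0
    return expressed >= min_genes and mito_fraction <= max_mito_fraction


def pseudobulk(cells, min_cells):
    """Average expression per (sample, cell type) group.

    `cells` is a list of dicts with keys "sample", "cell_type" and "expr"
    (gene -> normalized expression). Groups with fewer than `min_cells`
    cells are skipped; the text describes this cut-off as data-specific,
    so the caller has to choose the value.
    """
    groups = defaultdict(list)
    for cell in cells:
        groups[(cell["sample"], cell["cell_type"])].append(cell["expr"])

    profiles = {}
    for key, exprs in groups.items():
        if len(exprs) < min_cells:
            continue
        genes = sorted({gene for expr in exprs for gene in expr})
        # Genes missing from a cell are counted as zero expression.
        profiles[key] = {
            gene: sum(expr.get(gene, 0.0) for expr in exprs) / len(exprs)
            for gene in genes
        }
    return profiles


# Toy usage with made-up values.
example_cells = [
    {"sample": "P1", "cell_type": "Th1 cells", "expr": {"GENE_A": 1.0, "GENE_B": 4.0}},
    {"sample": "P1", "cell_type": "Th1 cells", "expr": {"GENE_A": 3.0}},
    {"sample": "P1", "cell_type": "Th2 cells", "expr": {"GENE_A": 5.0}},
]
example_profiles = pseudobulk(example_cells, min_cells=2)
```

In this toy input the Th2 group contains a single cell and is therefore dropped by the `min_cells=2` cut-off, which mirrors the situation in which too few minor cell types pass the stability criterion and the major cell type annotation is used instead.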