SUBRATA SAHA
MY RESEARCH
My research interests focus on designing and developing novel sequential and parallel algorithms, data structures, and data analysis techniques for big data analytics. We live in a period when voluminous datasets are generated in every walk of life. It is essential to evaluate and extract useful information from large and complex datasets, e.g., biomedical, biological, and text data. I have worked on some of the fundamental problems arising in big data analytics and invented novel algorithms that outperformed the best prior algorithms. These methods have been published in leading journals and conferences, such as Bioinformatics, BMC Bioinformatics, BMC Genomics, BMC Human Genomics, ACM BCB, IEEE BIBM, IEEE AINA, IEEE ICDM, and ACM CIKM, among others. A list of selected publications can be found in my curriculum vitae.
At present, I am working as an associate research scientist at the Irving Medical Center of Columbia University to acquire comprehensive knowledge of biology so that I can design and develop more accurate and pragmatic algorithms in the fields of life sciences and health informatics. In this capacity, I have developed and optimized machine learning and data analytics algorithms to elucidate the complex genetics of rectal neuroendocrine tumors (RNETs), a rare and aggressive form of colorectal cancer. For the first time, I have identified strong inherited components in RNETs, with experimental confirmation. Furthermore, I have demonstrated that ensemble machine learning algorithms can predict an individual's future susceptibility to RNETs with a very high level of accuracy (100% in our completely separate validation cohort) by examining a few rare, nonsynonymous germline variants. These findings will shape clinical management and open the avenue for identifying novel therapeutic drug targets for RNETs. In addition, I have developed algorithms for reconstructing TF-target networks, variant collapsing analytics, GWAS multi-locus analytics, and gene co-association analytics based on graph theory and machine learning techniques.
Previously, I worked as a postdoctoral researcher in the computational genomics group at the IBM T.J. Watson Research Center of IBM Research in Yorktown Heights, New York, where I developed efficient and scalable machine learning algorithms and big data analysis techniques to decipher the biology of Alzheimer's disease, Parkinson's disease, and COVID-19. This work drew on genome sequences, including both DNA and RNA sequences, as well as other post-genomic data such as genomic DNA microarray data, RNA expression data, biological pathways, and protein-protein interaction networks. Both Alzheimer's and Parkinson's are chronic neurodegenerative diseases that usually start slowly and worsen over time. Next, I outline my theoretical and practical contributions in the fields of biological sequence modeling and analysis, bioinformatics, computational biology, and data mining.
BIOLOGICAL BIG DATA ANALYSIS
Biological data is big, and its scale has already grown well beyond the petabyte range, even into the exabyte range. Therefore, efficient and scalable computational techniques are essential to translate massive amounts of information into a better understanding of the basic biomedical mechanisms. Considering these facts, I have developed novel algorithms and data analytics methods for solving several fundamental problems in the fields of bioinformatics and computational biology, including biological sequence error correction, compression, and assembly; complex biological network (e.g., biological pathways, protein-protein interactions, and gene co-expression) analysis; and haplotype reconstruction and phasing. These algorithms have been published in Bioinformatics, BMC Bioinformatics, BMC Genomics, ACM BCB, and IEEE BIBM.
BIOLOGICAL BIG DATA CLASSIFICATION
There are multiple ways to categorize nearly everything in biology. Classification enables systems-level analysis of large biological datasets as well as automation. Unfortunately, the classification task is increasingly challenged by the sheer volume of biological data. Bearing these facts in mind, I have designed and implemented a set of efficient, scalable, and robust classification algorithms across varied domains of computational biology, such as metagenomic sequence classification, supervised feature selection, 2-locus and 3-locus problems in genome-wide association studies, and genome-wide spliced junction discovery. These research works have been published in Bioinformatics, BMC Human Genomics, BMC MIDM, ACM BCB, IEEE BIBM, ACM CIKM, and IEEE ICDM.
BIG DATA MINING AND PATTERN RECOGNITION
The most fundamental challenge for big data applications is to explore large volumes of data and extract useful information for future actions. In many situations, the data mining and pattern recognition procedures must be very efficient and close to real time because storing all observed data is nearly infeasible. I have worked on a few fundamental problems arising in big data analytics, such as clustering, data reduction, closest pair detection, mining similar pairs of points, feature selection, time-series motif mining, autoconfiguration, and broadcasting. These research works have been published in IEEE ICDM, ADMA, DMIN, IEEE AINA, Ad-Hoc Now, JNSM, and IJFCS.