SUBRATA SAHA
PROJECTS
GENETIC CHARACTERIZATION OF RECTAL NEUROENDOCRINE TUMOURS
Rectal neuroendocrine tumours (RNETs) represent a clinically relevant yet poorly characterized subtype of gastroenteropancreatic neuroendocrine tumours (GEP-NETs), with a virtually uncharted genetic landscape. Extensive genetic characterization of rectal NET samples collected by 17 collaborative centers (the International NET Consortium) revealed a distinctive yet remarkably recurrent mutational profile. Specifically, these tumours are characterized by the presence of large, recurrent deletions on chromosomes 3, 10, 11, 13, and 18, and by striking enrichment of nonsynonymous rare variants in pathways responsible for genome integrity and DNA replication fidelity, extracellular matrix organization, glycosylation, and inflammasomes, among others. There is also strong germline predisposition in rectal NETs. We collected blood samples from 31 RNET patients and identified 40 statistically significant germline variants predisposing to RNET. We also experimentally confirmed 6 germline variants present in both blood-based and tumour-based samples. In addition, we demonstrated that these germline variants together have perfect prediction power and can be used as an early screening tool to assess an individual's future risk of RNET susceptibility. These findings will shape clinical management and open avenues for identifying novel therapeutic drug targets for RNETs. Taken together, these data support a common and uniquely distinct genetic aetiology for rectal NETs and suggest a strong germline predisposition component.
In summary, I have performed a set of computational analyses, including ethnicity prediction and matched-control selection, rare and common variant association studies, rare nonsynonymous variant collapsing analysis, pathway and weighted pathway-network analysis, ensemble machine learning based predictivity analysis with identification of feature importance, GO and NCG term analysis, and disease pathogenicity and regulatory variant analysis, among others.
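To make the collapsing step concrete, the following is a minimal sketch of a gene-level rare-variant collapsing (burden-style) test, assuming illustrative inputs: a dictionary mapping each gene to per-variant carrier indicators and a parallel case/control label list. It is a simplification for exposition, not the exact pipeline used in the study.

    # Minimal sketch of gene-level rare-variant collapsing (burden-style) analysis.
    # Inputs are illustrative: gene_to_carriers maps gene -> list of per-variant 0/1
    # carrier lists (one value per sample); labels is the parallel case/control list.
    from scipy.stats import fisher_exact

    def collapse_and_test(gene_to_carriers, labels):
        """Collapse rare variants per gene into a carrier indicator per sample and
        test carrier status against case/control labels with Fisher's exact test."""
        results = {}
        for gene, carrier_matrix in gene_to_carriers.items():
            collapsed = [int(any(v[i] for v in carrier_matrix))
                         for i in range(len(labels))]
            a = sum(1 for c, y in zip(collapsed, labels) if c and y == 1)  # case carriers
            b = sum(1 for c, y in zip(collapsed, labels) if c and y == 0)  # control carriers
            c = sum(1 for y in labels if y == 1) - a                        # case non-carriers
            d = sum(1 for y in labels if y == 0) - b                        # control non-carriers
            odds, p = fisher_exact([[a, b], [c, d]])
            results[gene] = (odds, p)
        return results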
GWAS MULTI-LOCUS ANALYSIS BASED ON GRAPH THEORY AND MACHINE LEARNING
The current form of genome-wide association studies (GWAS) is inadequate to accurately explain the genetics of complex traits, largely due to a lack of sufficient statistical power. It tests each variant individually, whereas current studies show that multiple variants with varying effect sizes act in a concerted way to produce a complex disease. To address this issue, we have developed an algorithmic framework that can effectively solve the multi-locus problem in GWAS with a high level of confidence. Our methodology consists of three novel algorithms based on graph theory and machine learning. It identifies a set of highly discriminating variants that are stable and robust with little (if any) spuriousness. Consequently, these variants, taken together, are likely to explain part of the missing heritability of a complex disease. The main objectives of this work include, but are not limited to: (1) avoiding any raw p-value threshold; (2) reducing the number of variants with minimal information loss; (3) identifying a subset of discriminating variants that are stable and robust across multiple bootstrapped samples; and (4) providing a theoretical framework to assess the robustness and statistical significance of a machine learning model of interest.
As stated above, our computational methodology consists of three interconnected algorithms that accomplish three fundamental objectives. First, we retain unique variants and discard those whose values are duplicated across the samples. We then select a set of representative variants using linkage disequilibrium measures. Finally, a subset of highly discriminating variants is identified using a robust and stable machine learning algorithm.
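The sketch below illustrates the three stages under simplifying assumptions (a 0/1/2 genotype matrix, Pearson r^2 as the LD measure, and a random forest as a stand-in machine learning model); it is meant to convey the structure of the pipeline rather than reproduce the published algorithms.

    # Three-stage sketch: deduplicate variants, prune by LD, rank by bootstrapped importance.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def deduplicate(G):
        """G: samples x variants genotype matrix (0/1/2). Keep one copy of duplicated columns."""
        _, keep = np.unique(G, axis=1, return_index=True)
        return np.sort(keep)

    def ld_prune(G, cols, r2_threshold=0.8):
        """Greedy LD pruning: keep a variant unless it is in high LD with one already kept."""
        kept = []
        for j in cols:
            if all(np.corrcoef(G[:, j], G[:, k])[0, 1] ** 2 < r2_threshold for k in kept):
                kept.append(j)
        return kept

    def stable_ranking(G, y, cols, n_boot=50, seed=0):
        """Average feature importances across bootstrapped fits to reduce spuriousness."""
        rng = np.random.default_rng(seed)
        scores = np.zeros(len(cols))
        for _ in range(n_boot):
            idx = rng.integers(0, len(y), len(y))          # bootstrap resample of samples
            clf = RandomForestClassifier(n_estimators=200, random_state=0)
            clf.fit(G[idx][:, cols], y[idx])
            scores += clf.feature_importances_
        return scores / n_boot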
To demonstrate the efficacy of the proposed algorithms, we considered an astigmatism case-control GWAS dataset. Astigmatism is a common eye condition that causes blurred vision because of an error in the shape of the cornea. The cause of astigmatism is not entirely known, but sizable heritability is assumed. Clinical studies show that developmental disorders (such as autism) and astigmatism co-occur in a statistically significant number of individuals. Classical GWAS analysis did not find any genome-wide statistically significant variants in this dataset. In contrast, we identified a set of stable, robust, and highly predictive variants that can together explain the genetics of astigmatism. We performed a set of biological enrichment analyses based on gene ontology (GO) terms, disease ontology (DO) terms, biological pathways, and networks of pathways to demonstrate the accuracy and novelty of our findings.
Rigorous experimental evaluations show that the proposed methodology can solve the GWAS multi-locus problem effectively and efficiently. It can identify signals with a high level of accuracy from GWAS datasets having a small number of samples. We believe that the proposed methodology, based on graph theory and machine learning, is the most comprehensive among the machine learning based tools in this domain.
METAGENOMIC SEQUENCE CLASSIFICATION
Metagenomics is a powerful tool for deciphering microbial communities by directly sampling genetic material from their natural habitats. Therefore, it can be applied in a wide variety of fields to solve practical challenges. Application areas include, but are not limited to, biofuels, biotechnology, food safety, agriculture, and medicine. It can be used for early detection of threats to food production (crop and animal diseases) and food safety (monitoring and early detection of dangerous microbial contaminants). Metagenomic sequencing shows promise as a sensitive and rapid method to diagnose infectious diseases by comparing the genetic material found in a patient's sample against a database of thousands of bacteria, viruses, and other pathogens. There is also strong evidence that microbes may contribute to many non-infectious chronic diseases, such as some forms of cancer, coronary heart disease, and diabetes. Thus, metagenomics offers a way to decipher the role of the human microbiome (the collective genome of our symbionts) in health and disease, in individuals and populations, and to develop novel diagnostic and treatment strategies based on this knowledge. Consequently, we need efficient computational methods to correctly identify microbes and their abundances in a metagenomic sample.
In metagenomics we study the genetic material of microorganisms (such as bacteria and viruses) recovered directly from environmental samples (such as the human gut, biomass, or food). The genetic material is sequenced to obtain fragmented short sequences called reads, from which we need to identify the microbes in the sample. Researchers have drafted complete genetic sequences (i.e., genomes) of thousands of different microbes, and the reads must be aligned onto these known reference genomes to detect microorganisms. Computationally this is a very challenging task, because there exist thousands of such microbial genomes and a metagenomic sample usually contains a massive number of reads. Several subsequence-based approaches exist for determining the identity and relative abundance of microbes in a heterogeneous microbial sample, but they suffer from low classification accuracy, high execution time, and high memory usage. Thus, we need efficient and effective computational techniques to quickly and accurately identify the microbes present in a metagenomic sample. To address these issues, we offer very fast and highly accurate metagenomic sequence classification algorithms based on identifying regions that are unique among the microbial genomes. These unique regions are combined to form a model sequence for each microbe. Instead of using the full genome sequences, these shorter pre-built model sequences are then employed to accurately detect taxonomic ranks and estimate their relative abundances. Experimental results show that the proposed algorithms are more effective and efficient than other state-of-the-art algorithms in terms of accuracy, memory, and runtime.
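As an illustration of the unique-region idea, the sketch below uses k-mers that occur in exactly one reference genome as that genome's model and assigns each read to the genome whose unique k-mers it shares most often; the choice of k and the data structures are assumptions made for brevity.

    # Unique k-mer index: k-mers occurring in exactly one genome vote for that genome.
    from collections import Counter, defaultdict

    def build_unique_kmer_index(genomes, k=31):
        """genomes: dict of genome_id -> sequence. Return k-mer -> genome_id for
        k-mers that appear in exactly one genome."""
        owner = {}
        for gid, seq in genomes.items():
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                owner[kmer] = gid if owner.get(kmer, gid) == gid else None
        return {kmer: gid for kmer, gid in owner.items() if gid is not None}

    def classify_read(read, index, k=31):
        votes = Counter(index[read[i:i + k]]
                        for i in range(len(read) - k + 1)
                        if read[i:i + k] in index)
        return votes.most_common(1)[0][0] if votes else None

    def abundance(reads, index, k=31):
        counts = defaultdict(int)
        for r in reads:
            gid = classify_read(r, index, k)
            if gid is not None:
                counts[gid] += 1
        total = sum(counts.values()) or 1
        return {gid: c / total for gid, c in counts.items()}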
ROBUST AND STABLE GENE SELECTION
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product such as a protein; for non-protein-coding genes (e.g., transfer RNA (tRNA) or small nuclear RNA (snRNA) genes), the product is a functional RNA. The genetic code preserved in a DNA sequence is "interpreted" by gene expression, and the properties of the expression give rise to the organism's phenotypes (i.e., observable traits such as the presence of a disease, height, etc.). Traditional statistical methods are designed to analyze the susceptibility conferred by genes from gene expression data with phenotypes by considering only a single gene at a time. On the contrary, it has been shown that multiple genes act together to cause many common diseases, and there are numerous challenges in designing and analyzing the joint effects of multiple differentially expressed genes. Nowadays, with next-generation sequencing methods (e.g., RNA-seq, CAGE, etc.), the expression of specific transcripts can be measured; the total number of human transcripts with protein-coding potential is estimated to be at least 204,950. The expression of those transcripts is typically measured in no more than a few thousand individuals, so we have many more variables p than samples n. Classical models require p < n, whereas here p >> n. Therefore, standard multi-variable statistical approaches are not very promising tools for detecting complex gene-gene interactions from genome-wide data. In contrast, machine learning algorithms provide several intuitive alternatives for performing multi-gene analysis within acceptable limits of time, memory, and cost, with good accuracy.
However, machine learning tools also struggle with highly underdetermined systems, where there are many more variables than samples. Consequently, it is often reported that several different spurious feature subsets (such as sets of genes or single nucleotide polymorphisms) may yield equally optimal results; this phenomenon is known as instability. Fortunately, ensemble techniques can be employed in the supervised gene selection setting. In statistics and machine learning, ensemble learning is a technique in which multiple learners are trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from training data, ensemble methods construct a set of hypotheses and combine them. We can obtain better predictive performance by combining multiple learning algorithms instead of employing only one learner; in particular, the generalization ability of an ensemble is usually much stronger than that of a single learner. Considering these facts, we have developed a very robust and stable gene selection algorithm that selects the most discriminating, non-spurious set of genes from gene expression datasets with phenotypes. Stability and robustness are ensured by class-level and instance-level perturbations, respectively. We performed rigorous experimental evaluations using 10 real microarray gene expression datasets with known phenotypes, which revealed that our algorithm outperforms state-of-the-art algorithms in this domain with respect to stability, robustness, and classification accuracy. We also performed biological enrichment analysis based on gene ontology biological process (GO-BP) terms, disease ontology (DO) terms, and biological pathways.
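The following is a hedged sketch of the ensemble idea: many base selectors are trained on perturbed versions of the data (bootstrap resampling at the instance level and, as an assumed stand-in for class-level perturbation, flipping a small fraction of labels), and genes that are consistently ranked near the top are retained. The base learner and parameters are illustrative, not those of the published algorithm.

    # Ensemble gene selection by aggregating top-ranked genes over perturbed rounds.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ensemble_gene_selection(X, y, top_m=50, n_rounds=100, flip_frac=0.05, seed=0):
        """X: samples x genes expression matrix, y: binary phenotype (numpy 0/1 array).
        Returns the selection frequency of each gene across perturbed rounds."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        freq = np.zeros(p)
        for _ in range(n_rounds):
            idx = rng.integers(0, n, n)                      # instance-level perturbation
            yb = y[idx].copy()
            flips = rng.choice(n, int(flip_frac * n), replace=False)
            yb[flips] = 1 - yb[flips]                        # class-level perturbation
            model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
            model.fit(X[idx], yb)
            top = np.argsort(-np.abs(model.coef_[0]))[:top_m]
            freq[top] += 1
        return freq / n_rounds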
WEIGHTED BIOLOGICAL PATHWAY NETWORK ANALYSIS
Alzheimer's disease (AD) is the most common form of dementia among older people. It is a complex disease, and the genetic and environmental factors behind it have not yet been conclusively established. Traditional statistical analyses are inadequate to identify variants, genes, or pathways capable of explaining AD as a unit. In this context, pathway network analysis based on a set of curated AD-specific genes identified in the literature can elucidate the biological mechanisms underlying AD. Through the network, we can infer influential pathways that can together explain AD. Consequently, we can target those pathways and their corresponding genes for further analysis to develop new drugs, discover novel AD-related genes, combine multiple hypotheses, and so forth.
We have developed a novel graph-theoretic algorithm that can elucidate complex biology from a given set of disease-related genes. It constructs a weighted network of enriched pathways in which the similarity score between a pair of pathways is defined in a context-specific manner. To make the network robust, we employ topological overlap techniques on top of the raw similarity measure. We then report the importance of each pathway with respect to the entire network, functional modules and the importance of each pathway within a specific module, gene clusters, and so forth. We also provide a method to identify a set of novel genes that can further explain the disease-related genes and the disease itself.
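A minimal sketch of the construction is shown below, assuming each enriched pathway is given as a set of member genes; the Jaccard index stands in for the context-specific similarity score used in the actual method, and the raw similarities are smoothed with a topological overlap measure before pathways are ranked by weighted degree.

    # Weighted pathway network: Jaccard similarity smoothed by topological overlap.
    import numpy as np

    def pathway_network(pathways):
        """pathways: dict of pathway_name -> set of member genes. Returns (names, TOM)."""
        names = list(pathways)
        m = len(names)
        A = np.zeros((m, m))
        for i in range(m):
            for j in range(i + 1, m):
                gi, gj = pathways[names[i]], pathways[names[j]]
                A[i, j] = A[j, i] = len(gi & gj) / len(gi | gj)   # Jaccard similarity
        k = A.sum(axis=1)
        tom = np.zeros((m, m))
        for i in range(m):
            for j in range(m):
                if i != j:
                    shared = A[i] @ A[j]     # diagonal of A is zero, so i and j terms vanish
                    tom[i, j] = (shared + A[i, j]) / (min(k[i], k[j]) + 1 - A[i, j])
        return names, tom

    def pathway_importance(names, tom):
        return dict(zip(names, tom.sum(axis=1)))  # weighted degree as an importance score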
We applied our algorithms to a set of AD-specific genes. They identified three distinct functional modules related to metabolism, cancer, and infectious-disease pathways. These findings match three recognized hypotheses in Alzheimer's disease, namely the "metabolism hypothesis," the "cell cycle hypothesis," and the "infectious disease hypothesis." By analyzing the curated genes shared among those functional modules, we can gain more understanding of this devastating disease. We also identified 24 novel AD-related genes, of which at least 14 are known to be involved in AD.
HAPLOTYPE PHASING
Inference of haplotypes, i.e., the sequence of alleles along the same chromosome, is a fundamental problem in genetics and is a key component of many analyses, including admixture mapping, identification of regions of identity by descent, and imputation. In this work we focused on polyploid haplotype phasing, where we aim to phase more than two haplotypes at the same time from sequencing data. The problem is much more complicated because the search space becomes much larger and the haplotypes no longer need to be complementary. We proposed two algorithms: (1) Poly-Harsh, a Gibbs-sampling-based algorithm that alternately samples haplotypes and read assignments to minimize the mismatches between the reads and the phased haplotypes, and (2) an efficient algorithm to concatenate haplotype blocks into contiguous haplotypes. Our experiments showed that our method improves the quality of the phased haplotypes over state-of-the-art methods. To the best of our knowledge, our algorithm for haplotype block concatenation is the first that leverages shared information across multiple individuals to construct contiguous haplotypes, and our experiments showed that it is both efficient and effective. These algorithms have been published in BMC Genomics 2018.
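For illustration only, the sketch below is a deterministic simplification of the alternating idea behind Poly-Harsh: given current haplotypes, each read is assigned to the haplotype it mismatches least, and each haplotype position is then rebuilt by majority vote over its assigned reads. The actual method samples both steps within a Gibbs framework.

    # Simplified alternating phasing: assign reads to haplotypes, then update by majority vote.
    import numpy as np

    def phase(reads, ploidy, n_sites, n_iter=20, seed=0):
        """reads: list of (positions array, alleles array) with alleles in {0, 1}."""
        rng = np.random.default_rng(seed)
        H = rng.integers(0, 2, (ploidy, n_sites))
        for _ in range(n_iter):
            assign = []
            for pos, alle in reads:                      # read-to-haplotype assignment
                mism = [(H[h, pos] != alle).sum() for h in range(ploidy)]
                assign.append(int(np.argmin(mism)))
            for h in range(ploidy):                      # haplotype update by majority vote
                votes = np.zeros((n_sites, 2))
                for (pos, alle), a in zip(reads, assign):
                    if a == h:
                        votes[pos, alle] += 1
                covered = votes.sum(axis=1) > 0
                H[h, covered] = np.argmax(votes[covered], axis=1)
        return H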
BIOLOGICAL SEQUENCE COMPRESSION
Next-generation sequencing (NGS) technologies produce millions to billions of short reads from DNA molecules simultaneously in a single run within a very short time, leading to a sharp decline in whole-genome sequencing costs. As a result, we are observing an explosion of genomic data from various species. Storing these data is a task that biologists must perform daily, and compression can play an important role in saving space. Moreover, as the size of the data transmitted through the Internet increases, transmission cost and network congestion increase proportionally; here again compression can help. Although we can compress sequencing data with standard general-purpose algorithms, these algorithms may not compress biological sequences effectively, since they do not exploit the inherent properties of biological data: genomic sequences often contain repetitive elements (e.g., microsatellite sequences), the input sequences might exhibit high levels of similarity (for example, multiple genome sequences from the same species), and the statistical and information-theoretic properties of genomic sequences can also be exploited. The following five versions of the compression problem have been identified in the literature: (i) genome compression with a reference: here we are given many (hopefully very similar) genomic sequences, and the goal is to compress all the sequences using one of them as the reference; the idea is to exploit the fact that the sequences are very similar, so for every sequence other than the reference we only need to store the differences between the reference and the sequence itself; (ii) reference-free genome compression: the same as problem (i), except that there is no reference sequence and each sequence has to be compressed independently; (iii) reference-free reads compression: compressing biological reads when there is no clear choice of reference; (iv) reference-based reads compression: here the complete reads need not be stored, only their variations with respect to a reference genome; and (v) metadata and quality score compression: compressing the quality sequences associated with the reads and metadata such as read names, platform, and project identifiers.
In this project we focus on problems (i), (iii), and (v). Genomes from the same species (such as human) are highly similar; thus, by taking one genome as a reference we can effectively compress multiple target genomes by recording only the differences (i.e., variations) between the target and reference genomes. This presents a substantial computational challenge due to the sheer size of the human genome, and existing methods often fail to compress genomes meaningfully while also suffering from high execution time and memory usage. To address these issues, we have developed novel and highly efficient referential genome compression algorithms, built on novel local search and penalty-based local alignment algorithms, that can be used to store, retrieve, and transmit biological data very quickly and efficiently. Extensive experiments on real datasets revealed that our methods are very effective and efficient and outperformed notable algorithms in this domain. Now consider problem (iii). Every day scientists sequence the whole genomes of new species by generating short sequencing fragments (called reads) from DNA molecules. Here there is no clear choice of reference. The number of such short reads can reach billions for a species, and without compression they could take terabytes of disk storage. To compress reads effectively we need to detect the common regions shared by highly similar reads, which is a substantial computational challenge due to the huge number of reads. Considering this, we have developed a novel and efficient multiple local alignment method to detect such common regions. The method clusters the reads so that each cluster contains reads with the largest overlaps; for each cluster we build a consensus sequence, and the variations of each read against the consensus are then encoded and saved to disk along with the consensus. Rigorous experiments on real datasets revealed that our method is very effective and efficient and outperformed state-of-the-art techniques in this domain.
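A small sketch of the referential-compression idea for problem (i) follows, assuming a greedy longest-match strategy seeded by a k-mer index of the reference; the target is emitted as a stream of (reference position, length) copy operations plus literal characters. This is a simplification of, not a substitute for, the published local search and alignment algorithms.

    # Greedy referential compression: encode the target as copies from the reference plus literals.
    def compress_against_reference(target, reference, k=16):
        index = {}
        for i in range(len(reference) - k + 1):
            index.setdefault(reference[i:i + k], []).append(i)
        ops, i = [], 0
        while i < len(target):
            seed = target[i:i + k]
            best = None
            for p in index.get(seed, []):
                l = k
                while i + l < len(target) and p + l < len(reference) \
                        and target[i + l] == reference[p + l]:
                    l += 1                       # extend the seed match as far as possible
                if best is None or l > best[1]:
                    best = (p, l)
            if best:
                ops.append(("copy", best[0], best[1]))
                i += best[1]
            else:
                ops.append(("literal", target[i]))
                i += 1
        return ops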
BIOLOGICAL SEQUENCE ERROR CORRECTION
Due to the limitations of sequencing technologies, reads can contain errors: some nucleotides can be randomly changed, deleted, and/or inserted. Sequencing errors complicate research areas that depend on short read analysis, including re-sequencing, single nucleotide polymorphism (SNP) calling, and genome assembly. Sequence assemblers often fail to produce contiguous stretches of DNA sequence if the reads contain errors, making it impossible to identify which stretches of the DNA molecule contain genes or to analyze those genes for changes that may cause diseases and other abnormalities. Likewise, if the reads contain errors, it is very difficult and often impossible to find structural variations among individuals, which matters because many of these variants may be linked to debilitating genetic diseases in humans, such as cancers, dementia, Alzheimer's disease, and Parkinson's disease. Therefore, error correction is a crucial step in biological sequence analysis.
Error correction is a particularly challenging task because we have a massive number of reads, and inaccurate modification of reads affects all downstream genome analyses. To address these issues, in this project we developed both de novo (without a reference genome) and hybrid (with a reference genome) error correction algorithms to correct substitution errors in reads. To correct the erroneous bases of a read, we find other neighboring reads that come from the same genomic region and sufficiently overlap with that read. This is done with a novel hashing scheme: reads containing an identical subsequence fall into the same hash bucket. By analyzing all such buckets, we carefully create, for each read, a list of neighboring reads having the largest overlaps with it. Each read is then aligned with its neighbors using our greedy alignment algorithm. After alignment, we detect and correct each erroneous base of a read by taking the consensus nucleotide over all neighboring reads at that position. We have extended the de novo idea and developed a novel hybrid method (the first of its kind) for correcting short biological sequencing reads. The basic idea is to use a reference sequence. We create two disjoint subsets of reads from the given set: one subset contains reads that align uniquely onto the reference with a small number of errors, and the other contains the remaining unaligned reads. Since similar reads aligned onto the reference span the same genomic positions, we determine the consensus nucleotide that occurs most often at each position and replace the other nucleotides of the reads spanning that position with the consensus. Reads that could not be aligned onto the reference are corrected using our de novo error correction algorithm. Unlike other notable algorithms, our methods tune their parameters automatically. Rigorous experimental evaluations on real bacterial sequencing read datasets show that our methods outperform other notable methods in terms of accuracy, sensitivity, and execution time.
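The following is a condensed sketch of the de novo correction idea under simplifying assumptions: reads sharing an identical k-mer fall into the same bucket, the shared k-mer fixes a relative offset between them, and each base is corrected toward the consensus of its aligned neighbours. The thresholds and the simple majority rule are illustrative.

    # De novo substitution-error correction via shared-k-mer bucketing and column consensus.
    from collections import defaultdict, Counter

    def correct_reads(reads, k=21, min_support=3):
        buckets = defaultdict(list)                 # k-mer -> [(read_idx, offset)]
        for ridx, r in enumerate(reads):
            for i in range(len(r) - k + 1):
                buckets[r[i:i + k]].append((ridx, i))
        corrected = [list(r) for r in reads]
        for members in buckets.values():
            if len(members) < min_support:
                continue
            # Align members so the shared k-mer starts at the same coordinate.
            base = max(off for _, off in members)
            width = base + max(len(reads[ridx]) - off for ridx, off in members)
            for col in range(width):
                column = Counter()
                for ridx, off in members:
                    pos = col - (base - off)
                    if 0 <= pos < len(reads[ridx]):
                        column[reads[ridx][pos]] += 1
                if column:
                    consensus, support = column.most_common(1)[0]
                    if support >= min_support:
                        for ridx, off in members:
                            pos = col - (base - off)
                            if 0 <= pos < len(reads[ridx]):
                                corrected[ridx][pos] = consensus
        return ["".join(c) for c in corrected]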
SPLICED JUNCTION DISCOVERY
Transcription is the first step of gene expression. During RNA splicing, an important post-transcriptional step, one or more segments (introns) of the pre-mRNA are removed and the remaining segments (exons) are concatenated to form the mature mRNA product; these mRNAs are then translated into the proteins that carry out essentially every task in our bodies. Through alternative splicing, it is possible to produce different transcripts (isoforms) from the same genetic locus. This process occurs in over 90% of multi-exon human genes and greatly increases the diversity of possible transcripts in the transcriptome. Aberrant RNA splicing has been found to be associated with many human diseases; for this reason, techniques to identify and quantify splicing events are important to biology and medicine. By detecting novel splice junctions, we can discover new genes or new functions of known genes and their implications for human health and disease, and we can estimate gene expression levels and thereby relate genes to diseases. RNA-seq generates millions of short sequence fragments, or reads, in a single run within a very brief period. These reads can be aligned to a reference genome to construct a whole-genome transcriptome map, which can be used to measure levels of gene expression and to identify novel splice variants of genes. But to fully utilize such reads, one needs to accurately align the sequencing reads across intron boundaries in the reference genome. The problem can be stated as follows: given a position in the middle of a window of DNA sequence elements (nucleotides), decide whether it is an intron-exon boundary, an exon-intron boundary, or neither, and align the reads accordingly. This mapping problem represents a significant challenge given the small read lengths and inherently high error rates. Some of the best-known existing algorithms for identifying splice junctions have shortcomings: (1) the junction boundaries are predicted only within a large range, (2) they take a long time, and (3) their splice junction predictions are inaccurate. To address these issues, we proposed a novel method that accurately detects splice junctions while overcoming the shortcomings stated above.
Our proposed algorithm runs in two phases. In the first phase it detects initially unmapped reads (IUMs) and constructs representatives from those IUMs. The representatives are then used to find candidate splice junctions (SJs) through our novel gapped alignment procedure. In the second phase we prune the candidate splice junctions to reduce false positives with the help of a machine learning technique; in this sense the algorithm can be thought of as self-learning. We first generate a set of candidate splice junctions, which are further processed to obtain a set of highly accurate splice junctions that serve as positive examples for the learning step; we also generate a set of negative examples. Every training example is represented by characteristics observed locally around it, called features or dimensions in a classification setting. We derived 8 such features that together decide whether a candidate SJ is accurate or inaccurate. These examples are used to train the learner, and the learned model is then used to identify splice junctions. Our learning is based on Support Vector Machines (SVMs). Experimental results show that our method is more effective and efficient than other state-of-the-art algorithms, such as TopHat2, in terms of accuracy and runtime.
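A hedged sketch of the pruning phase is given below: each candidate splice junction is described by a small feature vector (the specific 8 features are not reproduced here), positive and negative examples train an SVM, and low-scoring candidates are discarded. The estimator settings are assumptions for illustration.

    # SVM-based pruning of candidate splice junctions given precomputed feature vectors.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    def prune_candidates(pos_features, neg_features, candidate_features, threshold=0.5):
        """Train on labeled positive/negative junction features, then keep the indices
        of candidates whose predicted probability of being a true junction is high."""
        X = np.vstack([pos_features, neg_features])
        y = np.concatenate([np.ones(len(pos_features)), np.zeros(len(neg_features))])
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
        clf.fit(X, y)
        scores = clf.predict_proba(candidate_features)[:, 1]
        return [i for i, s in enumerate(scores) if s >= threshold]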
EFFICIENT SCAFFOLDING
Due to erroneous base calling and low read coverage, sequence assemblers often fail to reconstruct an entire DNA molecule and instead output a set of overlapping segments collectively called contigs. The final step of the sequencing process, called scaffolding, is to place the contigs in the correct order with precise orientation. Without scaffolding we cannot build a complete genome sequence, and consequently we cannot perform downstream genetic analysis in a meaningful way; for example, we would not be able to identify genes spanning more than one contig, so their roles in human health and disease would remain unknown. Scaffolding is therefore an essential step in obtaining the complete genome sequence. We have proposed a set of very efficient algorithms that find the precise order and orientation of the contigs accurately within a reasonable amount of time.
The problem of finding the correct order of the contigs can be posed as that of finding a permutation of the contigs that optimizes an objective criterion. It is known to be NP-hard, i.e., it cannot, in general, be solved exactly within a reasonable amount of time. Scaffolding techniques typically exploit additional information such as mate-pairs, paired-ends, or optical restriction maps (ORMs). In this project we employed ORMs to find the relative order and precise orientation of contigs produced by sequence assemblers; our algorithms are quadratically faster than the previous ORM-based algorithm and achieve higher accuracy. An optical restriction map provides an ordered list of fragment sizes of the whole DNA molecule obtained by cleaving it with a specific restriction enzyme at its recognition sequence. Given a set of contigs, we know only the ordered fragment lengths of the DNA molecule and the corresponding recognition sequence. We first search for the positions at which the recognition sequence occurs in each contig, and from those positions we create a list of ordered fragment sizes for each contig. To find the best placement of a contig we developed multiple novel penalty-based greedy placement algorithms: we align the ordered fragment lengths of a contig at each possible place in the ORM and compute a greedy placement score, penalizing the contig whenever it misses a recognition site for that placement. Over all possible placements, we choose the placement of a contig with the best score. The experimental results show that our proposed algorithms are superior in terms of both runtime and accuracy.
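The sketch below illustrates the penalty-based greedy placement score, assuming both the optical map and each contig have been reduced to ordered restriction-fragment lengths; the fixed penalty for a missed recognition site is an assumed constant, and orientation can be handled by also scoring the reversed fragment list.

    # Penalty-based greedy placement of a contig's fragment lengths along an optical map.
    def placement_score(contig_frags, orm_frags, offset, miss_penalty=1000):
        score = 0
        for i, f in enumerate(contig_frags):
            j = offset + i
            if j >= len(orm_frags):
                score += miss_penalty            # contig fragment has no map fragment to match
            else:
                score += abs(f - orm_frags[j])   # disagreement in fragment length
        return score

    def best_placement(contig_frags, orm_frags, miss_penalty=1000):
        """Return (offset, score) of the lowest-scoring placement along the map."""
        best = None
        for off in range(len(orm_frags)):
            s = placement_score(contig_frags, orm_frags, off, miss_penalty)
            if best is None or s < best[1]:
                best = (off, s)
        return best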
GENOTYPE-PHENOTYPE CORRELATIONAL ANALYSIS
Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation among people. The genotype of an organism can be summarized by the set of its SNPs, while its phenotype is its observable physical or biochemical characteristics (such as the presence of a disease, or height), determined by both genetic make-up and environmental influences. Researchers have found strong correlations between SNPs and genetic diseases (such as cancer, diabetes, Alzheimer's disease, and dementia). Investigations that try to understand human variability using SNPs fall under genome-wide association studies (GWAS). A crucial step in GWAS is identifying the correlation between genotypes (SNPs) and phenotypes (i.e., characteristics such as the presence of a disease). Identifying such correlations can be modeled as the k-locus problem (where k is a fixed number of SNPs) or the multi-locus problem (where k is a variable). Humans have roughly 10 million SNPs, and finding correlations among them is very hard. In this project we have developed efficient methods for the 2-locus, 3-locus, and multi-locus problems.
For the 2-locus problem (i.e., k = 2), we represent each SNP as a binary vector. We cluster the vectors based on randomly selected subsequences and iterate this procedure multiple times. We then compute pairwise Hamming distances between vectors in each cluster and output the most correlated pair of SNPs, i.e., the pair with the maximum difference in correlation between cases and controls. We extend the 2-locus idea to solve the 3-locus problem: we first identify a constant number of the most relevant pairs of SNPs using the 2-locus algorithm, then generate candidate triples by augmenting each of these pairs with one additional SNP at a time; for each candidate we compute a score and output the triple whose score is maximal. We have also developed an algorithm for the multi-locus problem, where multiple highly correlated SNPs act together to cause a disease. Since the number of SNPs is very large (roughly 10 million in the human genome), computing correlations among SNPs is computationally very demanding and could take years to complete. To effectively prune noisy and unimportant SNPs, we project the space of SNPs so that the pairwise distances between SNPs are approximately preserved in a lower-dimensional space. To do this efficiently, we first rank the SNPs by importance using a machine learning algorithm; after selecting the top-ranked SNPs, we project them into a lower k-dimensional space for various values of k. In the projected space, we obtain k linear combinations of SNPs acting together to cause the disease.
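A rough sketch of the 2-locus step follows, with simplifying assumptions: SNP vectors are bucketed by the values they take on a random subset of samples (an LSH-style stand-in for the clustering on random subsequences), and candidate pairs within a bucket are scored by the difference in correlation between cases and controls.

    # 2-locus search: bucket SNPs by random sample subsets, score pairs by case/control
    # correlation difference.
    import numpy as np
    from collections import defaultdict

    def two_locus(G, y, n_rounds=10, subset_size=12, seed=0):
        """G: SNPs x samples binary matrix, y: binary case/control label per sample."""
        rng = np.random.default_rng(seed)
        cases, controls = y == 1, y == 0
        best = (None, -1.0)
        for _ in range(n_rounds):
            cols = rng.choice(G.shape[1], subset_size, replace=False)
            buckets = defaultdict(list)
            for s in range(G.shape[0]):
                buckets[tuple(G[s, cols])].append(s)      # SNPs agreeing on the subset
            for members in buckets.values():
                for a in range(len(members)):
                    for b in range(a + 1, len(members)):
                        i, j = members[a], members[b]
                        diff = abs(np.corrcoef(G[i, cases], G[j, cases])[0, 1]
                                   - np.corrcoef(G[i, controls], G[j, controls])[0, 1])
                        if np.isfinite(diff) and diff > best[1]:
                            best = ((i, j), diff)
        return best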
SCALABLE CLUSTERING
Conventional clustering algorithms suffer from poor scalability, especially when the data dimension is very large; clustering large datasets may take days. For applications such as weather forecasting, time plays a crucial role and such run times are unacceptable, and it is perfectly acceptable to obtain approximate clusters if we can do so within a short period of time. Considering these facts, we proposed a novel deterministic sampling technique that can be used to speed up any clustering algorithm. Our empirical results show that the proposed algorithm yields a speedup of more than an order of magnitude over exact hierarchical clustering algorithms, with excellent accuracy; in fact, on many datasets we obtain accuracy better than that of exact hierarchical clustering algorithms. Our algorithm is generic in nature and can be employed in the context of any other clustering technique (such as k-means, k-medians, etc.) as well. This generic technique has been published in ADMA 2013.
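The sketch below conveys the sampling-based speedup; the sampling rule here (evenly spaced points along a simple deterministic ordering) is an assumed stand-in for the scheme in the paper. Only the sample is clustered hierarchically, and every remaining point is attached to the cluster of its nearest sampled point.

    # Speed up hierarchical clustering by clustering a deterministic sample and
    # assigning the remaining points to the nearest sampled point's cluster.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial import cKDTree

    def sampled_clustering(X, n_clusters, sample_size):
        order = np.argsort(X.sum(axis=1))                      # deterministic ordering
        sample = order[np.linspace(0, len(X) - 1, sample_size).astype(int)]
        Z = linkage(X[sample], method="average")
        sample_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        nearest = cKDTree(X[sample]).query(X)[1]               # nearest sampled point
        return sample_labels[nearest]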
SIMILAR PAIRS OF POINTS IDENTIFICATION
The closest pair of points problem (CPP) is an important problem in computational geometry in which we must find, from a set of points in a metric space, the pair with the smallest distance between them. This problem arises in several applications such as, but not limited to, clustering, graph partitioning, image processing, pattern identification, and intrusion detection. For example, in air-traffic control we must monitor aircraft that come too close together, since this may indicate a possible collision. Numerous algorithms have been presented for solving the CPP. For instance, for n points in O(1) dimensions there exists an O(n log n) time algorithm, and there also exist a few randomized algorithms with expected linear run time. However, these algorithms do not perform well in practice; the algorithms employed in practice have worst-case quadratic run time. In this work we present an elegant approximation algorithm for the CPP dubbed MSPP (Mining Similar Pairs of Points). It is faster than the currently best-known algorithms while maintaining very good accuracy. The proposed algorithm also detects sets of closely similar pairs of points in terms of Euclidean distance and Pearson's correlation coefficient, and can be adapted to numerous real-world applications. We have demonstrated its applicability in clustering, dimension reduction, and co-expression networks.
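For illustration, the following is an approximation in the spirit of MSPP rather than the published algorithm: the points are projected onto a few random directions, sorted along each, and only points that land close together in some projection are compared.

    # Approximate closest pair via random projections and windowed comparisons.
    import numpy as np

    def approx_closest_pair(X, n_projections=8, window=16, seed=0):
        rng = np.random.default_rng(seed)
        best = (None, np.inf)
        for _ in range(n_projections):
            u = rng.normal(size=X.shape[1])          # random direction
            order = np.argsort(X @ u)
            for k in range(len(order)):
                for m in range(k + 1, min(k + window, len(order))):
                    i, j = order[k], order[m]
                    d = np.linalg.norm(X[i] - X[j])
                    if d < best[1]:
                        best = ((i, j), d)
        return best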
TIME SERIES MOTIF DISCOVERY
Time Series Motif Mining (TSMM) is a crucial problem that can be thought of as the CPP in a very high-dimensional space. In one version of the TSMM problem, we are given a sequence S of real numbers and an integer l, and the goal is to identify the two subsequences of S of length l that are most similar to each other among all pairs of subsequences of length l. These most similar subsequences are referred to as time series motifs; an l-mer is simply a contiguous subsequence of S of length l. A typical value of l of practical interest is several hundred or more, and for such values the conventional algorithms for CPP take an unacceptable amount of time (because of their exponential dependence on the dimension). Designing an efficient, practical, and exact algorithm for the TSMM problem remains an ongoing challenge. To address these issues, we have designed both exact and approximate algorithms for TSMM. Although the worst-case time complexity of our exact algorithm is quadratic in the number of subsequences, experimental evaluations show that it is indeed useful in practice. These results have been published in IEEE ICDM 2017.
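A minimal exact baseline for the TSMM formulation above is sketched below, assuming Euclidean distance between non-overlapping l-mers and using early abandoning to cut distance computations short; the published algorithms are considerably more sophisticated.

    # Brute-force motif pair search with early abandoning on the squared distance.
    import numpy as np

    def motif_pair(S, l):
        S = np.asarray(S, dtype=float)
        n = len(S) - l + 1
        best = (None, np.inf)
        for i in range(n):
            for j in range(i + l, n):                 # skip trivially overlapping matches
                d, bound = 0.0, best[1] ** 2
                for a, b in zip(S[i:i + l], S[j:j + l]):
                    d += (a - b) ** 2
                    if d >= bound:                    # early abandoning
                        break
                else:
                    best = ((i, j), d ** 0.5)
        return best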
RANDOMIZED FEATURE SELECTION
Feature selection is the problem of identifying a subset of the most relevant features for model construction. This problem has been well studied and plays a vital role in machine learning. In this research endeavor we explored efficient ways of finding the most relevant features from a set of features and devised a series of randomized search methods that are generic in nature and can be applied with any learning algorithm. To demonstrate the validity of our approaches, we applied them in three different domains, namely biological data processing, data integration, and materials property prediction. The simulation results show that our proposed techniques are reliable, scalable, and efficient, and our experiments also reveal that our feature selection algorithms perform better than some of the best-known algorithms in the current literature. These algorithms appeared in DMIN 2013 and IJFCS 2015.
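A generic sketch of the randomized search idea follows: random feature subsets are drawn repeatedly, each is scored with cross-validation, and the best-scoring subset is kept. The subset size and estimator are illustrative choices, not those used in the papers.

    # Randomized feature-subset search scored by cross-validation.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    def randomized_feature_search(X, y, subset_size=20, n_trials=200, seed=0):
        rng = np.random.default_rng(seed)
        best_subset, best_score = None, -np.inf
        for _ in range(n_trials):
            cols = rng.choice(X.shape[1], subset_size, replace=False)
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, cols], y, cv=5).mean()
            if score > best_score:
                best_subset, best_score = cols, score
        return best_subset, best_score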
SCALABLE ADDRESS AUTOCONFIGURATION
Address autoconfiguration is one of the fundamental issues in mobile ad hoc networks (MANETs). A node (i.e., a communicating device) needs some form of identity before participating in any sort of communication, so each host in a MANET must be uniquely addressed so that packets can be relayed hop-by-hop and ultimately delivered to the desired destination. Moreover, nodes in a MANET are free to move and organize themselves in an arbitrary fashion; therefore, any fixed infrastructure-based solution for assigning identities (i.e., IP addresses) is not directly applicable to MANETs. In this research, we proposed an efficient and scalable address autoconfiguration protocol that automatically configures a network by assigning unique IP addresses to all nodes with very low overhead and minimal cost. Evenly distributed Duplicate-IP address Detection Servers are used to ensure the uniqueness of an IP address during the address assignment session. These algorithmic techniques have been published in the Journal of Network and Systems Management (JNSM) 2010 and Ad Hoc Now 2009.
RELIABLE BROADCASTING
Conventional broadcasting protocols suffer from network congestion, frequent message losses, and corruption of broadcast messages due to the vast number of duplicate packets transmitted through the network. In this work, we proposed an efficient, scalable, and reliable broadcast protocol that delivers a message across the entire network such that every node is guaranteed to receive it, with low overhead and minimal cost. Scalability and reliability of the broadcast transmission are ensured by organizing the entire network into a hierarchy of grids/squares. Simulation results show that our proposed algorithm is reliable, scalable, and robust, and outperforms other promising broadcasting protocols in the current literature. The protocol has been published in IEEE AINA 2010.