Recent progress in high-throughput measurement technologies for molecular biology, such as high-throughput sequencing, microarrays, multi-dimensional proteomics, glycomics and metabolomics, is driving modern biology to overlap heavily with the data and information sciences. Computational biology and bioinformatics are research fields that focus on developing and efficiently applying computational data analysis and design algorithms to address the data science challenges presented by the rapid progress of modern molecular biology.
Design and data analysis methods play a central role in enabling molecular-based personalized medicine. They also play a role in the development of synthetic biology, driving our ability to efficiently design and utilize molecular devices.
Mining healthcare-related data at various levels and from different organizations is fast emerging as a tool for improving patient care and human wellbeing.
The algorithmic and statistical aspects of these developing fields of science present diverse, deep and fascinating challenges.
Our group develops statistical and algorithmic methods for analyzing medical and molecular measurement data. We also develop optimal design algorithms for measurement systems and assays, to increase their efficiency and effectiveness. We often implement these methods in software tools that serve a broad scientific community.
We also apply data science, machine learning and statistical techniques to data from other domains.
Current and past projects include:
- Statistics in ranked lists – methods and tools to enable inference in noisy datasets. Related projects include:
- Minimum Hypergeometric Statistics (mHG)
Consider a ranked list of elements and a binary labeling associated with all elements in the list. The mHG statistic measures the density of 1s at the top of the resulting binary vector. Our work included the full characterization of the distribution of mHG assuming a uniform null model (Eden et al, PLoS CB 2007), the development of specialized tools that use mHG (see below) and the development of variants, generalizations and extensions (Leibovich et al, NAR 2012; Steinfeld et al, NAR 2013).
A tool for identifying and visualizing enriched GO terms in ranked lists of genes, based on the mHG statistics. (Eden et al, BMC Bioinformatics, 2009)
A tool for identifying enriched sequence motifs in ranked lists of sequences.
- miTEA and MULSEA.
Tools that support the analysis of miRNA targets in ranked lists of genes. (Steinfeld et al, NAR 2012; Cohn-Alperovich et al, Bioinformatics, 2016)
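As a rough illustration of the mHG statistic described above, the following Python sketch computes the minimum, over all prefixes of a ranked binary vector, of the hypergeometric tail probability of seeing at least the observed number of 1s in that prefix. The function names `hypergeom_tail` and `mhg_score` are hypothetical and chosen for this example; they are not taken from any of the tools listed here. The sketch returns only the raw statistic. It is not a p-value, since it does not correct for the minimization over prefixes.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(population N, B ones, n draws)."""
    upper = min(n, B)
    favorable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favorable / comb(N, n)


def mhg_score(labels):
    """Return (score, cutoff) for a ranked list of 0/1 labels.

    score  : smallest hypergeometric tail over all prefixes (illustrative only)
    cutoff : prefix length n at which that smallest tail is attained
    """
    N = len(labels)
    B = sum(labels)
    best_score, best_n = 1.0, 0
    ones_so_far = 0
    for n, label in enumerate(labels, start=1):
        ones_so_far += label
        tail = hypergeom_tail(N, B, n, ones_so_far)
        if tail < best_score:
            best_score, best_n = tail, n
    return best_score, best_n


if __name__ == "__main__":
    # Three 1s concentrated at the top of a list of ten elements.
    print(mhg_score([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]))
```

This direct version recomputes each tail from scratch, which keeps it short but makes it slow on long lists. It is meant only to make the definition concrete.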
- Synthetic biology – We are interested in the design of reagents, application development and data analysis related to synthetic biology and genetic engineering in general. More specifically, we are using synthetic DNA libraries for studying and optimizing biological processes.
- Oligo Libraries (OLs)
Oligonucleotide libraries consist of a large number (tens of thousands) of short synthetic DNA molecules, designed to perform a certain task or to investigate one or more phenomena. The combination of a high multiplexing rate and complete design control allows for a very diverse range of applications. Examples include studying sequence determinants of TF binding (Sharon et al Nat Biotech 2012, Sharon et al Gen Res 2014, Levo et al Mol Cell 2017) and of IRES functionality (Gabay-Weingarten Science 2016), optimizing protein design (ongoing project) and studying bacterial insulators (Levy, Anavy et al in review).
- DNA Barcodes
Many high-throughput experiments and measurements in synthetic biology involve the use of sample tags based on DNA sequences. We design methods for generating large, reliable and efficient tagging systems using concepts from coding and information theory.
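To give a flavor of how coding theory enters barcode design, the Python sketch below greedily builds a set of DNA tags in which every pair differs in at least a chosen number of positions (Hamming distance). This is a generic textbook construction shown under our own assumptions, not the group's design method, and the names `hamming` and `greedy_barcodes` are hypothetical. With a minimum distance of 3, any single substitution error in a tag can still be traced back to a unique original tag.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_barcodes(length, min_distance, alphabet="ACGT"):
    """Scan all words in lexicographic order and keep each word that is
    at least min_distance away from every word kept so far."""
    chosen = []
    for letters in product(alphabet, repeat=length):
        word = "".join(letters)
        if all(hamming(word, other) >= min_distance for other in chosen):
            chosen.append(word)
    return chosen


if __name__ == "__main__":
    tags = greedy_barcodes(length=4, min_distance=3)
    print(len(tags), tags[:5])
```

The exhaustive scan is practical only for short tags. A real tagging system would also need to handle insertions and deletions, GC content and homopolymer runs, all of which this sketch ignores.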
- DNA based data storage
DNA-based storage systems are considered a potential solution to the surging demand for digital information storage. We develop novel methodologies that exploit the properties of synthetic DNA to increase the capacity and reduce the cost of such systems.
- Flux balance analysis (FBA)
Metabolic models of organisms can help design and optimize expression and production systems. One of our current active projects addresses the combination of several organisms in a fermentation process designed to optimize ethanol production (Vitkin et al Technology 2015, Jiang et al Sci Rep 2016). We also used high-throughput fitness measurement results to infer genes coding for orphan E. coli model reactions (Vitkin, Solomon et al in review).
- Epigenetics – design of reagents, application development and data analysis as related to various aspects of molecular regulation in living cells.
- In collaboration with the Cedar Lab at HUJI we investigated DNA methylation and its sequence determinants (Schlesinger et al Nat Gen 2007, Straussman et al Nat SMB 2009). We further investigated the relationship between age-related DNA methylation and cancer-related silencing (Nejman et al Cancer Res 2014).
- In collaboration with the Simon Lab at HUJI we investigated time of replication (ToR) and associated properties of genomic regions in mice and men (Farkash-Amar et al Gen Res 2008, Farkash-Amar et al, PLoS One 2012). Arto is a software tool that supports the analysis of ToR data.
- Analysis of HiC data – We are interested in general embedding techniques (MDS and NMDS) and their application to HiC data. We also work on direct spatial statistical enrichment approaches (Ben Elazar et al NAR 2013, Ben Elazar et al Bioinformatics 2016).
- Computational biology in cancer and other diseases; collaborative research in medical science – We develop statistical analyses and algorithms to support, improve and sometimes drive medical science studies, in collaboration with leading molecular medicine groups. Examples of projects can be found here.
- Data integration and joint analysis of data from several different platforms – methods and tools to enhance interpretation and inference in rich datasets. In breast cancer we studied the in-trans effects of copy number changes and pointed out events that lead to significant changes in pathway regulation (Ragle-Aure et al PLoS One 2013), as well as the direct activity of miRNA in driving disease-related processes (Enerly, Steinfeld et al PLoS One 2011).
- Glycomics. In another breast cancer cohort we showed the association of serum glycans with molecular processes in the tumor (Haakensen et al Mol Onc 2015).
- Data analysis for single cell RNAseq – investigating questions related to the interpretation of emerging sequence related technologies
- Computational aspects of environmental science.
- Machine learning and data science
- Improved assay design. We apply optimization algorithms to optimize assay components. Examples include CGH probes (Barret et al PNAS 2003, Lipson et al Bioinformatics 2007) and optimized IEF-LC/MS (Kifer et al IEEE Bioinformatics 2017).