Recent progress in high-throughput measurement technologies for molecular biology, such as high-throughput sequencing, microarrays, multi-dimensional proteomics, glycomics and metabolomics, is driving modern biology to overlap heavily with the data and information sciences. Computational biology and bioinformatics are research fields that focus on developing and efficiently applying computational data analysis and design algorithms to address the data science challenges presented by the fast progress of modern molecular biology.

Design and data analysis methods are playing a central role in enabling molecular based personalized medicine. They also play a role in the development of synthetic biology, driving our ability to efficiently design and utilize molecular devices.

Mining healthcare-related data at various levels and from different organizations is fast emerging as a tool for improving patient care and human wellbeing.

The algorithmic and statistical aspects of these developing fields of science present diverse, deep and fascinating challenges.

Our group develops statistical and algorithmic methods for analyzing medical and molecular measurement data. We also develop optimization and design algorithmics for measurement systems or for assays and reagents to increase efficiency and effectiveness. For example: DNA storage …

We often implement the methods as software tools that serve a broad scientific community. We are particularly interested in the efficient and effective use of synthetic nucleic acids in several contexts.

In collaboration with the FACT center at IDC we are working on cryptographic techniques to enable privacy preserving machine learning and inference performed on data from several independent parties.

We also apply data science, machine learning and statistical techniques to data from other domains.

Current and past projects include:

  1. Synthetic biology.
    We are interested in the design of reagents, application development and data analysis related to synthetic biology and genetic engineering in general. More specifically, we are using synthetic DNA libraries for studying and optimizing biological processes.
    DNA based data storage
    DNA based storage systems are considered a potential solution to the surging demand for digital information storage. We develop novel methodologies that utilize the properties of synthetic DNA in order to increase the capacity and reduce the costs of such systems.
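    As a point of reference for what capacity means here, the following is a minimal sketch of the naive baseline mapping of two bits per nucleotide. It is an illustration only and not the group's methodology: the function names and the base ordering are assumptions, and a practical system would add error correction and sequence constraints on top of such a mapping.

```python
# Naive 2-bits-per-base mapping (illustrative assumption, not a published scheme).
BASES = "ACGT"  # A=00, C=01, G=10, T=11 (arbitrary choice of ordering)


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, most significant bits first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Invert bytes_to_dna; the sequence length must be a multiple of four."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    encoded = bytes_to_dna(b"DNA")
    print(encoded, dna_to_bytes(encoded))
```

    Under this mapping every byte costs exactly four nucleotides, which is the baseline against which denser or cheaper schemes can be compared.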
    Oligo Libraries (OLs)
    Oligonucleotide libraries consist of a high multiplicity (tens of thousands) of short synthetic DNA molecules that are designed to perform a certain task or to investigate a certain phenomenon or a set of phenomena. The combination of the high multiplexing rate and complete design control allows for a very diverse range of applications. Examples include studying sequence determinants of TF binding (Sharon et al Nat Biotech 2012, Sharon et al Gen Res 2014, Levo et al Mol Cell 2017) and of IRES functionality (Gabay-Weingarten Science 2016), optimizing protein design (ongoing project) and studying bacterial insulators (Levy, Anavy et al Cell Reports 2016).
    QC and statistics for OLs
    In collaboration with Eitan Yaakobi’s group we are developing methods and tools for analyzing synthetic oligo libraries and for inferring and reporting library characteristics.
    DNA Barcodes
    Many high throughput experiments and measurements in synthetic biology involve the use of sample tags based on DNA sequences. We design methods for generating large, reliable and efficient tagging systems using concepts from coding and information theory.
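    To make the coding-theory connection concrete, here is a minimal sketch of one textbook construction: a greedy search for barcodes with a guaranteed minimum Hamming distance, plus nearest-barcode decoding. It is an illustration under simplifying assumptions (substitution errors only, no constraints on GC content or homopolymers) and is not the group's design method. All function names are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_barcodes(length: int, min_distance: int, alphabet: str = "ACGT"):
    """Scan all sequences in lexicographic order and keep each one that is
    at least min_distance away from every barcode kept so far."""
    chosen = []
    for letters in product(alphabet, repeat=length):
        candidate = "".join(letters)
        if all(hamming(candidate, c) >= min_distance for c in chosen):
            chosen.append(candidate)
    return chosen


def decode(read: str, barcodes, min_distance: int):
    """Return the closest barcode if it lies within the guaranteed
    correction radius (min_distance - 1) // 2, otherwise None."""
    radius = (min_distance - 1) // 2
    best = min(barcodes, key=lambda c: hamming(read, c))
    return best if hamming(read, best) <= radius else None


if __name__ == "__main__":
    tags = greedy_barcodes(length=5, min_distance=3)
    print(len(tags), tags[:4])
    print(decode("TACCC", tags, min_distance=3))
```

    With a minimum distance of 3, any single substitution in a tag can be corrected. The exhaustive scan is only practical for short tags; it is meant to show the principle rather than to scale.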
    Flux balance analysis (FBA)
    Metabolic models of organisms can help design and optimize expression and production systems. One of our current active projects addresses the combination of several organisms in a fermentation process designed to optimize ethanol production (Vitkin et al Technology 2015, Jiang et al Sci Rep 2016). We also used high throughput fitness measurement results to infer genes coding for orphan reactions in the E. coli model (Vitkin, Solomon et al, BMC Bioinformatics 2018).
  2. Statistics in ranked lists
    Methods and tools to enable inference in noisy datasets. Related projects include:
    Minimum Hypergeometric Statistics (mHG)
    Consider a ranked list of elements and a binary labeling associated with all elements in the list. The mHG statistic measures the density of 1s at the top of the resulting binary vector. Our work included the full characterization of the distribution of mHG assuming a uniform null model (Eden et al, PLoS CB 2007), the development of specialized tools that use mHG (see below) and the development of variants, generalizations and extensions (Leibovich et al, NAR 2012; Steinfeld et al, NAR 2013).
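    The definition above can be illustrated with a minimal sketch, assuming the standard formulation in which the score is the minimum, over all cutoffs n, of the hypergeometric tail probability of seeing at least the observed number of 1s among the top n elements. The function names are hypothetical and this is not the code behind the published tools. Because the score is a minimum over many cutoffs, it is not itself a p-value; the exact p-value computation is not shown here.

```python
from math import comb


def hypergeom_tail(N: int, B: int, n: int, b: int) -> float:
    """P(X >= b) where X counts the 1s in n draws, without replacement,
    from a population of N elements of which B are labeled 1."""
    total = comb(N, B)
    favorable = sum(
        comb(n, i) * comb(N - n, B - i) for i in range(b, min(n, B) + 1)
    )
    return favorable / total


def mhg_score(labels):
    """Return (score, cutoff): the smallest tail probability over all
    prefixes of the ranked binary vector, and the prefix length attaining it."""
    N, B = len(labels), sum(labels)
    best, best_n = 1.0, 0
    ones_so_far = 0
    for n, label in enumerate(labels, start=1):
        ones_so_far += label
        tail = hypergeom_tail(N, B, n, ones_so_far)
        if tail < best:
            best, best_n = tail, n
    return best, best_n


if __name__ == "__main__":
    # All three 1s sit at the top of a list of eight elements.
    print(mhg_score([1, 1, 1, 0, 0, 0, 0, 0]))
```

    In the example, the best cutoff is n = 3, where the tail probability is 1/56. This direct implementation recomputes each tail from scratch, which is adequate for short lists only.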
    GOrilla is a tool for identifying and visualizing enriched GO terms in ranked lists of genes, based on the mHG statistic (Eden et al, BMC Bioinformatics, 2009).
    DRIMust is a tool for identifying enriched sequence motifs in ranked lists of sequences.
    miTEA and MULSEA
    miTEA and MULSEA are tools that support the analysis of miRNA targets in ranked lists of genes. (Steinfeld et al, NAR 2012; Cohn-Alperovich et al, Bioinformatics, 2016)
  3. Privacy preserving machine learning
  4. Epigenetics
    Design of reagents, application development and data analysis as related to various aspects of molecular regulation in living cells.
    DNA methylation
    In collaboration with the Cedar Lab at HUJI we investigated DNA methylation and its sequence determinants (Schlesinger et al Nat Gen 2007, Straussman et al Nat SMB 2009). We further investigated the relationship between age related DNA methylation and silencing that is related to cancer (Nejman et al Cancer Res 2014).
    Deep learning models for the prediction of methylation status of individual CpGs
    We produced a model that predicts DNA methylation for a given sample in any CpG position based solely on the sample’s gene expression profile and the sequence surrounding the CpG. Depending on gene-CpG proximity, our model attains a Spearman correlation of up to 0.84 for thousands of CpG sites on two separate test sets of CpG positions and subjects (cancer and healthy samples). Our approach, especially the use of attention, offers a novel framework with which to extract valuable insights from gene expression data when combined with sequence information.
    Time of replication (ToR)
    In collaboration with the Simon Lab at HUJI we investigated time of replication and associated properties of genomic regions in mice and men (Farkash-Amar et al Gen Res 2008, Farkash-Amar et al, PLoS One 2012). Arto is a software tool that supports the analysis of ToR data.
  5. Analysis of HiC data
    We are interested in general embedding techniques (MDS and NMDS) and their application to HiC data. We also work on direct spatial statistical enrichment approaches (Ben Elazar et al NAR 2013, Ben Elazar et al Bioinformatics 2016). In recent work we extended the methods to seek statistical enrichment at arbitrary points in space and used them to elucidate the 3D functional organization of unicellular genomes.
  6. Computational biology in cancer and other disease
    Collaborative research in medical science – We develop statistical analysis and algorithms to support, improve and sometimes drive medical science studies, in collaboration with leading molecular medicine groups. Examples of projects:
    o Data integration and joint analysis of data from several different platforms
    Methods and tools to enhance interpretation and inference in rich datasets. In breast cancer we studied the in-trans effects of copy number changes and pointed out events that lead to significant changes in pathway regulation (Ragle-Aure et al PLoS One 2013), as well as the direct activity of miRNA in driving disease related processes (Enerly, Steinfeld et al PLoS One 2011).
    o Cross platform normalization for miRNA in breast cancer
    In another breast cancer cohort we showed the association of serum glycans with molecular processes in the tumor (Haakensen et al Mol Onc 2015).
    o Data analysis for single cell RNA-Seq
    Investigating questions related to the interpretation of emerging sequence related technologies.
    o Digital pathology and molecular measurement data.

    In this work our goal is to create a molecular cartography of the tumor micro-environment. We train a weakly-supervised model, using bulk molecular measurements instead of pathologists’ annotations, to detect molecular traits on whole-slide images (WSIs).

    o Cryptography protocols for machine learning and data sharing

  7. Computational aspects of environmental science
    In collaboration with Alex Golberg’s Lab at TAU we are working on optimizing and investigating various stages of biorefineries designed to produce fuel from readily available algae.
    Machine learning and data science – applications in other domains
    o Stealth detection in computer networks
    o IoT or NOT
    o Interpolation in the latent representation of images

  8. Improved assay design and inference
    We apply optimization algorithms to optimize assay components and to assess performance. Examples include:
    o CGH probes (Barret et al PNAS 2003, Lipson et al Bioinformatics 2007)
    o Optimized IEF-LC/MS (Kifer et al IEEE Bioinformatics 2017)
    o SNIRO
    o CRISPECTOR. A software tool for analyzing CRISPR performance.