Research

Recent progress in high throughput measurement technologies for molecular biology, such as high throughput sequencing, microarrays, multi-dimensional proteomics, glycomics and metabolomics, is driving modern biology to overlap heavily with the data and information sciences. Computational biology and bioinformatics are research fields that focus on developing and efficiently applying computational data analysis and design algorithmics to address the data science challenges presented by the fast progress of modern molecular biology.

Design and data analysis methods are playing a central role in enabling molecular based personalized medicine. They also play a role in the development of synthetic biology, driving our ability to efficiently design and utilize molecular devices.

Mining healthcare related data at various levels and from different organizations is fast emerging as a tool for improving patient care and human wellbeing.

The algorithmic and statistical aspects of these developing fields of science present diverse, deep and fascinating challenges.

Our group develops statistical and algorithmic methods for analyzing medical and molecular measurement data. We also develop optimization and design algorithmics for measurement systems or for assays and reagents to increase efficiency and effectiveness. For example: DNA storage …

We often implement the methods in software tools that serve a broad scientific community. We are particularly interested in the efficient and effective use of synthetic nucleic acids in several contexts.

In collaboration with the FACT center at IDC we are working on cryptographic techniques to enable privacy preserving machine learning and inference performed on data from several independent parties.

We also apply data science, machine learning and statistical techniques to data from other domains.

Current and past projects include:

  1. Synthetic biology.
    We are interested in the design of reagents, application development and data analysis related to synthetic biology and genetic engineering in general. More specifically, we are using synthetic DNA libraries for studying and optimizing biological processes.

    1. DNA based data storage
      DNA based storage systems are considered a potential solution for the surging demand for digital information storage. We develop novel methodologies that utilize the properties of synthetic DNA to increase the capacity and reduce the costs of such systems. In our most recent publication in Nature Biotechnology we introduce the use of composite DNA letters to increase the logical density of DNA storage above the strict, single molecule, theoretical limit of 2 bits per synthesis cycle.
    2. Oligonucleotide Library (OL) based assays
      Oligonucleotide libraries consist of a high multiplicity (tens of thousands) of short synthetic DNA molecules that are designed to perform a certain task or investigate a certain phenomenon or a set of them.
    3. QC and statistics for OLs
      In collaboration with Eitan Yaakobi’s group we are developing methods and tools for analyzing synthetic oligo libraries and for inferring and reporting library characteristics. This work was accepted to be presented at the NVMW 2020 conference.
    4. DNA Barcode design
      Many high throughput experiments and measurements in synthetic biology involve the use of sample tags based on DNA sequences. We design methods for generating large, reliable and efficient tagging systems using concepts from coding and information theory.
    5. Flux balance analysis (FBA)
      Metabolic models of organisms can help design and optimize expression and production systems. One of our current active projects addresses the combination of several organisms in a fermentation process designed to optimize ethanol production (Vitkin et al Technology 2015, Jiang et al Sci Rep 2016). We also used high throughput fitness measurement results to infer genes coding for orphan E. coli model reactions (Vitkin, Solomon et al, BMC Bioinformatics 2018).
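
The 2 bits per synthesis cycle limit mentioned under DNA based data storage follows from having four possible nucleotides per position: log2(4) = 2. The sketch below illustrates how a larger alphabet raises that figure. It assumes, as a simplification for illustration, that a composite letter is a mixture of A, C, G and T in integer proportions that sum to a resolution parameter k; the function names and the parameter k are introduced here and are not taken from the text above. The result is an idealized count that ignores synthesis errors and the sequencing depth needed to tell letters apart.

```python
from math import comb, log2


def composite_alphabet_size(k: int) -> int:
    """Count mixtures (a, c, g, t) of non-negative integers with a + c + g + t = k.

    This is the standard stars-and-bars count C(k + 3, 3).
    """
    if k < 1:
        raise ValueError("resolution k must be a positive integer")
    return comb(k + 3, 3)


def bits_per_cycle(k: int) -> float:
    """Idealized logical density: log2 of the number of distinct letters."""
    return log2(composite_alphabet_size(k))


if __name__ == "__main__":
    # k = 1 recovers the four standard nucleotides and the 2-bit limit.
    for k in (1, 2, 6):
        print(k, composite_alphabet_size(k), round(bits_per_cycle(k), 2))
```

With k = 1 the alphabet is the standard four letters and the density is exactly 2 bits; any larger k gives more than 2 bits per position under these assumptions.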
  2. Statistics in ranked lists
    Methods and tools to enable inference in noisy datasets. Related projects include:

    1. Minimum Hypergeometric Statistics (mHG)
      Consider a ranked list of elements and a binary labeling associated with all elements in the list. The mHG statistic measures the density of 1s at the top of the resulting binary vector. Our work included the full characterization of the distribution of mHG assuming a uniform null model (Eden et al, PLoS CB 2007), the development of specialized tools that use mHG (see below) and the development of variants, generalizations and extensions (Leibovich et al, NAR 2012; Steinfeld et al, NAR 2013).
    2. GOrilla
      GOrilla is a tool for identifying and visualizing enriched GO terms in ranked lists of genes, based on the mHG statistic. (Eden et al, BMC Bioinformatics, 2009)
    3. DRIMust
      DRIMust is a tool for identifying enriched sequence motifs in ranked lists of sequences.
    4. miTEA and MULSEA
      miTEA and MULSEA are tools that support the analysis of miRNA targets in ranked lists of genes. (Steinfeld et al, NAR 2012; Cohn-Alperovich et al, Bioinformatics, 2016)
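
The mHG statistic described above can be sketched in a few lines. The version below is a minimal illustration, assuming the usual formulation: for every cutoff n, compute the hypergeometric tail probability of seeing at least as many 1s in the top n positions as were observed, and take the minimum over all cutoffs. The function names are introduced here for illustration and do not correspond to the group's tools. Note that the minimum itself is a score, not a p-value, because it is optimized over cutoffs; converting it to a p-value requires the null distribution mentioned above.

```python
from math import comb


def hypergeom_tail(N: int, B: int, n: int, b: int) -> float:
    """P(X >= b) where X counts 1s in n draws without replacement
    from a population of N elements containing B ones."""
    total = comb(N, B)
    upper = min(n, B)
    favorable = sum(comb(n, i) * comb(N - n, B - i) for i in range(b, upper + 1))
    return favorable / total


def mhg_score(labels: list[int]) -> tuple[float, int]:
    """Return (minimum tail probability, cutoff at which it is attained).

    `labels` is the binary vector induced by the ranking, top element first.
    A cutoff of 0 is returned when no prefix scores below 1.0.
    """
    N = len(labels)
    B = sum(labels)
    best_score, best_cutoff = 1.0, 0
    ones_so_far = 0
    for n, label in enumerate(labels, start=1):
        ones_so_far += label
        p = hypergeom_tail(N, B, n, ones_so_far)
        if p < best_score:
            best_score, best_cutoff = p, n
    return best_score, best_cutoff


if __name__ == "__main__":
    # All three 1s sit at the top of a list of ten elements.
    score, cutoff = mhg_score([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
    print(score, cutoff)
```

This direct version recomputes the tail at every cutoff, which is adequate for short lists; long gene lists would call for an incremental computation.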
  3. Privacy preserving machine learning

    In work with Adi Akavia (Haifa Univ), Hayim Shaul (IDC) and Mor Weiss (IDC) we developed a protocol for privacy preserving linear regression (WHAC 2019). We are currently working on extending this approach to additional machine learning algorithms.
  4. Epigenetics
    Design of reagents, application development and data analysis as related to various aspects of molecular regulation in living cells.

    1. DNA methylation
      In collaboration with the Cedar Lab at HUJI we investigated DNA methylation and its sequence determinants (Schlesinger et al Nat Gen 2007, Straussman et al Nat SMB 2009). We further investigated the relationship between age related DNA methylation and silencing that is related to cancer (Nejman et al Cancer Res 2014).
    2. Deep learning models for the prediction of methylation status of individual CpGs
      We produced a model that predicts DNA methylation for a given sample in any CpG position based solely on the sample’s gene expression profile and the sequence surrounding the CpG. Depending on gene-CpG proximity, our model attains a Spearman correlation of up to 0.84 for thousands of CpG sites on two separate test sets of CpG positions and subjects (cancer and healthy samples). Our approach, especially the use of attention, offers a novel framework with which to extract valuable insights from gene expression data when combined with sequence information.
    3. Time of replication (ToR)
      In collaboration with the Simon Lab at HUJI we investigated time of replication and associated properties of genomic regions in mice and men (Farkash-Amar et al Gen Res 2008, Farkash-Amar et al, PLoS One 2012). Arto is a software tool that supports the analysis of ToR data.
  5. Analysis of HiC data
    We are interested in general embedding techniques (MDS and NMDS) and their application to HiC data. We also work on direct spatial statistical enrichment approaches (Ben Elazar et al NAR 2013, Ben Elazar et al Bioinformatics 2016). In recent work we extended the methods to seek statistical enrichment in arbitrary points in space and used the methods to elucidate the 3D functional organization of unicellular genomes.
  6. Computational biology in cancer and other disease
    1. Collaborative research in medical science
      We develop statistical analysis and algorithms to support, improve and sometimes drive medical science studies, in collaboration with leading molecular medicine groups.
    2. Data integration and joint analysis of data from several different platforms
      Methods and tools to enhance interpretation and inference in rich datasets. In breast cancer we studied the in-trans effects of copy number changes and pointed out events that lead to significant changes in pathway regulation (Ragle-Aure et al PLoS One 2013), as well as the direct activity of miRNA as a driver of disease related processes (Enerly, Steinfeld et al PLoS One 2011).
    3. Cross platform normalization for miRNA in breast cancer
    4. Glycomics.
      In another breast cancer cohort we showed the association of serum glycans to molecular processes in the tumor (Haakensen et al Mol Onc 2015)
    5. Data analysis for single cell RNA-Seq
      Investigating questions related to the interpretation of emerging sequence related technologies.
    6. Digital pathology and molecular measurement data.
      In this work our goal is to create a molecular cartography of the tumor micro-environment. We train a weakly-supervised model that uses bulk molecular measurements, instead of pathologists’ annotations, to detect molecular traits on whole slide images (WSIs).
  7. Computational aspects of environmental science
    In collaboration with Alex Golberg’s Lab at TAU we are working on optimizing and investigating various stages of biorefineries designed to produce fuel from readily available algae.
  8. Machine learning and data science – applications in other domains
    1. Stealth detection in computer network
    2. IoT or NOT
    3. Interpolation in the latent representation of images
  9. Improved assay design and inference
    We apply optimization algorithms to optimize assay components and to assess performance. Examples include:

    1. CGH probes (Barret et al PNAS 2003, Lipson et al Bioinformatics 2007)
    2. Optimized IEF-LC/MS (Kifer et al IEEE Bioinformatics 2017).
    3. SNIRO (Peleg et al Analytica Chim Acta 2019)
    4. CRISPECTOR. A software tool for analyzing CRISPR performance.