Research

Progress in high-throughput measurement technologies for molecular biology, such as high-throughput sequencing, microarrays, multi-dimensional proteomics and metabolomics, is driving modern biology to overlap heavily with the data and information sciences.

Computational biology and bioinformatics are research fields that focus on developing and efficiently applying computational and statistical methods for data analysis and design, in order to address the data science challenges of modern molecular biology and medical science:

  • Design and data analysis methods play a central role in enabling molecular-based personalized medicine, including in fast-growing fields like cancer therapy and non-invasive prenatal diagnostics.
  • Design algorithmics and data analysis also play a role in the development of synthetic biology, driving our ability to efficiently design and utilize molecular devices.
  • Mining healthcare-related data at various levels and from different organizations is fast emerging as a tool for improving patient care and human wellbeing. Machine learning tools are central to successful inference in such data.
  • Analyzing data in studies related to environmental protection and to energy efficiency is of great societal importance. It is also challenging, because such data have specific properties, noise characteristics and diversity that the analysis must address.

The algorithmic and statistical aspects of the above scientific endeavors present diverse, deep and fascinating challenges, and constitute the focus of our group’s research.

We develop statistical and algorithmic methods for designing devices and assays and for analyzing data, and apply these methods in collaborative studies. We are interested in data derived from molecular and medical measurement and in environmental science data. We also apply data science and statistical techniques to data from other domains. On the design side we are interested in optimizing the design of synthetic devices and measurement systems to increase efficiency and effectiveness. We often develop software tools, based on the methods developed, that serve a broad scientific community. We are actively collaborating in studies that produce challenging data and are constantly striving to learn more about how to improve our approaches and about new challenges.

Some examples of active areas and past projects:

  • Development of statistical tools and methods, including statistics in ranked lists – methods and tools to enable inference in high dimensional data and in noisy datasets. Related projects include:
    • Minimum Hypergeometric Statistics (mHG)
      Consider a ranked list of elements and a binary labeling associated with all elements in the list. The mHG statistic measures the density of 1s at the top of the resulting binary vector. Our work included the full characterization of the distribution of the mHG statistic assuming a uniform null model (Eden et al, PLoS CB 2007), the development of specialized tools that use mHG (see some examples below) and the development of variants, generalizations and extensions (Leibovich et al, NAR 2012, Steinfeld et al, NAR 2013).
    • GOrilla
      A tool for identifying and visualizing enriched GO terms in ranked lists of genes, based on the mHG statistic (e.g. Steinfeld et al, Bioinformatics 2008, Eden et al, BMC Bioinformatics 2009, Leibovich et al, NAR 2012).
    • miTEA
      A tool for identifying enriched miR targets in ranked lists of genes, based on an extension of the mHG statistic (Steinfeld et al, NAR 2013).
    • RCoS
    • Motif discovery
  • Synthetic biology and applications – design of reagents and application development for studying cellular regulation, efficient production and other biological processes
    • Using oligonucleotide libraries (OLs) to infer regulatory and other biological mechanisms
      Using variants of the protocol developed in Sharon et al, NBT 2012, we design synthetic libraries and assays to study and characterize various biological processes and phenomena.
    • Modelling and designing controlled fermentation processes for biofuel
  • Molecular data analysis
    • Transcriptomics
    • DNA copy number data
    • Glycomics
    • HiC data analysis
    • Epigenetics
      In Farkash-Amar et al XXX, Yaffe et al XXX and XXX we report work with Itamar Simon’s Lab that characterized the time of replication in mammals and its relationship with various sequence properties and chromatin states. In Straussman et al we describe a novel microarray design for methylation, including the characterization of protected regions that are not covered by the classical island definition. The assay was developed and analyzed in collaboration with the Cedar Lab at the Hebrew University of Jerusalem. In Neijman et al, also a collaboration with the Cedar Lab, we characterize regions that are commonly methylated in cancer and note that the same regions are also increasingly methylated as humans age.
    • Data analysis for high-throughput sequencing – investigating questions related to the interpretation of data from emerging sequence-related technologies
    • Joint data analysis – methods and tools to enhance interpretation and inference in rich datasets
  • Computational biology in cancer and other diseases; collaborative research in medical science – providing statistical analysis, algorithm development and design optimization to support, improve and sometimes drive medical science studies, in collaboration with leading molecular medicine groups. This includes analysis of single-cell data.
  • Assay and device optimization
    • IEF optimization
      We have developed a computational methodology that, given a focus group of proteins, provides a subset of isoelectric focusing (IEF) fractions such that, when measuring complex samples, one can run LC-MS/MS only on these fractions and still obtain measurement results for >80% of the proteins. Doing so reduces LC-MS/MS expenses by a factor of 2-4 compared to running LC-MS/MS on all fractions.
      This work is described in Kifer et al XXX.
    • Probe design
    • OL QC assessment
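
To make the mHG statistic mentioned above concrete, here is a minimal Python sketch. It assumes the standard formulation: for a binary vector of length N containing B ones, the score is the minimum, over all cutoffs n, of the hypergeometric tail probability of seeing at least as many 1s among the top n elements as were actually observed. The function names `hypergeom_tail` and `mhg_score` are illustrative only and are not taken from any of the group's tools. Because of the minimization over cutoffs, the score is not itself a p-value; the correction based on the exact null distribution (Eden et al, PLoS CB 2007) is not implemented here.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) where X counts the 1s among n elements drawn without
    replacement from N elements, B of which are labeled 1."""
    total = comb(N, n)
    upper = min(n, B)
    # math.comb(a, k) returns 0 when k > a, so impossible terms vanish.
    favorable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favorable / total


def mhg_score(labels):
    """Return (score, cutoff) for a ranked list of 0/1 labels.

    score  : minimum hypergeometric tail over all prefixes (uncorrected)
    cutoff : prefix length n at which the minimum is attained (0 if none)
    """
    N = len(labels)
    B = sum(labels)
    best_score, best_cutoff = 1.0, 0
    ones_so_far = 0
    for n, label in enumerate(labels, start=1):
        ones_so_far += label
        score = hypergeom_tail(N, B, n, ones_so_far)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff


if __name__ == "__main__":
    # Hypothetical toy ranking: all three 1s sit at the top of ten elements.
    print(mhg_score([1, 1, 1, 0, 0, 0, 0, 0, 0, 0]))
```

This straightforward version recomputes the tail at every cutoff and is meant for illustration on short lists; it makes no attempt at the efficiency needed for genome-scale rankings.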