Projects – Autumn 2019

Swiss Data Science Center
The Swiss Data Science Center is a joint venture between EPFL and ETH Zurich. Its mission is to accelerate the adoption of data science and machine learning techniques within academic disciplines of the ETH Domain, the Swiss academic community at large, and the industrial sector. In particular, it addresses the gap between those who create data, those who develop data analytics and systems, and those who could potentially extract value from it. The center is composed of a large multi-disciplinary team of data and computer scientists, and experts in select domains, with offices in Lausanne and Zurich (www.datascience.ch).

Projects


Title: Deep learning for feature extraction of spatiotemporal data: application to climate science

Open for applications

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Unsupervised autoencoders [1,2] are used to reduce the dimensionality of the data in the absence of labeled data. The low-dimensional representation can be used for feature extraction, but also as input to other machine learning algorithms. Here, we will use the low-dimensional representations to estimate kernel similarities in the reduced space and compute the eigenvectors of an associated graph Laplacian [3]. The input to the autoencoder will be sequences of images capturing the spatiotemporal structure of the data.
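
As a rough illustration of the second step described above (kernel similarities in the reduced space, then eigenvectors of a graph Laplacian), here is a minimal NumPy sketch. It is not the project's implementation: the Gaussian kernel, the median bandwidth heuristic, the symmetric normalization, and the function name `laplacian_eigenmaps` are all assumptions, and the `codes` array is a synthetic stand-in for the output of a trained autoencoder.

```python
import numpy as np

def laplacian_eigenmaps(codes, n_components=2, epsilon=None):
    """Leading nontrivial eigenpairs of a normalized graph Laplacian.

    codes: (n_samples, n_features) low-dimensional representations.
    """
    codes = np.asarray(codes, dtype=float)
    # Pairwise squared Euclidean distances between the codes.
    sq_norms = np.sum(codes ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * codes @ codes.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    if epsilon is None:
        # Assumed heuristic: median nonzero squared distance as bandwidth.
        epsilon = np.median(sq_dists[sq_dists > 0])
    # Gaussian kernel similarities (dense graph, self-loops included).
    W = np.exp(-sq_dists / epsilon)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(codes)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(L_sym)  # ascending order
    # Skip the first eigenpair, whose eigenvalue is (numerically) zero.
    return eigenvalues[1:n_components + 1], eigenvectors[:, 1:n_components + 1]

# Synthetic stand-in for autoencoder codes: a noisy periodic trajectory.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
codes = np.column_stack([np.cos(t), np.sin(t), 0.1 * rng.standard_normal(t.size)])
eigenvalues, eigenvectors = laplacian_eigenmaps(codes, n_components=2)
```

Each column of `eigenvectors` assigns one value per sample, so it can be inspected as a time series when the samples are ordered in time.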

We will apply this method to the analysis of satellite images of cloud cover (26 years of data). The eigenfunctions of the Laplacian computed in the original high-dimensional space capture physically meaningful patterns intrinsic to the atmosphere, such as the annual cycle, El Niño or the diurnal cycle [4]. Our goal is to compare the graph Laplacian technique with the method outlined here, and to assess whether dimension reduction with an autoencoder improves the quality of the extracted signals.

Goals/benefits:

  • Working with machine learning and deep learning libraries in Python (pandas, scikit-learn, PyTorch)
  • Becoming familiar with the analysis of time series (power spectra, auto-correlation)
  • Working with real-world satellite observations
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Machine learning and deep learning (advanced or intermediate skills)
  • Python (advanced skills)
  • Interested in interdisciplinary applications

References:

[1] Tutorial autoencoders – http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/

[2] G.E. Hinton and R.R. Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, 2006

[3] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[4] E. Szekely, D. Giannakis, A.J. Majda, “Extraction and predictability of coherent intraseasonal signals in infrared brightness temperature data”, Climate Dynamics, 2016

Contact: Eniko Szekely, eniko.szekely@epfl.ch


Title: Quantifying Inference Risks in Health Databases

Open for applications

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The decreasing cost of molecular profiling has provided the biomedical research community with a plethora of new types of biomedical data, enabling a breakthrough towards more precise and personalized medicine. In addition to biomedical data, preventive and social medicine practitioners increasingly use environmental data, such as location or pollution.

However, the release and usage of these intrinsically highly sensitive data pose new threats to privacy.

The goal of this project is to design an evaluation framework to systematize the analysis of inference attacks that exploit biomedical data, such as the genome, but also environmental data collected by research institutes and hospitals.  In this endeavor, you will make use of probabilistic graphical models or other machine-learning models and test your models with real datasets provided by the IUMSP (Institut Universitaire de Médecine Sociale et Préventive) at CHUV. Time permitting, you will also develop defense mechanisms to reduce the impact of these inference attacks while keeping high levels of utility for medical researchers.
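
To make the notion of an inference attack concrete, here is a minimal sketch of the simplest possible case: an adversary updating a belief about one sensitive attribute after observing one released attribute, using Bayes' rule. The function name `posterior` and all probabilities are hypothetical and chosen only for illustration; a real evaluation framework would involve many dependent variables, for instance in a probabilistic graphical model.

```python
import numpy as np

def posterior(prior, likelihood):
    """Bayes' rule: P(state | observation) is proportional to
    P(observation | state) * P(state)."""
    unnormalized = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    return unnormalized / unnormalized.sum()

# Hypothetical numbers, for illustration only.
prior = np.array([0.9, 0.1])       # P(no condition), P(condition)
likelihood = np.array([0.2, 0.7])  # P(observed attribute | each state)

post = posterior(prior, likelihood)
# One possible risk measure: how much the adversary's belief in the
# sensitive state increases after seeing the released attribute.
inference_gain = post[1] - prior[1]
```

With these numbers the adversary's belief in the sensitive state rises from 0.10 to 0.28, which is the kind of quantity such a framework would aim to measure systematically.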

Goals/Benefits:

  • Becoming familiar with probabilistic/machine-learning models
  • Access to real-life health-related datasets
  • Gaining experience in fields of growing importance
  • Working in an interdisciplinary research environment

Prerequisites:

  • Good Python and/or Matlab skills
  • Good background in probabilities and machine learning
  • Being interested in working in a multidisciplinary environment

Contact: Mathias Humbert, mathias.humbert@epfl.ch


Title: Assessing and Thwarting Privacy Risks in Data Science Platforms

Open for applications

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

In order to enable reproducibility of scientific studies, the Swiss Data Science Center (SDSC) is developing a flexible and scalable platform called Renga, which automates data provenance recording, maintenance and traceability in the form of a knowledge graph. Because potentially sensitive datasets are used in studies performed with Renga, one of the key challenges is to provide all the aforementioned features while guaranteeing a high level of privacy.

The goal of this project is to evaluate the feasibility and risk of various types of inference attacks against SDSC’s knowledge graph. In particular, you will investigate whether metadata exposed through Renga can leak sensitive information and, if so, develop countermeasures to mitigate this risk. You will also study how outputs of machine-learning models can expose membership in the dataset used to train these models and provide defense mechanisms to reduce the impact of this attack.
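
The membership question mentioned above can be illustrated with a very simple baseline attack: guess that a record was in the training set whenever the model is highly confident about it. The sketch below is an assumption-laden toy, not the attack studied in the project: the function names, the threshold, and the synthetic confidence distributions (members drawn with higher confidence than non-members) are all invented for illustration.

```python
import numpy as np

def threshold_attack(confidences, threshold):
    """Guess 'member' whenever the model's confidence exceeds the threshold."""
    return np.asarray(confidences) > threshold

def attack_accuracy(member_conf, nonmember_conf, threshold):
    """Fraction of correct membership guesses over both groups."""
    guessed_members = threshold_attack(member_conf, threshold)
    guessed_nonmembers = threshold_attack(nonmember_conf, threshold)
    # A guess is correct if members are flagged and non-members are not.
    correct = np.concatenate([guessed_members, ~guessed_nonmembers])
    return correct.mean()

# Synthetic confidences (assumption): an overfitted model tends to be more
# confident on its training records than on unseen ones.
rng = np.random.default_rng(0)
member_conf = rng.beta(8, 2, size=1000)
nonmember_conf = rng.beta(4, 4, size=1000)

accuracy = attack_accuracy(member_conf, nonmember_conf, threshold=0.7)
```

An accuracy well above 0.5 on balanced groups indicates leakage; a defense would aim to push this value back towards chance while preserving model utility.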

Goals/Benefits:

  • Acquiring knowledge on machine learning and privacy
  • Practical experience with a real-world data science platform
  • Gaining experience in fields of growing importance

Prerequisites:

  • Good background in machine learning and/or security and privacy
  • Good programming skills

Contact: Mathias Humbert, mathias.humbert@epfl.ch


Title: Very large scale classification using Deep Learning

Open for applications

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The goal of this project is to perform supervised image classification with on the order of 10’000 to 20’000 classes. Conventional deep classification networks, which typically handle up to about 1000 classes, do not scale well to this setting. In order to achieve high scalability, the idea is to generate a short descriptor for each input image using a contrastive or triplet loss. Simultaneously, a descriptor, which serves as a cluster center, is learnt for each of the classes. The final classification is achieved by finding the cluster center closest to the descriptor of the input image. A large labeled database will be provided for this project.
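
The two ingredients described above, a triplet loss on descriptors and classification by nearest cluster center, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: squared Euclidean distance, a margin of 0.2, and the function names are choices made here, and the descriptors are synthetic rather than produced by a trained network.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

def nearest_center(descriptors, centers):
    """Assign each descriptor to the class whose center is closest."""
    sq_dists = (
        np.sum(descriptors ** 2, axis=1)[:, None]
        + np.sum(centers ** 2, axis=1)[None, :]
        - 2.0 * descriptors @ centers.T
    )
    return np.argmin(sq_dists, axis=1)

# Synthetic stand-in for learnt descriptors: each sample is its class
# center plus a small perturbation.
rng = np.random.default_rng(0)
num_classes, dim = 5, 16
centers = rng.standard_normal((num_classes, dim))
labels = rng.integers(0, num_classes, size=100)
descriptors = centers[labels] + 0.05 * rng.standard_normal((100, dim))

predicted = nearest_center(descriptors, centers)
```

Because prediction only needs distances to the class centers, adding classes grows a lookup table rather than the final layer of the network, and the lookup could be replaced by an approximate nearest-neighbor search when the number of classes is large.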

Goals/benefits:

  • Create and train a network, like DenseNet, for 32×32 and 64×64 sized images.
  • Scale up progressively from CIFAR-100 to ILSVRC-1000 to 20’000 classes.

Prerequisites:

  • Coding in Python
  • Knowledge of deep learning and the PyTorch/TensorFlow libraries
  • Interest in solving real-world problems

Deliverables:

  • Well-documented, clean code
  • Written report and oral presentation

Contact: Dorina Thanou, dorina.thanou@epfl.ch, Radhakrishna Achanta, radhakrishna.achanta@epfl.ch & Sofiane Sarni, sofiane.sarni@epfl.ch


Title: Pattern extraction from precipitation data

Open for applications

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Precipitation patterns are changing at the regional level, including changes in the mean amount and in the extreme events being observed. We are interested in understanding how the different moments of the probability distributions are changing over time. In order to do this, we will use unsupervised nonlinear dimension reduction techniques for feature extraction, such as Laplacian eigenmaps. We would also like to predict future values of the Laplacian eigenvectors.

The data are three-dimensional (latitude × longitude × time) samples of precipitation from climate models. For example, one question we are interested in answering is: Will there be more extreme precipitation events in a certain region?
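
As a simple baseline for tracking how the distribution changes over time, one can compute summary statistics in consecutive time windows. The sketch below assumes an array layout of (time, latitude, longitude), non-overlapping windows, and the 95th percentile as a proxy for extremes; these choices, the function name, and the synthetic gamma-distributed data are illustrative assumptions, not part of the project description.

```python
import numpy as np

def windowed_statistics(precip, window):
    """Mean, variance and 95th percentile per non-overlapping time window.

    precip: array of shape (time, lat, lon); all grid cells of the region
    are pooled within each window. Trailing time steps that do not fill a
    complete window are dropped.
    """
    n_windows = precip.shape[0] // window
    trimmed = precip[: n_windows * window]
    # Each row gathers every value (all time steps, all cells) of one window.
    blocks = trimmed.reshape(n_windows, -1)
    return {
        "mean": blocks.mean(axis=1),
        "variance": blocks.var(axis=1),
        "q95": np.quantile(blocks, 0.95, axis=1),
    }

# Synthetic stand-in for model output: 120 time steps on a 4 x 5 grid.
rng = np.random.default_rng(0)
precip = rng.gamma(shape=2.0, scale=1.0, size=(120, 4, 5))
stats = windowed_statistics(precip, window=12)
```

A trend in `stats["q95"]` that is not matched by a trend in `stats["mean"]` would suggest that extremes are changing differently from the average.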

Goals/benefits:

  • Working with machine learning techniques and time series analysis
  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Working with real-world data
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interested in interdisciplinary applications

References:

[1] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[2] R.R. Coifman and S. Lafon, “Diffusion maps”, Applied and Computational Harmonic Analysis, 2006

Contact: Eniko Szekely, eniko.szekely@epfl.ch


Title: Dictionary learning for material structure estimation in transmission electron microscopy

Open for applications

Laboratory: Swiss Data Science Center

Description:

The data considered in this project are very high-resolution hyperspectral images obtained by analytical transmission electron microscopy in the laboratory of Cécile Hébert. The spectra belong to two families of signal: X-ray signals (EDXS) and electron energy-loss signals (EELS).
The sample is a rectangular slice of material composed of different phases, each phase having a unique composition which can be characterized via its spectrum.

The data consist of hyperspectral images obtained in the horizontal plane below the slice by sending a vertical electron beam through the material. The signal obtained at each pixel can be thought of as a very noisy and quantized version of the average spectrum of the material along the vertical axis at that location of the horizontal plane. The main data analysis problem here consists in estimating as precisely as possible the different phases of the material and their corresponding spectra, in order, among other aims, to estimate the rare elements present in each of the phases.
A natural family of unsupervised machine learning approaches for identifying the different phases and their spectral signatures is the family of methods based on matrix factorization models, which extend principal component analysis, independent component analysis, non-negative matrix factorization and the like.

The objective of this project is to work on a particular dictionary learning formulation of the problem with structural constraints and regularizations (simplex constraint and Laplacian regularization), and to implement it efficiently using block-proximal methods. The idea is to cast the problem so that the dictionary elements are the ideal spectra corresponding to each phase and that the decomposition coefficients are exactly the proportion of each phase present at each pixel in the sample.
The challenges that lie ahead beyond the design of an efficient algorithm are: to be able to choose or design a loss function that correctly models the noise in the physical system; to be able to separate residues or particles that do not belong to any of the main phases; and to be able to automatically select the right number of phases.
Several extensions of the problem are possible. Among others, while the EDX spectrum of a given pixel is the linear superposition of the spectra of individual elements, this is no longer true for EELS. In particular, the EELS spectrum is formed as a convolution of the different spectra, which can perhaps be leveraged to build a more complete model.
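
To give a feel for the simplex-constrained formulation described above, here is a minimal sketch of one block of such a scheme: updating the per-pixel phase proportions with the dictionary held fixed, by projected gradient descent. It makes several simplifying assumptions that the project itself leaves open: a squared-error loss (the right noise model is one of the stated challenges), no Laplacian regularization, no dictionary update, and a sort-based Euclidean projection onto the simplex. All names and the synthetic data are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex."""
    v = np.atleast_2d(np.asarray(v, dtype=float))
    n_rows, n_cols = v.shape
    u = np.sort(v, axis=1)[:, ::-1]                 # rows sorted descending
    shifted_cumsum = np.cumsum(u, axis=1) - 1.0
    positions = np.arange(1, n_cols + 1)
    # The condition holds for a prefix of each sorted row; count its length.
    rho = np.sum(u - shifted_cumsum / positions > 0, axis=1)
    theta = shifted_cumsum[np.arange(n_rows), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)

def update_abundances(X, D, A, n_steps=200):
    """Projected gradient steps on 0.5 * ||X - A D||_F^2 over rows of A.

    X: (pixels, channels) spectra, D: (phases, channels) dictionary,
    A: (pixels, phases) proportions, each row constrained to the simplex.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2          # 1 / Lipschitz constant
    for _ in range(n_steps):
        gradient = (A @ D - X) @ D.T
        A = project_to_simplex(A - step * gradient)
    return A

# Synthetic, noise-free example: 3 phases, 50 spectral channels, 40 pixels.
rng = np.random.default_rng(0)
D = rng.random((3, 50))
A_true = rng.dirichlet(np.ones(3), size=40)
X = A_true @ D
A_init = np.full((40, 3), 1.0 / 3.0)
A_est = update_abundances(X, D, A_init)
```

In a full block scheme this update would alternate with an update of `D`, and the squared loss could be swapped for one better matched to the counting noise of the detector.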

Goals/Benefits:

  • Experience in using machine learning techniques to model data in physics
  • Learning how to incorporate expert knowledge in specialized ML formulations
  • Gaining proficiency in the hands-on use of optimization algorithms

Prerequisites:

  • Machine learning course at the master level
  • Optimization algorithms (in particular proximal algorithms)
  • Proficiency in Python.

Advisors:

The student will be working under the guidance of Guillaume Obozinski, Deputy Chief Data Scientist at the Swiss Data Science Center, Prof. Cécile Hébert, Director of the Electron Spectrometry and Microscopy laboratory (LSME), and Hui Chen, PhD student at the LSME.

Contact: Guillaume Obozinski, guillaume.obozinski@epfl.ch


Title: Signal processing on graphs: an application to temperature interpolation

Open for applications

Laboratory: Swiss Data Science Center

Description:

The goal of the project is to use machine learning and signal processing techniques to extract and understand patterns in temperature data. These patterns will be further used to interpolate temperatures at a given location using the information contained in the neighborhood graph. The graph is to be built either in the geographical space or in the data space, e.g., relying on additional information such as altitude. The data are three-dimensional (latitude × longitude × time) samples of temperature (and possibly moisture) from climate models and real observations.
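
One standard way to interpolate a signal on a graph is harmonic interpolation: unknown values are chosen so that each one equals the weighted average of its neighbors, which amounts to solving a linear system in the graph Laplacian. The sketch below is an illustration under assumptions, not the method prescribed by the project: it uses a dense Gaussian-weighted graph on the station coordinates, a median bandwidth heuristic, and synthetic temperatures. Extra columns such as altitude could be appended to `coords` to build the graph in the data space instead.

```python
import numpy as np

def interpolate_on_graph(coords, values, known_mask, bandwidth_sq=None):
    """Fill unknown node values by harmonic interpolation on a dense graph.

    coords: (n, d) node features, values: (n,) signal (entries at unknown
    nodes are ignored), known_mask: (n,) boolean array.
    """
    coords = np.asarray(coords, dtype=float)
    diffs = coords[:, None, :] - coords[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    if bandwidth_sq is None:
        # Assumed heuristic: median nonzero squared distance.
        bandwidth_sq = np.median(sq_dists[sq_dists > 0])
    W = np.exp(-sq_dists / bandwidth_sq)
    np.fill_diagonal(W, 0.0)                      # no self-loops
    L = np.diag(W.sum(axis=1)) - W                # combinatorial Laplacian
    unknown = ~known_mask
    L_uu = L[np.ix_(unknown, unknown)]
    L_uk = L[np.ix_(unknown, known_mask)]
    filled = np.array(values, dtype=float)
    # Harmonic condition on unknown nodes: L_uu f_u + L_uk f_k = 0.
    filled[unknown] = np.linalg.solve(L_uu, -L_uk @ filled[known_mask])
    return filled

# Synthetic example: temperature decreasing along the first coordinate.
rng = np.random.default_rng(0)
coords = rng.random((30, 2))
temperature = 15.0 - 5.0 * coords[:, 0]
known_mask = np.arange(30) % 3 != 0               # hide every third station
observed = np.where(known_mask, temperature, np.nan)
filled = interpolate_on_graph(coords, observed, known_mask)
```

Because every interpolated value is a weighted average of its neighbors, the result always lies within the range of the observed values; richer models would be needed to extrapolate beyond it.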

Goals/Benefits:

  • Working with machine learning and signal processing techniques
  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Working with real-world data 
  • Advancing research on an interdisciplinary problem 
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills) 
  • Python (intermediate skills)
  • Interested in interdisciplinary applications

References:

[1] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[2] R.R. Coifman and S. Lafon, “Diffusion maps”, Applied and Computational Harmonic Analysis, 2006

Contact: Eniko Szekely, eniko.szekely@epfl.ch & Dorina Thanou, dorina.thanou@epfl.ch