Master Projects – Autumn 2018

New projects will be announced soon

Swiss Data Science Center
The Swiss Data Science Center is a joint venture between EPFL and ETH Zurich. Its mission is to accelerate the adoption of data science and machine learning techniques within the academic disciplines of the ETH Domain, the Swiss academic community at large, and the industrial sector. In particular, it addresses the gap between those who create data, those who develop data analytics and systems, and those who could potentially extract value from them. The center is composed of a large multi-disciplinary team of data and computer scientists, and experts in selected domains, with offices in Lausanne and Zurich (www.datascience.ch).

Title: Query engine for decentralized knowledge representation

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The Swiss Data Science Center is developing a cloud-based platform for collaborative data science. The platform provides a one-stop shop for data and algorithms, enabling data scientists to easily discover and reproduce the work of their peers in a secure collaborative environment. To this end, the platform automatically records the data science workflows, and the relationships between research artefacts (code, data, results), in a knowledge representation. Scientists can query this knowledge representation using clauses that may include relationship expressions, such as “find all research projects and results derived from a data set or a class of data sets”.

This internship is about developing a Proof of Concept (PoC) of a unified query engine capable of answering such queries when the knowledge representation is decentralized. In the proposed scenario, the query must be decomposed into subqueries and executed on multiple database servers, possibly hosted in different administrative domains governed by independent access rights.
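
As a rough illustration of the decomposition idea, the sketch below fans a single relationship query out to several independently governed servers and merges their partial answers. All names, the query format, and the in-memory "servers" are hypothetical stand-ins, not the platform's actual API.

```python
# Minimal sketch of federated query execution (illustrative names only).

def federated_query(servers, dataset):
    """Send the same subquery to every server and merge the results.

    `servers` maps an administrative domain to a callable that evaluates
    the subquery locally and returns matching project identifiers;
    access control stays with each domain.
    """
    results = set()
    for domain, run_subquery in servers.items():
        try:
            results |= set(run_subquery(dataset))
        except PermissionError:
            # A domain may refuse the subquery; skip it.
            continue
    return results

# Toy example: two "servers", each holding part of the knowledge graph
# as project -> derived-from-dataset edges.
graph_a = {"proj1": "dataset_x", "proj2": "dataset_y"}
graph_b = {"proj3": "dataset_x"}

servers = {
    "domain-a": lambda ds: [p for p, d in graph_a.items() if d == ds],
    "domain-b": lambda ds: [p for p, d in graph_b.items() if d == ds],
}

# "Find all research projects derived from dataset_x."
print(sorted(federated_query(servers, "dataset_x")))  # ['proj1', 'proj3']
```

In a real PoC each callable would be a remote subquery against a graph database in a separate administrative domain, rather than a lambda over a dictionary.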

Goals/Benefits:

  • Practical experience in developing complex large scale software systems
  • Becoming familiar with application containerization in cloud-based environment
  • Becoming familiar with state-of-the art big data solutions, such as database graphs
  • Working in an interactive and interdisciplinary research environment

Prerequisites:

  • Intermediate level experience in using Linux
  • Beginner level experience with application containerization in cloud-based environments
  • Good Python or Scala programming skills
  • Good software engineering skills

Contact: Eric Bouillet eric.bouillet@epfl.ch


Title: Declarative workflow runner for Kubernetes

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The Swiss Data Science Center is developing a cloud-based platform for collaborative data science. The platform provides a one-stop shop for data and algorithms, enabling data scientists to easily discover and reproduce the work of their peers in a secure collaborative environment. To this end, the platform provides methods to express, share and run data science workflows contributed by the data scientists in the cloud. Workflows are currently formulated in the SDSC collaborative data science platform as Directed Acyclic Graphs (DAGs) using the Common Workflow Language (CWL).

This internship is about designing a declarative workflow language similar to GNU Make, and developing a Proof of Concept (PoC) to run the workflows in a distributed application container orchestration environment such as Kubernetes.
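
The make-like semantics can be sketched in a few lines: each target declares its dependencies and an action, and the runner executes actions in dependency order, each at most once. The sketch below is purely illustrative; the actual PoC would translate each step into a container run scheduled on Kubernetes.

```python
# Sketch of a make-like declarative workflow runner (illustrative only).

def run_workflow(rules, target, done=None):
    """Run `target` after its dependencies, depth-first.

    `rules` maps a target name to (dependencies, action); each action
    runs at most once, as `make` would do.
    """
    if done is None:
        done = set()
    if target in done:
        return done
    deps, action = rules[target]
    for dep in deps:
        run_workflow(rules, dep, done)
    action()
    done.add(target)
    return done

# Toy four-step data science workflow; actions just record their order.
order = []
rules = {
    "clean":    ([], lambda: order.append("clean")),
    "features": (["clean"], lambda: order.append("features")),
    "model":    (["features"], lambda: order.append("model")),
    "report":   (["model", "features"], lambda: order.append("report")),
}

run_workflow(rules, "report")
print(order)  # ['clean', 'features', 'model', 'report']
```

Note that "features" is a dependency of both "model" and "report" but runs only once, which is exactly the DAG semantics the CWL workflows already rely on.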

Goals/Benefits:

  • Practical experience in developing complex large scale software systems
  • Becoming familiar with state-of-the-art application containerization and orchestration technologies such as Docker and Kubernetes.
  • Becoming familiar with cloud-based application development.
  • Working in an interactive and interdisciplinary research environment.

Prerequisites:

  • Intermediate level experience in using Linux
  • Beginner level experience with application containerization and orchestration
  • Good Python or Scala programming skills
  • Good software engineering skills

Contact: Thiebaut Johann-Michael Raymond johann-michael.thiebaut@epfl.ch


Title: Attribute Based Access Control implementation for a collaborative data science platform

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The Swiss Data Science Center is developing a cloud-based platform for collaborative data science. The platform provides a one-stop shop for data and algorithms, enabling data scientists to easily discover and reproduce the work of their peers in a secure collaborative environment. Using this platform, users can access data and run data analytics in a cloud-based computing environment managed by the platform.

This internship is about designing, implementing and testing a proof of concept of an Attribute Based Access Control (ABAC) system to authorize access to the resource entities managed by the platform. The candidate will first demonstrate a policy decision point that grants access rights to users based on policies expressed as Boolean rules combining attributes of the user, the accessed resource and the environment. Next, the candidate will design an ABAC solution capable of operating in a federated environment, where resources are distributed across multiple administrative domains, each protected by its own policy decision point with individual access policies.
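
A policy decision point of this kind can be prototyped very compactly. In the sketch below the policy format (Python predicates over attribute dictionaries) and all attribute names are assumptions chosen for illustration, not the platform's actual policy language.

```python
# Toy ABAC policy decision point: grant access if any Boolean rule over
# user, resource and environment attributes evaluates to True
# (deny by default). Rule contents are illustrative.

def decide(policies, user, resource, env):
    """Return True if any policy rule grants access."""
    return any(rule(user, resource, env) for rule in policies)

policies = [
    # Owners may always access their own resources.
    lambda u, r, e: u["id"] == r["owner"],
    # Members of the resource's project may read during business hours.
    lambda u, r, e: (r["project"] in u["projects"]
                     and e["action"] == "read"
                     and 8 <= e["hour"] < 18),
]

alice = {"id": "alice", "projects": ["climate"]}
dataset = {"owner": "bob", "project": "climate"}

print(decide(policies, alice, dataset, {"action": "read", "hour": 10}))   # True
print(decide(policies, alice, dataset, {"action": "write", "hour": 10}))  # False
```

In the federated setting described above, each administrative domain would run its own `decide` over its own policy set, and the overall answer would have to be composed from the per-domain decisions.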

Goals/Benefits:

  • Practical experience in developing complex large scale software systems
  • Becoming familiar with state-of-the-art application containerization and orchestration technologies such as Docker and Kubernetes.
  • Becoming familiar with cloud-based application development.
  • Becoming familiar with state-of-the-art access control paradigms
  • Working in an interactive and interdisciplinary research environment.

Prerequisites:

  • Intermediate level experience in using Linux
  • Beginner level experience with application containerization and orchestration
  • Good Python or Scala programming skills
  • Good software engineering skills

Contact: Sandra Savchenko-de Jong sandra.dejong@epfl.ch


Title: Inferring interpretable structures from deep architectures

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The tremendous advancement in machine learning algorithms over the last few decades has accelerated the adoption of neural networks and deep learning architectures in many applications such as image classification, natural language processing, and human action recognition [1]. These recent methods have led to impressive performance that comes close to that of humans on certain recognition or classification tasks. However, these systems are often poorly understood and generally developed without performance guarantees. Even worse, their results are often hard to interpret for application domain experts: deep neural network algorithms are mainly based on non-linear functions, which map raw data, such as image pixels, to feature representations that are hard to interpret in terms of a priori domain knowledge. Thus, although the popular neural network architectures are highly successful in terms of performance, their lack of transparency is a significant impediment to their adoption as advanced data science techniques in sensitive applications such as medical diagnosis.

The goal of this project is to attempt to interpret deep architectures by studying the structure of their inner-layer representations and, based on this structure, to find coherent explanations of their classification decisions. To that end, we plan to use tools from graph theory and graph signal processing [3]. The obtained results will be compared with classical feature visualization techniques [2]. The proposed algorithm will be tested on classical computer vision datasets such as ImageNet, as well as on medical cancer images.
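
One concrete way to expose such structure, in the spirit of the deep k-nearest-neighbors idea [4], is to connect each sample to its nearest neighbours in an inner layer's feature space and inspect the resulting graph. The sketch below assumes activations are available as plain vectors and uses random features in place of real network activations.

```python
# Sketch: build a k-nearest-neighbour graph over inner-layer activations
# (random features stand in for real network activations).
import numpy as np

def knn_graph(features, k):
    """Return a symmetric adjacency matrix connecting each sample
    to its k nearest neighbours in feature space."""
    n = len(features)
    # Pairwise Euclidean distances between activation vectors.
    d = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    adj = np.zeros((n, n))
    for i in range(n):
        # Skip self (distance 0) and keep the k closest samples.
        for j in np.argsort(d[i])[1:k + 1]:
            adj[i, j] = adj[j, i] = 1.0
    return adj

rng = np.random.default_rng(0)
# Two well-separated clusters, mimicking two classes in an inner layer.
acts = np.vstack([rng.normal(0, 0.1, (5, 16)), rng.normal(5, 0.1, (5, 16))])
A = knn_graph(acts, k=2)

# With separated clusters, no edge crosses the class boundary.
print(A[:5, 5:].sum())  # 0.0
```

A decision could then be explained by the labels of a sample's graph neighbourhood at each layer, which is where the graph signal processing tools [3] would come in.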

Goals/Benefits:

  • Research experience in the emerging topic of interpretability/explainability of deep nets. If successful, the project will lead to a scientific publication.
  • Practical experience with state-of-the-art deep learning architectures.
  • Exposure to advanced optimization techniques.

Prerequisites:

  • Experience with deep learning frameworks such as PyTorch or TensorFlow.
  • Good knowledge of Python.
  • Knowledge of discrete optimization is a plus.
  • Motivation to work in a challenging research topic.

References:

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, 2015

[2] https://distill.pub/2018/building-blocks/

[3] D. I Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. Signal Processing Magazine, IEEE, vol. 30, num. 3, p. 83-98, 2013.

[4] Nicolas Papernot and Patrick McDaniel, “Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning”, arXiv:1803.04765, 2018. 


Contact: Dorina Thanou dorina.thanou@epfl.ch


Title: Deep learning for feature extraction of spatiotemporal data: application to climate science

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Unsupervised autoencoders [1,2] are used to reduce the dimensionality of data in the absence of labels. The low-dimensional representation can be used for feature extraction, but also as input to other machine learning algorithms. Here, we will use the low-dimensional representations to estimate kernel similarities in the reduced space and compute the eigenvectors of an associated graph Laplacian [3]. The input to the autoencoder will be sequences of images capturing the spatiotemporal structure of the data.

We will apply this method to the analysis of satellite images of cloud cover (26 years of data). The eigenfunctions of the Laplacian computed in the original high-dimensional space capture physically meaningful patterns intrinsic to the atmosphere, such as the annual cycle, El Niño, or the diurnal cycle [4]. Our goal is to compare the graph Laplacian technique with the method outlined here, and to verify whether dimension reduction using an autoencoder helps to improve the quality of the extracted signals.
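
The graph Laplacian step [3] can be sketched as follows, with random low-dimensional codes standing in for the autoencoder representations; the kernel bandwidth and toy data are illustrative assumptions.

```python
# Sketch of the Laplacian eigenmaps step: Gaussian-kernel similarities
# on low-dimensional codes, then eigenvectors of the graph Laplacian.
import numpy as np

def laplacian_eigenvectors(z, sigma=1.0, k=2):
    """Kernel affinities on codes `z`, then the k smallest nontrivial
    eigenvectors of the unnormalized graph Laplacian L = D - W."""
    d2 = ((z[:, None] - z[None, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))      # kernel similarity matrix
    lap = np.diag(w.sum(1)) - w             # graph Laplacian
    vals, vecs = np.linalg.eigh(lap)        # ascending eigenvalues
    return vecs[:, 1:k + 1]                 # skip the constant eigenvector

# Toy "codes": two regimes in a 3-dimensional latent space; in practice
# these would be autoencoder representations of image sequences.
rng = np.random.default_rng(1)
z = np.vstack([rng.normal(0, 0.2, (6, 3)), rng.normal(4, 0.2, (6, 3))])
emb = laplacian_eigenvectors(z, sigma=1.0, k=1)

print(emb.shape)  # (12, 1)
```

On this toy input the first nontrivial eigenvector takes opposite signs on the two regimes, which is the kind of physically meaningful separation (annual cycle, El Niño) sought in the real data.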

Goals/benefits:

  • Working with machine learning and deep learning libraries in Python (pandas, scikit-learn, PyTorch)
  • Becoming familiar with the analysis of time series (power spectra, auto-correlation)
  • Working with real-world satellite observations
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Machine learning and deep learning (advanced or intermediate skills)
  • Python (advanced skills)
  • Interested in interdisciplinary applications

References:

[1] Tutorial autoencoders – http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/

[2] G.E. Hinton and R.R. Salakhutdinov, “Reducing the dimensionality of data using neural networks”, Science, 2006

[3] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[4] E. Szekely, D. Giannakis, A.J. Majda, “Extraction and predictability of coherent intraseasonal signals in infrared brightness temperature data”, Climate Dynamics, 2016

Contact: Eniko Szekely eniko.szekely@epfl.ch


Title: Dimension reduction and prediction for spatiotemporal data: application to climate science

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Dynamical systems such as the climate are highly nonlinear and, although the observations are high-dimensional, most of the dynamics are captured by a small number of physically meaningful patterns.

In the first part of the project we will use unsupervised dimension reduction techniques for feature extraction, and compare linear techniques (e.g., Principal Component Analysis) with nonlinear kernel-based techniques (e.g., Laplacian Eigenmaps [1] and Diffusion Maps [2]). In the second part of the project we will forecast future values of the Laplacian eigenvectors using nonlinear regression techniques, such as Gaussian processes [3].

The data are three-dimensional (latitude × longitude × time) real-world global observations of temperature (and possibly rainfall), available for over 100 years. For example, we are interested in extracting the climate change trend in temperature and predicting its future values.
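
For the linear baseline, a PCA of the centred (time × space) data matrix can already recover a shared trend. The sketch below uses synthetic data in place of the real observations; the grid size, trend shape and noise level are illustrative assumptions.

```python
# Sketch of the linear baseline: PCA (via SVD) on a time x space matrix,
# with synthetic data standing in for the temperature observations.
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 200)                       # 200 time steps
trend = 0.8 * t                                  # shared warming trend
# 50 grid points, each = shared trend * spatial weight + noise.
weights = rng.uniform(0.5, 1.5, 50)
X = trend[:, None] * weights[None, :] + rng.normal(0, 0.05, (200, 50))

# PCA via SVD of the centred matrix; the leading principal component
# should recover the common trend.
Xc = X - X.mean(0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = U[:, 0] * S[0]

# Correlation between the first PC and the true trend.
corr = abs(np.corrcoef(pc1, trend)[0, 1])
print(round(corr, 2))  # close to 1.0
```

The nonlinear techniques (Laplacian Eigenmaps [1], Diffusion Maps [2]) replace this linear projection with kernel-based embeddings, and the extracted time series would then be forecast with Gaussian processes [3].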

Goals/benefits:

  • Working with machine learning techniques and time series analysis
  • Working with machine learning libraries in Python (pandas, scikit-learn)
  • Working with real-world observations
  • Advancing research on an interdisciplinary problem
  • Possibility to publish a research paper

Prerequisites:

  • Linear algebra
  • Machine learning (intermediate skills)
  • Python (intermediate skills)
  • Interested in interdisciplinary applications

References:

[1] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering”, NIPS, 2001

[2] R.R. Coifman and S. Lafon, “Diffusion maps”, Applied and Computational Harmonic Analysis, 2006

[3] C.E. Rasmussen and C.K.I. Williams, “Gaussian Processes for Machine Learning”, MIT Press 2006

Contact: Eniko Szekely eniko.szekely@epfl.ch


Title: Self-regulating Generative Adversarial Networks

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

Generative Adversarial Networks (GANs) are notoriously difficult to train, because the min-max nature of the problem is inherently unstable. Training therefore requires plenty of tweaks. In this project, a feedback mechanism will be tried to regulate the GAN training. When the discriminator is significantly stronger than the generator, the learning rate and/or the number of training iterations of the discriminator will be controlled based on the generator loss, and vice versa. Such feedback will introduce a self-regulating property to GAN training.
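
The feedback rule can be sketched as a small function that maps the two losses to adjusted learning rates; the threshold and scaling below are illustrative assumptions, not a tested recipe.

```python
# Toy sketch of the proposed feedback rule: scale one player's learning
# rate down when it dominates the other. Numbers are illustrative.

def regulated_lrs(d_loss, g_loss, base_lr=1e-4, ratio=2.0):
    """Return (discriminator_lr, generator_lr).

    If the generator loss is much larger than the discriminator loss,
    the discriminator is winning: slow it down and keep the generator
    at full speed, and vice versa.
    """
    if g_loss > ratio * d_loss:        # discriminator too strong
        return base_lr * d_loss / g_loss, base_lr
    if d_loss > ratio * g_loss:        # generator too strong
        return base_lr, base_lr * g_loss / d_loss
    return base_lr, base_lr            # balanced: no adjustment

d_lr, g_lr = regulated_lrs(d_loss=0.1, g_loss=2.0)
print(d_lr < g_lr)  # True: discriminator slowed down
```

In an actual PyTorch training loop, these values would be written into the optimizers' parameter groups at each step; the same feedback could alternatively control the number of discriminator iterations per generator iteration.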

Goals/benefits:

  • Address a wide-spread problem in GAN training
  • Improve deep learning skills
  • Opportunity to publish a scientific paper

Prerequisites:

  • Knowledge of deep learning and GANs
  • Coding in Python using PyTorch and/or TensorFlow
  • Interest in solving practical problems

Contact: Radhakrishna Achanta radhakrishna.achanta@epfl.ch


Title: Quantifying Inference Risks in Health Databases

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

The decreasing cost of molecular profiling has provided the biomedical research community with a plethora of new types of biomedical data, enabling a breakthrough towards more precise and personalized medicine. In addition to biomedical data, preventive and social medicine practitioners increasingly use environmental data, such as location or pollution.

However, the release and usage of these intrinsically highly sensitive data pose a new threat to privacy.

The goal of this project is to design an evaluation framework to systematize the analysis of inference attacks that exploit biomedical data, such as the genome, but also environmental data collected by research institutes and hospitals.  In this endeavor, you will make use of probabilistic graphical models or other machine-learning models and test your models with real datasets provided by the IUMSP (Institut Universitaire de Médecine Sociale et Préventive) at CHUV. Time permitting, you will also develop defense mechanisms to reduce the impact of these inference attacks while keeping high levels of utility for medical researchers.

Goals/Benefits:

  • Becoming familiar with probabilistic/machine-learning models
  • Access to real-life health-related datasets
  • Gaining experience in fields of growing importance
  • Working in an interdisciplinary research environment

Prerequisites:

  • Good Python and/or Matlab skills
  • Good background in probabilities and machine learning
  • Being interested in working in a multidisciplinary environment

Contact: Mathias Humbert mathias.humbert@epfl.ch


Title: Assessing and Thwarting Privacy Risks in Data Science Platforms

– closed

Laboratory: Swiss Data Science Center

Type: Master Thesis Project

Description:

In order to enable the reproducibility of scientific studies, the Swiss Data Science Center (SDSC) is developing a flexible and scalable platform called Renga, which automates data provenance recording, maintenance and traceability in the form of a knowledge graph. Due to the potentially sensitive datasets used in studies performed with Renga, one of the key challenges is to provide all the aforementioned features while guaranteeing a high level of privacy.

The goal of this project is to evaluate the feasibility and risk of various types of inference attacks against SDSC’s knowledge graph. In particular, you will investigate whether metadata exposed through Renga can leak sensitive information and, if so, develop countermeasures to mitigate this risk. You will also study how outputs of machine-learning models can expose membership in the dataset used to train these models and provide defense mechanisms to reduce the impact of this attack.
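
As a minimal illustration of the membership part, a common baseline simply thresholds the model's confidence on a sample, since overfit models tend to be more confident on their training data. The numbers below are toy values, not results from Renga.

```python
# Minimal sketch of a confidence-based membership inference baseline:
# guess "member" when the model's confidence exceeds a threshold.

def membership_guess(confidence, threshold=0.9):
    """Return True (guess: training-set member) if the model's
    confidence on the sample exceeds the threshold."""
    return confidence > threshold

# Toy confidences: an overfit model is near-certain on training data.
train_conf = [0.99, 0.97, 0.95]   # samples in the training set
test_conf = [0.70, 0.55, 0.92]    # unseen samples

guesses = [membership_guess(c) for c in train_conf + test_conf]
print(guesses)  # [True, True, True, False, False, True]
```

The last toy value shows the baseline's false positives, which is exactly why the project calls for both quantifying the attack's success and designing defenses that reduce it.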

Goals/Benefits:

  • Acquiring knowledge on machine learning and privacy
  • Practical experience with a real-world data science platform
  • Gaining experience in fields of growing importance

Prerequisites:

  • Good background in machine learning and/or security and privacy
  • Good programming skills

Contact: Mathias Humbert mathias.humbert@epfl.ch