lunduniversity.lu.se


Dynamic Fault-Tolerance and Task Scheduling in Distributed Systems

2016-02-04

Jonatan Broberg and Philip Ståhl, currently working on their master's thesis at MAPCI

How can a high level of reliability be provided on general-purpose computing platforms that run on infrastructure less reliable than desired? That is what two students, Jonatan Broberg and Philip Ståhl, are currently investigating in their thesis work at MAPCI. In practice, they will design and implement a fault-tolerance and scheduling framework.

Ensuring a certain level of reliability is a major concern for applications running in distributed environments. Cloud systems usually consist of heterogeneous commodity hardware, so ensuring a given level of reliability for an application is a complex task: any of the vast number of system components may fail at any time.

Reliability can be increased by cloning application tasks, where all clones produce the same output. In stream processing this is important to avoid losing any data. When both producers and consumers are replicated, the consumers can easily detect that one of the producers has failed, e.g. when no message is received from it for some time. This is based on the assumption that the producers send data to the consumers on a regular basis.
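The timeout-based detection described above can be illustrated with a minimal sketch. It is not taken from the thesis or from Calvin; the class name `TimeoutFailureDetector`, its methods, and the `timeout` parameter are hypothetical and chosen only for illustration. It assumes, as the text does, that every producer sends messages on a regular basis.

```python
import time


class TimeoutFailureDetector:
    """Suspect a producer once it has been silent for more than `timeout` seconds.

    Hypothetical helper for illustration only; not part of Calvin.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the behaviour can be tested
        self.last_seen = {}     # producer id -> time of its last message

    def record_message(self, producer_id):
        """Call whenever a message from `producer_id` arrives."""
        self.last_seen[producer_id] = self.clock()

    def suspected(self):
        """Return the sorted ids of producers whose last message is too old."""
        now = self.clock()
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A consumer would call `record_message` for every incoming message and poll `suspected()` periodically. The timeout should be noticeably longer than the producers' sending interval, since a too-short timeout flags slow but healthy producers as failed.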

In the case of a failure, a scheduling mechanism can spawn a new replica that picks up where the failed one stopped. Furthermore, keeping enough replicas running ensures that no data is lost.
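How many replicas count as "enough" can be estimated with a simple back-of-the-envelope model. The sketch below is an illustration, not the method used in the thesis. It assumes that replicas fail independently and that each one is available with the same probability; the function name `replicas_needed` and its parameters are hypothetical. Under these assumptions, at least one of n replicas is available with probability 1 - (1 - p)^n.

```python
def replicas_needed(replica_reliability, target_reliability):
    """Smallest n such that 1 - (1 - p)**n >= target.

    Assumes independent failures and identical per-replica reliability p,
    which real clusters only approximate (e.g. shared racks or power).
    """
    if not (0.0 < replica_reliability < 1.0):
        raise ValueError("replica_reliability must be strictly between 0 and 1")
    if not (0.0 < target_reliability < 1.0):
        raise ValueError("target_reliability must be strictly between 0 and 1")

    n = 1
    # Add replicas until the chance that all of them are down is small enough.
    while 1.0 - (1.0 - replica_reliability) ** n < target_reliability:
        n += 1
    return n
```

For example, with replicas that are each available 95% of the time, `replicas_needed(0.95, 0.99999)` returns 4. Correlated failures would push the real number higher than this estimate.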

The main aim of the thesis work is to devise a method for providing carrier-grade reliability on top of infrastructure that is less than carrier-grade. In practice, Jonatan Broberg and Philip Ståhl will design and implement a fault-tolerance and scheduling framework. It will be implemented in Calvin, Ericsson's actor-based open-source environment.

The investigation is of particular interest because it addresses a fundamental question in the industry: how to provide a high level of reliability on general-purpose computing platforms, each of which is less reliable than desired.

The students plan to verify their approach with a real-world experiment on a server cluster.

Jonatan Broberg studies Electrical Engineering, specializing in software development. Philip Ståhl studies Computer Science, specializing in software development. The master's thesis supervisor is Björn Landfeldt, director of MAPCI.