Friday, February 12, 2010

Node failure

In a distributed system a node can fail/crash in a lot of way and it is very important that we understand which failure model we are supporting while defining an algorithm. Because execution of an algorithm can change depending on the failure model.

Failure model

A node can fail in four way.
  1. Crash Stop
  2. Omissions
  3. Crash Recovery
  4. Byzantine / Arbitrary
The properties of each failure model is following,

Crash Stop

  • Node stop doing anything(sending/receiving/processing)
  • Once fail never recover

Omissions

It can be of 2 type as,
  1. Sending omission - not sending data where node is suppose to send according to algorithm
  2. Receiving omission - not receiving data when other node has sent message
Omission is actually a temporary state which eventually turns to a crash or a crash recovery state. Because it may recover from that state which will make it crash recovery and if it do not recovers then it becomes a crash stop.

Crash Recovery

  • A node might crash
  • It recovers after crashing and initiates a recovery process
Some crash recovery model also has stable storage. Nodes store state information in the stable storage and it can retrieves necessary information while it is in recovery state.

Byzantine / Arbitrary

A node may behave arbitrary (sending/receiving messages that are not specified by algorithm). This can be a malicious attack to the system.

Fault tolerance hierarchy

There is a hierarchy in the crash models. Like crash is a special type of omission, where the process never recovers. Omission is an intermediate state for crash recovery, where process runs it's recovery procedure and eventually recovers from the omission state.
And crash recovery can be a special case of byzantine failure since it allows any kind of deviation from the regular algorithm. So any kind of crash can be a special case of byzantine failure.


Figure: Fault tolerance hierarchy

References



1 comment: