{"id":129,"date":"2018-08-19T12:52:01","date_gmt":"2018-08-19T12:52:01","guid":{"rendered":"http:\/\/nitk.acm.org\/blog\/?p=129"},"modified":"2018-08-19T12:52:01","modified_gmt":"2018-08-19T12:52:01","slug":"group-membership-protocols","status":"publish","type":"post","link":"https:\/\/nitk.acm.org\/blog\/2018\/08\/19\/group-membership-protocols\/","title":{"rendered":"Group Membership Protocols"},"content":{"rendered":"<p id=\"6620\" class=\"graf graf--p graf-after--h3\"><strong class=\"markup--strong markup--p-strong\">What are Group Membership Protocols?<\/strong><\/p>\n<p id=\"b306\" class=\"graf graf--p graf-after--p\">In data centers, failures are the norm rather than the exception. Google fellow Jeff Dean offers an overview of Google\u2019s data center operations:<\/p>\n<p id=\"66de\" class=\"graf graf--p graf--startsWithDoubleQuote graf-after--p\"><em class=\"markup--em markup--p-em\">\u201cIn each cluster\u2019s first year, it\u2019s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will \u201cgo wonky,\u201d with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there\u2019s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.\u201d<\/em><\/p>\n<p id=\"6816\" class=\"graf graf--p graf-after--p\">We need a mechanism to detect failures and disseminate the failure information through the network. This is where Group Membership Protocols come into picture. We give every process a membership list. A membership list contains the list of all processes along with its last heartbeat. 
(A heartbeat here records the last time a given process communicated with the current process.)<\/p>\n<p id=\"7ee0\" class=\"graf graf--p graf-after--p\">Now that every process has a membership list, we need to detect failures. Once a failure is detected, the failed process needs to be removed from all membership lists. How do we achieve this?<\/p>\n<p id=\"3e13\" class=\"graf graf--p graf-after--p\"><strong class=\"markup--strong markup--p-strong\">Heartbeat Protocol<\/strong><\/p>\n<p id=\"aa15\" class=\"graf graf--p graf-after--p\">A process P periodically sends a heartbeat message to a process Q to signal that it is alive. If Q stops receiving heartbeats from P, it assumes that P has failed.<\/p>\n<p id=\"cc54\" class=\"graf graf--p graf-after--p\">The topology can vary: all processes might send heartbeats to a centralized node, or each process might send them to every other process.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-131 aligncenter\" src=\"https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha1.png\" alt=\"\" width=\"789\" height=\"446\" srcset=\"https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha1.png 789w, https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha1-300x170.png 300w, https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha1-768x434.png 768w\" sizes=\"auto, (max-width: 789px) 100vw, 789px\" \/><\/p>\n<p id=\"a289\" class=\"graf graf--p graf-after--p\">Both topologies have obvious disadvantages. Centralized heartbeating depends on a single central process: if it fails, failure detection fails with it. 
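The heartbeat bookkeeping described above can be sketched as follows (an illustrative sketch; the class name, method names, and the TIMEOUT value are invented for this example, not from the post):

```python
import time

# Illustrative sketch of heartbeat-based failure detection.
TIMEOUT = 5.0  # seconds without a heartbeat before a peer is presumed failed

class HeartbeatTable:
    def __init__(self):
        self.last_seen = {}  # process id -> local time of last heartbeat

    def record_heartbeat(self, pid, now=None):
        # Called when a heartbeat message arrives from process `pid`.
        self.last_seen[pid] = time.monotonic() if now is None else now

    def failed_processes(self, now=None):
        # Any peer not heard from within TIMEOUT is presumed failed.
        now = time.monotonic() if now is None else now
        return [pid for pid, t in self.last_seen.items() if now - t > TIMEOUT]
```

The same table works for either topology: in the centralized scheme only the central node keeps one, while in the all-to-all scheme every process does.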
All-to-all heartbeating incurs the overhead of every process sending heartbeats to every other process.<\/p>\n<p id=\"7f0b\" class=\"graf graf--p graf-after--p\">So how do we optimize?<\/p>\n<p id=\"02c4\" class=\"graf graf--p graf-after--p\"><strong class=\"markup--strong markup--p-strong\">Gossip-Style Membership<\/strong><\/p>\n<p class=\"graf graf--p graf-after--p\">Every process periodically increments its own heartbeat counter and sends its membership list to n randomly selected processes. Each recipient merges the received list into its own, keeping the higher heartbeat counter for each member, and in turn gossips its list to n randomly selected processes. If a member\u2019s heartbeat counter does not increase for T_fail seconds, that member is considered failed. The process then waits a further T_cleanup seconds before deleting the member from its membership list.<\/p>\n<figure id=\"attachment_133\" aria-describedby=\"caption-attachment-133\" style=\"width: 599px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-133\" src=\"https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha2.png\" alt=\"\" width=\"599\" height=\"285\" srcset=\"https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha2.png 599w, https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha2-300x143.png 300w\" sizes=\"auto, (max-width: 599px) 100vw, 599px\" \/><figcaption id=\"caption-attachment-133\" class=\"wp-caption-text\">Gossip Membership Protocol (by Indranil Gupta)<\/figcaption><\/figure>\n<p id=\"227a\" class=\"graf graf--p graf-after--p\">A single heartbeat takes O(log N) gossip rounds to propagate through a system of N processes, so N heartbeats take O(N log N) time to spread.<\/p>\n<p id=\"8903\" class=\"graf graf--p graf-after--p\">Can we optimize further?<\/p>\n<p id=\"00a8\" class=\"graf graf--p graf-after--p\"><strong class=\"markup--strong markup--p-strong\">SWIM Failure Detector Protocol<\/strong><\/p>\n<p class=\"graf graf--p graf-after--p\">SWIM (Scalable Weakly-consistent Infection-style process group Membership protocol) uses membership lists along with a protocol period T. 
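Before going further into SWIM, the gossip-style update rule described above can be sketched as follows (an illustrative sketch; the function names and the (counter, timestamp) entry layout are assumptions, not from the post):

```python
import random

# Illustrative sketch of gossip-style membership. Each membership list maps
# a process id to (heartbeat_counter, local_time_when_counter_last_increased).

def merge(local, received, now):
    # Keep the highest heartbeat counter seen for each member; refresh the
    # local timestamp only when the counter actually increases.
    for pid, (hb, _) in received.items():
        if pid not in local or hb > local[pid][0]:
            local[pid] = (hb, now)
    return local

def gossip_targets(members, self_id, n):
    # Pick n random peers (excluding ourselves) to send our list to.
    peers = [p for p in members if p != self_id]
    return random.sample(peers, min(n, len(peers)))

def suspected_failed(local, now, t_fail):
    # Members whose counter has not increased for t_fail seconds.
    return [pid for pid, (_, t) in local.items() if now - t > t_fail]
```

A suspected member would then be kept (but ignored) for a further T_cleanup seconds before being deleted, so that stale gossip about it cannot resurrect the entry.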
In each protocol period, a process Pi randomly picks a process Pj from its membership list and pings it. If Pi does not receive an ACK within the timeout, it selects k other members and asks them to ping Pj on its behalf; an ACK from Pj is relayed back through these k targets to Pi. If Pi still hasn\u2019t received an ACK by the end of the period T, it declares Pj as failed.<\/p>\n<figure id=\"attachment_134\" aria-describedby=\"caption-attachment-134\" style=\"width: 761px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-134\" src=\"https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha3.png\" alt=\"\" width=\"761\" height=\"318\" srcset=\"https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha3.png 761w, https:\/\/nitk.acm.org\/blog\/wp-content\/uploads\/2018\/08\/sunitha3-300x125.png 300w\" sizes=\"auto, (max-width: 761px) 100vw, 761px\" \/><figcaption id=\"caption-attachment-134\" class=\"wp-caption-text\">SWIM Protocol (by Indranil Gupta)<\/figcaption><\/figure>\n<div class=\"aspectRatioPlaceholder-fill\">Data centers use more complex failure prediction and detection models which use these protocols as building blocks.<\/div>\n<div style=\"text-align: right;\"><em>&#8211; Sunitha Selvan<\/em><\/div>\n","protected":false},"excerpt":{"rendered":"<p>What are Group Membership Protocols? In data centers, failures are the norm rather than the exception. 
Google fellow Jeff Dean offers an overview of Google\u2019s data center operations: \u201cIn each cluster\u2019s first year, it\u2019s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail,&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[10],"tags":[],"class_list":["post-129","post","type-post","status-publish","format-standard","hentry","category-tech"],"_links":{"self":[{"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/posts\/129","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/comments?post=129"}],"version-history":[{"count":3,"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/posts\/129\/revisions"}],"predecessor-version":[{"id":135,"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/posts\/129\/revisions\/135"}],"wp:attachment":[{"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/media?parent=129"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/categories?post=129"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nitk.acm.org\/blog\/wp-json\/wp\/v2\/tags?post=129"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}