Best Practices for Fault Tolerant Design in IT
Reducing Data Vulnerability in Information Networks
Two important ways an IT manager can incorporate fault tolerance into his or her organization are building a distributed system and using mobile code.
NASA defines a fault tolerant approach as one that "expects failures to occur. However, their effects will be automatically counteracted by incorporating either redundancy or other types of compensation" (3). Applied to the field of information technology, fault tolerant concepts are usually associated with network design. Although using a distributed system or mobile code in an organization's information network differs from a "pure design redundancy approach" (NASA 3), these methodologies can minimize data loss.
Distributed System Approach
The distributed system approach typically involves "A non-centralized network consisting of numerous computers that can communicate with one another and that appear to users as parts of a single, large, accessible 'storehouse' of shared hardware, software, and data" ("Distributed System"). Such a system incorporates one or more key fault-tolerant aspects: redundancy, replication, and/or diversity. Traditional replication-based methods provide fault tolerance in that the "system has been designed to eliminate single points of failure" ("Redundancy"). One widely known application of this principle is RAID, or Redundant Array of Inexpensive Disks. RAID is "A mechanism for providing data resilience for computer systems using mirrored arrays of magnetic disks. Different levels of RAID can be applied to provide for greater resilience" ("RAID"). However, the principle of redundancy can be applied to any component in a system and, essentially, to a system itself.
While a purely redundant approach may rely solely on replicative fallbacks, redundancy in fault tolerant design makes "provisions . . . for planned degraded modes of operation where acceptable" (NASA 3). Replication, in a broader sense, can be defined as providing identical instances of an aspect of the system and/or directing tasks to one or more of those instances.
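The replication and failover idea described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a design taken from any of the sources cited here: the names Replica, ReplicatedStore, and ReplicaDown are hypothetical, and the in-memory replicas stand in for real disks or servers. Each write goes to every available replica, and each read falls back to the next replica when one has failed.

```python
class ReplicaDown(Exception):
    """Raised when a replica (or every replica) cannot serve a request."""


class Replica:
    """One in-memory copy of the data; `healthy` simulates a fault (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.data = {}

    def put(self, key, value):
        if not self.healthy:
            raise ReplicaDown(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


class ReplicatedStore:
    """Writes to every replica and reads from the first one that answers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def put(self, key, value):
        written = 0
        for replica in self.replicas:
            try:
                replica.put(key, value)
                written += 1
            except ReplicaDown:
                continue  # tolerate the fault; other copies still take the write
        if written == 0:
            raise ReplicaDown("all replicas are down")
        return written  # number of copies that accepted the write

    def get(self, key):
        for replica in self.replicas:
            try:
                return replica.get(key)
            except ReplicaDown:
                continue  # fail over to the next copy
        raise ReplicaDown("all replicas are down")

    def degraded(self):
        """True when at least one replica is down (reduced redundancy)."""
        return any(not replica.healthy for replica in self.replicas)


if __name__ == "__main__":
    store = ReplicatedStore([Replica("a"), Replica("b")])
    store.put("report", "Q1 figures")
    store.replicas[0].healthy = False   # simulate a failure of the first copy
    print(store.get("report"))          # still served by the surviving replica
    print(store.degraded())             # the store is now in a degraded mode
```

With one of the two replicas marked unhealthy, reads still succeed from the surviving copy, and degraded() reports that the store is running with reduced redundancy, which loosely corresponds to a planned degraded mode of operation. The sketch does not resynchronize a replica that recovers after missing writes; a real system would need that step.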
Replication (and redundancy) provides "resilience," which is the ability of an aspect of a system to continue functioning after a fault. In contrast, diversity provides multiple (yet distinct) versions of the same system aspect. Under systemic diversity, these versions are used like replicated or redundant components. The strength of diversity, however, lies in the fact that the different implementations guard against experiencing the same fault again and again: a defect in one version need not be present in the others.
Mobile Code Approach
The mobile code approach is one in which the system is "structured in terms of programs that migrate from host to host" (Schneider). According to Cornell University, "the paradigm is ideal for supporting fault-tolerance and security" (Schneider). Mobile code relies on protocols built on cryptographic principles, which in turn facilitate code migration. When mobile code becomes a more stable and widely used technology, it promises to be an elegant approach.
Three Questions to Ask When Determining Levels of Fault Tolerance
1. To what degree can the system continue to function with the component's capacity decreased or eliminated?
2. To what extent might the component in question fail?
3. How costly will it be to engineer fault tolerance for a given system aspect?
Overall, fault tolerant design can be immensely useful to the field of information technology. While distributed systems and mobile code may not be the perfect solution for every organization, networks that have not incorporated these aspects would probably benefit from doing so when data integrity and retention are of utmost importance.
References
"Distributed System." Glossary of Networking Terms for Visio IT Professionals. Online. 6 Mar. 2009. Available: http://www.microsoft.com/technet/archive/visio/visio2002/plan/glossary.mspx?mfr=true
"Fault Tolerant Design." NASA: Preferred Reliability Practices. Online. 6 Mar. 2009. Available: http://www.klabs.org/DEI/References/design_guidelines/design_series/1246.pdf
"RAID." IT & ITIL based Glossary of Terms. Online. 6 Mar. 2009. Available: http://servicedesk.unimelb.edu.au/knowledgebase/itservices/a-z/r.html#top
"Redundancy." IT & ITIL based Glossary of Terms. Online. 6 Mar. 2009. Available: http://servicedesk.unimelb.edu.au/knowledgebase/itservices/a-z/r.html#top
Schneider, F. Foundations and Support for Survivable Systems. Online. 6 Mar. 2009. Available: http://www.cs.cornell.edu/Info/People/fbs/Arpa.DIW96.smry.html
The copyright of the article Best Practices for Fault Tolerant Design in IT in Internet is owned by Michael Davis. Permission to republish Best Practices for Fault Tolerant Design in IT in print or online must be granted by the author in writing.