Burgess M.Principles of network and system administration.2004.pdf


CHAPTER 8. DIAGNOSTICS, FAULT AND CHANGE MANAGEMENT

in expediting multiple connections (many multi-threaded servers set limits on the number of threads allowed, so as not to run a machine into the ground in the event of spamming). These measures help to reduce the need for retransmission of TCP segments and timeouts on connection. Assuming that the network interface is working as fast as it can (see previous section), a server will then respond as quickly as it can.

8.12 Principles of quality assurance

Quality assurance in service provision is a topic that is increasingly discussed in the world of network services (see section 10.8), but quality assurance is a process that has far wider implications than the commercially motivated issue of value for money. A system administrator also performs a service for the system and for users. Quality assurance takes up three related issues:

Accuracy of service (result)

Efficiency of service (time)

Predictability (result/time).
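As a rough illustration, the three issues can be turned into numbers for a monitored service. The sketch below is only one possible reading: it assumes that each request is logged as a hypothetical pair (was the result correct, how long did it take), and it takes the spread of response times as a stand-in for predictability. The function name `summarize_service` and the sample values are invented for the example.

```python
from statistics import mean, pstdev

def summarize_service(samples):
    """Summarize a list of (correct, seconds) pairs, one pair per request.

    accuracy    -> fraction of requests with a correct result
    mean_time   -> average response time (efficiency)
    time_spread -> standard deviation of response time (assumed
                   here as a simple proxy for predictability)
    """
    times = [seconds for _, seconds in samples]
    accuracy = sum(1 for ok, _ in samples if ok) / len(samples)
    return {
        "accuracy": accuracy,
        "mean_time": mean(times),
        "time_spread": pstdev(times),
    }

# Hypothetical measurements: three correct replies and one slow failure.
samples = [(True, 0.20), (True, 0.22), (False, 0.90), (True, 0.18)]
summary = summarize_service(samples)
```

A small spread with a poor mean describes a service that is slow but predictable; a good mean with a large spread describes one that is fast but unreliable.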

8.12.1 ISO 9000 series

The ISO 9000 series of standards represents an international consensus on management practices that apply to any process or organization. The aim of the standards is to provide a schematic quality management system and a framework for continual assessment and improvement. ISO 9000 has become quite important in some sectors of industry, in the countries that have adopted it.

First published in 1987, the ISO 9000 standards are widely used and a quick search of the net reveals that they are also money-making enterprises. Courses in these methods are numerous and costly. The principles, however, are straightforward. The idea is that a standard approach to quality assurance leads to less uncertainty in the outcome. Quality is associated with certainty. Here, we shall not dwell on the issue of ISO 9000 certification, but rather on the guiding principles that the standard embodies.

8.12.2 Creating a quality control system

Quality is clearly a subjective criterion. It is a matter for policy to decide what quality means. Quality control is an iterative process, with a number of key elements. It is a process, rather than a one-off task, because the environment in which we execute our work is never static. Even as we plan our quality handbooks and verification forms, the world is changing and has made them partially obsolete.

Principle 51 (Rapid maintenance). The speed of response to a problem can be crucial to its success or failure, because the environment is constantly changing the conditions for work. If one procrastinates, procedures will be out of date, or inappropriate.


ISO 9000 reiterates one of the central messages of system administration and security: namely that they are on-going, dynamical processes rather than achievable goals (see figure 8.16).

Determine quality goals: One begins by determining policy: what is it that we wish to accomplish? Until we know this, we cannot set about devising a strategy to accomplish the goals.

Assess the current situation: We need to know where we stand, in order to determine how to get where we are going. How much work will it take to carry out the plan?

Devise a strategy: Strategy determination is a complex issue. Sometimes one needs to back-track in order to go forward. This is reminiscent of the story of the stranger who comes to a city and asks a local how to get to the post office. The local shakes his head and replies ‘If I were going to the Post Office, I certainly wouldn’t start from here’. Clearly, this is not a helpful observation. We must always find a way to achieve our goals, even if it means first back-tracking to a more useful starting point.

Project management: How we carry out a process is at least as important as the process itself. If the process is faulty, the result will be faulty. Above all, there must be progress. Something has to happen in order for something good to happen. Often, several actors collaborate in the execution of a project. Projects cost resources to execute – how will this be budgeted? Are resources adequate for the goals specified?

Documentation and verification: A key reason for system failure is when a system becomes so complex that its users can no longer understand it. Humans, moreover, are naturally lazy, and their performance with regard to a standard needs to be policed. Documentation can help prevent errors and misunderstandings, while verification procedures are essential for ensuring the conformance of the work to the quality guidelines.

Fault-handling procedure: Quality implies a line between the acceptable and unacceptable. When we discover something that falls short of the mark, we need a procedure for putting the problem right. That procedure should itself be quality assured, hence we see that quality assurance has a feedback structure. It requires self-assessment.
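The verification and fault-handling steps above can be sketched as a checklist that compares measurements with agreed tolerances. This is a minimal sketch under stated assumptions: the check names, the tolerance ranges and the function `verify` are all hypothetical, and a real scheme would take its limits from local policy.

```python
def verify(measurements, tolerances):
    """Return the names of checks that fall outside their tolerance.

    tolerances maps a check name to an inclusive (low, high) range.
    A check with no measurement counts as a failure, since it
    cannot be shown to conform.
    """
    failures = []
    for name, (low, high) in tolerances.items():
        value = measurements.get(name)
        if value is None or not (low <= value <= high):
            failures.append(name)
    return failures

# Hypothetical policy limits and one round of measurements.
tolerances = {
    "disk_used_fraction": (0.0, 0.9),
    "backup_age_hours": (0.0, 24.0),
}
measurements = {"disk_used_fraction": 0.95, "backup_age_hours": 6.0}

# Each failed check would be handed to the fault-handling procedure.
failures = verify(measurements, tolerances)
```

Running the same checklist after every repair gives the feedback structure described above: the remedy is itself verified against the same tolerances.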

In principle 40, we found that standardization leads to predictability. It can also lead to limitations, but we shall assume that this problem can also be dealt with by a quality assurance programme.

The formulation of a quality assurance scheme is not something that can be done generically; one needs expert insight into specific issues, in order to know and evaluate the limitations and likely avenues for error recovery. Quality assurance involves:

1. A definition of quality.

2. A fault tree or cause tree analysis for the system quality.

3. Formulating a strategic remedial policy.

4. The formalization of remedies as a checklist.

5. Acknowledging and accepting inherent system limitations.

6. Checklists to document compliance with policy.

7. Examination of results and feedback into policy.

[Figure 8.16: Elements of a quality assurance system. Box labels: Policy goals, Strategic plan, Quality definition, Procedures, Verification, Methods, Documentation.]

Measurements of tolerances, uncertainties and limitations need to be incorporated into this procedure in a continual feedback process. Quality is achieved through this continued process: it is not an achievable goal, but rather a never-ending journey.
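The fault tree analysis named in the list above can be made quantitative once probabilities are assigned to the basic events. The sketch below shows one common way of combining them and rests on a strong assumption: all input events are statistically independent. The helper names `p_and` and `p_or` and every probability value are hypothetical, chosen only to show the arithmetic.

```python
from math import prod

def p_and(probs):
    """AND gate: every independent input event must occur."""
    return prod(probs)

def p_or(probs):
    """OR gate: at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical basic-event probabilities for a small tree.
p_physical = p_or([0.01, 0.02])   # either of two kinds of damage
p_hole = p_and([0.1, 0.5])        # two conditions that must coincide
p_top = p_or([p_physical, p_hole])
```

With these invented numbers the AND branch dominates the top event, which suggests where remedial effort would pay off first. If the events are correlated, the independence assumption fails and these formulas no longer apply.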

Exercises

Self-test objectives

1. What is meant by the principle of predictable failure?

2. Explain the meaning of ‘single point of failure’.

3. Explain how a meshed network can be both more robust and more susceptible to failure.

4. What is the ‘small worlds’ phenomenon and how does it apply to system administration?

5. Explain the principle of causality.

6. What is meant by an interaction?

7. How do interactions underline the importance of the principle of causality?


8. What is meant by the environment of a system?

9. How does one find the boundary between system and environment?

10. What kind of faults can occur in a human–computer system?

11. Describe some typical strategies for finding faults.

12. Describe some typical strategies for correcting faults.

13. Explain how a cause tree can be used to help locate problems in a system. What are the limitations of cause-tree analysis?

14. Explain how fault trees can provide predictive power for the occurrence of faults. What are the limitations of this predictive power?

15. Explain the relationship between change management and cause-tree analysis.

16. Explain the role of game theory in system management. Comment on its limitations.

17. Explain how game theory reveals the principle of communities by finding optimal equilibria.

18. What role does monitoring the system play in a rational decision-making process?

19. Explain the weakest link principle in performance analysis.

20. Explain how competition for resources can lead to wasted resources.

21. What is ISO 9000?

22. Describe some of the issues in quality control.

23. Explain how the rate of maintenance affects the likely state of a system.

Problems

1. Find out about process priorities. How are process priorities changed on the computer systems on your network? Formulate a policy for handling processes which load the system heavily. Should they be left alone, killed, rescheduled etc?

2. Describe the process you would use to troubleshoot a slowly running host. Formalize this process as an algorithm.

3. Suppose you are performance tuning, trying to find out why one host is slower than another. Write a program which tests the efficiency of CPU-intensive work only. Write programs which test the speed of memory-intensive work and disk-intensive work. Would comparing the time it takes to compile a program on the hosts be a good way of comparing them?


4. Determine the network transmission speed on the servers on your network. Are they as high as possible? Do they have auto-detection of the interface transmission rates on their network connections (e.g. 10 Mb/s or 100 Mb/s)? If not, how are they configured? Find out how you can choose the assumed transmission rate.

5. What is meant by an Ethernet collision? How might doubling the speed of all hosts on an Ethernet segment make the total system slower?

6. Consider the fault tree in figure 8.17.

[Figure 8.17: Partial fault tree for data loss due to backup failure. Top event: Data loss, joined by an OR gate to Read error, Physical damage, Timing hole and Software error. Lower gates: OR, AND. Leaf events: Magnet, Heat, Crinkle, Sched., RAID?]

(a) Given that the probability that data will be lost in a backup hole (data changed between scheduled backups) is approximately the same as the probability of physical media damage, what strategy would you suggest for improving security against data loss? Explain your answer.

(b) What security principle does RAID employ to protect data? Explain how RAID might be used at several places in this tree in order to help prevent data loss.

(c) Describe a fault tree for loss of service in a high availability web server placed in a server room. Describe how you would go about estimating the probabilities. Based on your analysis, concoct a number of long-term strategies for countering these failures; draw a provisional payoff matrix for these strategies versus the failure modes, and use this to estimate the most cost-effective long-term strategies.

(d) Design a change plan and schedule for upgrading 400 Windows hosts. Your plan should include a fault tree analysis for the upgrade and contingency plan for loss of some of the hosts.


7. Today CPU power is cheap; previously it was common for organizations to have to load users and services onto a single host with limited CPU.

(a) Describe as many strategies as you can that you might use to prevent users from hogging CPU-intensive services.

(b) Now imagine all of the possible strategies that selfish users might use to hog resources and describe these.

(c) Would you say that CPU is a zero-sum resource, i.e. that what is lost by one user is gained by the others?

(d) Estimate or argue the relative payoff to the selfish user for each of the pairs of strategies used by both parties, and thereby construct the payoff matrix for the system.

(e) By inspection, find the defensive strategies that minimize the payoff to the user.

(f) Use the minimax theorem to find the optimal strategy or strategies and compare your answer with the one you chose by inspection.
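For payoff-matrix problems of this kind, the inspection step can be mechanized. The sketch below is a hedged illustration rather than a full solution: it considers pure strategies only, assumes a zero-sum game, and uses an invented matrix in which rows are hypothetical defensive strategies and columns are hypothetical user strategies. When minimax and maximin differ there is no pure saddle point, and mixed strategies (usually found by linear programming) would be needed.

```python
def pure_minimax(payoff):
    """Find pure-strategy minimax and maximin for a zero-sum game.

    payoff[i][j] is the payoff to the user when the administrator
    plays row i and the user plays column j.
    Returns (row, column, minimax, maximin, has_saddle_point).
    """
    # The administrator picks the row whose worst case is smallest.
    row_worst = [max(row) for row in payoff]
    admin_i = min(range(len(payoff)), key=row_worst.__getitem__)

    # The user picks the column whose guaranteed payoff is largest.
    columns = list(zip(*payoff))
    col_floor = [min(col) for col in columns]
    user_j = max(range(len(columns)), key=col_floor.__getitem__)

    minimax, maximin = row_worst[admin_i], col_floor[user_j]
    return admin_i, user_j, minimax, maximin, minimax == maximin

# Hypothetical payoffs to the user (higher means more CPU hogged).
payoff = [
    [5, 8, 6],  # row 0: administrator does nothing
    [3, 4, 2],  # row 1: administrator enforces quotas
    [4, 7, 3],  # row 2: administrator kills heavy jobs
]
result = pure_minimax(payoff)
```

In this invented matrix the two values coincide, so the pair found by inspection is also the minimax solution.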