Nagios - System and Network Monitoring.pdf

Introduction

It’s ten o’clock on Monday morning. The boss of the branch office is in a rage. He’s been waiting for hours for an important e-mail, and it still hasn’t arrived. It can only be the fault of the mail server; it’s probably hung yet again. But a quick check of the computer shows that no mails have got stuck in the queue there, and there’s no mention either in the log file that a mail from the sender in question has arrived. So where’s the problem?

The central mail server of the company doesn’t respond to a ping. That’s probably the root of the problem. But the IT department at the company head office absolutely insists that it is not to blame. It also cannot ping the mail node of the branch office, but it maintains that the network at the head office is running smoothly, so the problem must lie with the network at the branch office. The search for the error continues...

The humiliating result: the VPN connection to head office was down, and although the ISDN backup connection was working, no route to the head office (and thus to the central mail server) was defined in the backup router. A globally operating IT service provider, for whom something like this “just doesn’t happen”, was responsible for the network connections (VPN and ISDN) between branch and head office. The end result: many hours spent searching for the error, an irritated boss (the meeting for which the e-mail was urgently required has long since finished), and a sweating admin.

With a properly configured Nagios system, the administrator would already have noticed the problem at eight in the morning and been able to isolate its cause within a few minutes. Instead of losing valuable time, the IT service provider would have been informed directly. The time then required to eliminate the error (in this case, half an hour) would have been sufficient to deliver the e-mail in time.

A second example: somewhere in Germany, the hard drive on which the central Oracle database for a hospital stores its log files reaches full capacity. Although this does not cause the “lights to go out” in the operating room, the database stops working and there is considerable disruption to work procedures: patients cannot be admitted, examination results cannot be saved, and reports cannot be documented until the problem has been fixed.

If the critical hard drive had been monitored with Nagios, the IT department would have been warned at an early stage. The problem would not even have occurred.

With personnel resources becoming more and more scarce, no IT department can really afford to regularly check all systems manually. Increasingly complex networks in particular make it essential to be informed early of disruptions that have occurred or of problems that are about to happen. Nagios, the Open Source tool for system and network monitoring, helps the administrator to detect problems before the phone rings off the hook.

The aim of the software is to inform administrators quickly about questionable (WARNING) or critical conditions (CRITICAL). What is regarded as “questionable” or “critical” is defined by the administrator in the configuration. A Web page summary then informs the administrator of normally working systems and services, which Nagios displays in green, of questionable conditions (yellow), and of critical situations (red). There is also the possibility of informing the administrators in charge—depending on specific services or systems—selectively by e-mail but also by paging services such as SMS.
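The idea of administrator-defined thresholds can be illustrated with a minimal Python sketch. The function name `classify`, the threshold values, and the assumption that higher values are worse are all hypothetical and chosen only for illustration; in Nagios itself the limits are set in the configuration and evaluated by the check programs.

```python
def classify(value, warn, crit):
    """Map a measured value onto a traffic-light state (higher is worse)."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


# Colours used by the Web summary for each state, as described in the text.
STATE_COLOURS = {"OK": "green", "WARNING": "yellow", "CRITICAL": "red"}

# Hypothetical CPU load limits: questionable from 4.0, critical from 8.0.
for load in (0.7, 4.2, 9.5):
    state = classify(load, warn=4.0, crit=8.0)
    print(f"load {load}: {state} ({STATE_COLOURS[state]})")
```

What counts as “questionable” or “critical” lives entirely in the two arguments `warn` and `crit`, which mirrors the way the administrator decides on these limits.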

By concentrating on traffic light states (green, yellow, red), Nagios is distinct from network tools that display values graphically over time (for example, the load on a WAN interface or a CPU throughout an entire day) or that record and measure network traffic (how high was the proportion of HTTP on a particular interface?). Nagios is concerned purely and simply with the question of whether everything is on a green light. The software does an excellent job of looking after this, not just in terms of the current status but also over long periods of time.

The tests

When checking critical hosts and services, Nagios distinguishes between host and service checks. A host check tests a computer, called a host in Nagios slang, for reachability; as a rule, a simple ping is used. A service check selectively tests individual network services such as HTTP, SMTP, DNS, etc., but also running processes, CPU load, or log files. Host checks are performed by Nagios irregularly and only where required, for example if none of the services to be monitored can be reached on the host being monitored. As long as at least one service there responds, the computer as a whole must be reachable, so the host check can be skipped.
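The on-demand nature of host checks can be sketched in a few lines of Python. This is a simplified illustration, not the scheduling logic of Nagios itself; the function `evaluate_host` and its arguments are hypothetical names introduced here.

```python
def evaluate_host(service_results, ping):
    """Decide the host state from its service results.

    service_results: dict mapping a service name to True (responded) or False.
    ping: a callable returning True if the host answers; it is only invoked
          when no service could be reached.
    """
    if any(service_results.values()):
        # At least one service responded, so the host must be reachable
        # and the explicit host check is skipped.
        return "UP"
    return "UP" if ping() else "DOWN"


# HTTP failed but SMTP answered: the ping is never sent.
print(evaluate_host({"HTTP": False, "SMTP": True}, ping=lambda: False))
# Nothing answered: only now is the host check carried out.
print(evaluate_host({"HTTP": False, "SMTP": False}, ping=lambda: False))
```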

The simplest test for network services consists of looking to see whether the relevant target port is open, and whether a service is listening there. But this does not necessarily mean that, for example, the SSH daemon really is running on TCP port 22. Nagios therefore uses tests for many services that go several steps further. For SMTP, for example, the software tests whether the mail server also announces itself with a “220” output, the so-called SMTP greeting; and for a PostgreSQL database, it checks whether this will accept an SQL query.
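The difference between a bare port test and a test of the SMTP greeting can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code of the SMTP plugin shipped with Nagios; the function names and the decision to report every non-220 answer as CRITICAL are assumptions.

```python
import socket


def is_smtp_greeting(banner):
    """True if the first line sent by the server starts with reply code 220."""
    return banner.startswith("220")


def check_smtp(host, port=25, timeout=10.0):
    """Connect to the mail server and inspect its greeting.

    Returns a (state, message) tuple. An open port alone is not enough:
    the server must also announce itself with "220".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(1024).decode("ascii", errors="replace")
    except OSError as exc:  # refused, timed out, name not resolvable, ...
        return "CRITICAL", f"cannot connect to {host}:{port}: {exc}"
    if is_smtp_greeting(banner):
        return "OK", banner.strip()
    # Assumption for this sketch: any other answer is treated as critical.
    return "CRITICAL", f"unexpected greeting: {banner.strip()!r}"
```

A call such as `check_smtp("mail.example.com")` (a placeholder host name) would then distinguish a listening but misbehaving service from a healthy mail server.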

Nagios becomes especially interesting because it takes into account dependencies in the network topology (if it is configured to do so). If the target system can only be reached through a particular router that has just gone down, then Nagios reports that the target system is “unreachable”, and does not bother to bombard it with further host and service checks. The software puts administrators in a position where they can more quickly detect the actual cause and rectify the situation.
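How a topology dependency separates “down” from “unreachable” can be sketched in Python. The data structures below (a `parents` mapping and a set of responding hosts) are hypothetical and serve only to illustrate the principle; they are not the way Nagios stores its configuration.

```python
def host_state(host, parents, responds):
    """Classify a host as UP, DOWN, or UNREACHABLE.

    parents:  dict mapping each host to the device it is reached through
              (None for hosts directly next to the monitoring server).
    responds: set of hosts that currently answer their host check.
    """
    if host in responds:
        return "UP"
    parent = parents.get(host)
    while parent is not None:
        if parent not in responds:
            # A device on the path has failed, so nothing can be said
            # about the host itself.
            return "UNREACHABLE"
        parent = parents.get(parent)
    return "DOWN"


# Hypothetical topology: the mail server sits behind a router.
parents = {"router": None, "mailserver": "router"}
print(host_state("mailserver", parents, responds=set()))       # router failed too
print(host_state("mailserver", parents, responds={"router"}))  # only the server failed
```

In the first call the router is the actual cause, so the mail server is reported as unreachable; in the second the router answers, which pins the fault on the mail server itself.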

The suppliers of information

The great strength of Nagios, even in comparison with other network monitoring tools, lies in its modular structure: the Nagios core does not contain one single test. Instead it uses external programs for service and host checks, which are known as plugins. The basic equipment already contains a number of standard plugins for the most important application cases. Special requirements that go beyond these can be met, provided that you have basic programming knowledge, by plugins that you write yourself. Before you invest time developing these, however, it is worth first taking a look on the Internet and browsing through the relevant mailing lists,1 as there is lively activity in this area. Ready-to-use plugins are available, especially on the Nagios exchange platform, http://www.nagiosexchange.org/.

A plugin is a simple program, often just a script (Bash, Perl, etc.), that returns one of the four possible states OK, WARNING, CRITICAL, or (with operating errors, for example) UNKNOWN.
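A plugin of this kind might look roughly like the following Python sketch. It relies on the usual plugin convention of reporting the state through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) together with one line of text; the name `check_disk_sketch`, the argument order, and the percentage thresholds are assumptions made for this example.

```python
import shutil

# Conventional plugin exit codes for the four states.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def check_disk(path, warn_pct, crit_pct):
    """Return (state, message) for the fill level of the file system at path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return "UNKNOWN", f"cannot read {path}: {exc}"
    if usage.total == 0:
        return "UNKNOWN", f"{path} reports a size of zero"
    used_pct = 100.0 * usage.used / usage.total
    if used_pct >= crit_pct:
        state = "CRITICAL"
    elif used_pct >= warn_pct:
        state = "WARNING"
    else:
        state = "OK"
    return state, f"{path} is {used_pct:.1f}% full"


def main(argv):
    """Print one status line and return the exit code for the state."""
    try:
        path, warn, crit = argv[1], float(argv[2]), float(argv[3])
    except (IndexError, ValueError):
        # Operating errors such as wrong arguments are reported as UNKNOWN.
        print("UNKNOWN - usage: check_disk_sketch PATH WARN_PCT CRIT_PCT")
        return EXIT_CODES["UNKNOWN"]
    state, message = check_disk(path, warn, crit)
    print(f"{state} - {message}")
    return EXIT_CODES[state]

# When run as a script, the plugin would end with:
#     import sys; sys.exit(main(sys.argv))
```

Because the result is carried by nothing more than an exit code and a line of output, the same pattern applies to any quantity that can be measured, whatever language the plugin is written in.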

This means that in principle Nagios can test everything that can be measured or counted electronically: the temperature and humidity in the server room, the amount of rainfall, the presence of persons in a certain room at a time when nobody should enter it. There are no limits to this, provided that you can find a way of providing measurement data or events as information that can be evaluated by computer (for example, with a temperature and humidity sensor, an infrared sensor, etc.). Apart from the standard plugins, this book accordingly introduces further freely available plugins, such as the use of a plugin to query a temperature and humidity sensor in Chapter 19 from page 377.

Keeping admins up-to-date

Nagios possesses a sophisticated notification system. On the sender side (that is, with the host or service check) you can configure when which group of persons (the so-called contact groups) is informed about which conditions or events (failure, recovery, warnings, etc.). On the receiver side you can also define on multiple levels what is to be done with a corresponding message, for example whether the system should forward it, depending on the time of day, or discard the message.

1 http://www.nagios.org/support/mailinglists.php

If a specific service is to be monitored seven days a week round the clock, for example, this does not mean that the administrator in charge will never be able to take a break: instead, you can instruct Nagios to notify the person only from Mondays to Fridays between 8am and 5pm, every two hours at the most. If the administrator in charge is not able to solve the problem within a specified period of time, eight hours for example, then the head of department responsible should receive a message. This is also known as escalation management. The corresponding configuration is explained in Chapter 12.5 from page 231.
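The rules in this example (weekdays from 8am to 5pm, at most every two hours, escalation after eight hours) can be written down as a short Python sketch. In Nagios these rules are expressed in the configuration rather than in code; the function names and the way time is passed in are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def in_notification_period(now):
    """Monday to Friday, between 8am and 5pm."""
    return now.weekday() < 5 and 8 <= now.hour < 17


def should_notify(now, last_sent, interval=timedelta(hours=2)):
    """Notify only inside the period, and at most once per interval."""
    if not in_notification_period(now):
        return False
    return last_sent is None or now - last_sent >= interval


def should_escalate(now, problem_since, limit=timedelta(hours=8)):
    """Inform the head of department once the problem has lasted too long."""
    return now - problem_since >= limit


monday_9am = datetime(2024, 1, 8, 9, 0)
print(should_notify(monday_9am, last_sent=None))
print(should_notify(monday_9am + timedelta(hours=1), last_sent=monday_9am))
print(should_escalate(monday_9am + timedelta(hours=8), problem_since=monday_9am))
```

The service itself is still checked round the clock; only the decision whether a message is sent depends on the time.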

Nagios can also make use of freely configurable, external programs for notifications, so that you can integrate any system you like: from e-mail to SMS to a voice server that the administrator calls to hear a voice message concerning the error.

With its Web interface (Chapter 16 from page 273), Nagios provides the administrator with a wide range of information, clearly arranged according to the issues involved. Whether the admin needs a summary of the overall situation, a display of problematic services and hosts and the causes of network outages, or the status of entire groups of hosts or services, Nagios provides an individually structured information page for nearly every purpose.

Through the Web front end, an administrator who takes on a particular problem can let colleagues know, so that they can concentrate on other issues that have not yet been seen to. Information already obtained can be stored as comments on hosts and services, and scheduled downtimes can be recorded in the same way: Nagios suppresses false alarms during these periods.

By reviewing past events, the Web interface can reveal what problems occurred in a selected time interval, who was informed, what the situation was concerning the availability of a host and/or services during a particular time period—all this also taking account of downtimes, of course.

Taking in information from outside

For tests, notifications, etc., Nagios makes use of external programs, but the reverse is also possible: through a separate interface (see Section 13.1 from page 240), independent programs can send status information and commands to Nagios. The Web interface makes widespread use of this possibility, which allows the administrator to send interactive commands to Nagios. But a backup program unknown to Nagios can also report success or failure to Nagios, as can a syslog daemon; there is no limit to the possibilities here.
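How a backup program might report its result through this interface is sketched below in Python. The sketch assumes the external command `PROCESS_SERVICE_CHECK_RESULT` with the field layout `[timestamp] COMMAND;host;service;return code;output`, written to the command file (a named pipe) of Nagios; the details should be checked against Section 13.1. The host name, service name, and function names are hypothetical, and the path of the command file depends on the installation.

```python
import time

RETURN_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_passive_result(host, service, state, output, timestamp=None):
    """Build one external command line carrying a service check result."""
    if timestamp is None:
        timestamp = int(time.time())
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{RETURN_CODES[state]};{output}\n")


def submit(line, command_file):
    """Write the command to the Nagios command file.

    command_file is installation specific and must be supplied by the caller.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Hypothetical host and service names for a nightly backup job.
print(format_passive_result("backupsrv", "nightly_backup", "OK",
                            "backup finished", timestamp=1100000000), end="")
```

The program sending the result needs no knowledge of Nagios beyond this one line of text, which is what makes the interface usable for backup jobs, syslog daemons, and similar tools.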


Thanks to this interface, Nagios allows distributed monitoring. This involves several decentralized Nagios installations sending their test results to a central instance, which then helps to maintain an overview of the situation from a central location.

Other tools for network monitoring

Nagios is not the only tool for monitoring systems and networks. The most well-known “competitor,” perhaps on an equal footing, is Big Brother (BB). Despite a number of differences, its Web interface also serves the same purpose as that of Nagios: displaying to the administrator what is in the “green area” and what is not.

The reason why the author uses Nagios instead of Big Brother lies in the license for Big Brother, which the BB homepage2 calls the Better Than Free License: the product continues to be commercially developed and distributed. If you use BB and earn money with it, you must buy the software. The fact that the software, including the source code, may not be passed on or modified except with the explicit permission of the vendor means that it cannot be reconciled with the criteria for Open Source licenses. This means that Linux distributors have their hands tied.

For the graphical display of certain measured values over a period of time, such as the load on a network interface, CPU load, or the number of mails per minute, there are other tools that perform this task better than Nagios. The original tool is certainly the Multi Router Traffic Grapher MRTG,3 which, despite growing competition, still enjoys great popularity. The relatively young, but very powerful alternative is called Cacti4: this has a larger range of applications, can be configured via Web interface, and avoids the restrictions in MRTG, which can only display two measured values at the same time and cannot display any negative values.

Nagios itself can also display performance data graphically, using extensions (Chapter 17 from page 313). In many cases this is sufficient, but for very dedicated requirements, the use of Nagios in tandem with a graphic representation tool such as MRTG or Cacti is recommended.

About This Book

This book is directed at network administrators who want to find out about the condition of their systems and networks using an Open Source tool. It describes Nagios version 2.0, which is somewhat different from its predecessors in its configuration. The plugins, on the other hand, lead their own lives, are to a great extent independent of Nagios, and are therefore not restricted to a particular version.

2 http://www.bb4.org/

3 http://www.mrtg.org/

4 http://www.cacti.net/


Even though this book is based on Linux as the operating system for the Nagios computer, this is not a requirement. Most descriptions also apply to other Unix systems,5 although system-specific details such as start scripts need to be adjusted accordingly. Nagios currently does not work under Windows, however.

The first part of this book deals with getting Nagios up and running with a simple configuration, but one that is sufficient for many uses, as quickly as possible. This is why Chapters 1 through 3 do not have detailed descriptions and treatments of all options and features. These are examined in the second part of the book.

Chapter 4 looks at the details of service and host checks, and in particular introduces their dependency on network topologies.

The options available to Nagios for implementing service checks and obtaining their results are described in Chapter 5.

This is followed by the presentation of individual standard plugins and a number of additional, freely obtainable plugins: Chapter 6 takes a look at the plugins that inspect the services of a network protocol directly from the Nagios host, while Chapter 7 summarizes plugins that need to be installed on the machine that is being monitored, and for which Nagios needs additional utilities to get them running. Several auxiliary plugins, which do not perform any tests themselves, but manipulate already established results, are introduced in Chapter 8.

Two utilities that Nagios requires to run local plugins on remote hosts are introduced in the two subsequent chapters: Chapter 9 describes the Secure Shell (SSH), while Chapter 10 introduces a daemon developed specifically for Nagios.

Wherever networks are being monitored, SNMP also needs to be implemented. Chapter 11 not only describes SNMP-capable plugins but also examines the protocol and the SNMP world itself in detail, providing the background knowledge needed for this.

The Nagios notification system is introduced in Chapter 12, which also deals with notification using SMS, escalation management, and taking account of dependencies.

The interface for external commands is discussed in Chapter 13; this forms the basis of other Nagios mechanisms, such as the Nagios Service Check Acceptor (NSCA), a client-server mechanism for transmitting passive test results, covered in Chapter 14. The use of this is shown in two concrete examples—integrating syslog-ng and processing SNMP traps. NSCA is also a requirement for distributed monitoring, discussed in Chapter 15.

Even though you may have already used the Web interface, you might still be wondering about all the detailed options that this offers. Chapter 16 tries to answer this question as completely as possible, supported by very helpful screenshots. It also describes a series of parameters which until now have not been documented anywhere, except in the source code.

5 For example, *BSD, HP-UX, AIX, and Solaris; the author does not know of any Nagios versions running under MacOS X.

Although in its operation, Nagios concentrates primarily on traffic light signals (red-yellow-green), there are ways of evaluating and representing the performance data provided by plugins, which are described in detail in Chapter 17.
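How performance data can be separated from the status text is sketched below in Python. The sketch assumes the common plugin convention of appending the data after a “|” character in the form `label=value[unit];warn;crit;min;max`; quoted labels containing spaces are not handled, and the function name `parse_perfdata` and the sample output line are made up for illustration.

```python
import re

# label=value[unit], optionally followed by ;warn;crit;min;max (ignored here).
_PERF_ITEM = re.compile(r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;]*)")


def parse_perfdata(plugin_output):
    """Split plugin output into status text and a dict of measured values.

    Returns (text, {label: (value, unit)}).
    """
    text, _, perf = plugin_output.partition("|")
    metrics = {}
    for item in perf.split():
        match = _PERF_ITEM.match(item)
        if match:
            metrics[match.group("label")] = (float(match.group("value")),
                                             match.group("unit"))
    return text.strip(), metrics


# Made-up output of a ping check with round trip time and packet loss.
text, metrics = parse_perfdata("PING OK - RTA = 0.8 ms | rta=0.8ms;100;500;0 pl=0%;20;60")
print(text)
print(metrics)
```

The traffic light state still comes from the exit code of the plugin; the numbers behind the “|” are simply passed on so that another tool can store and plot them.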

Networks are rarely homogeneous, that is, equipped only with Linux and other Unix-based operating systems. For this reason Chapter 18 demonstrates what utilities can be used to integrate and monitor Windows systems.

Chapter 19 uses the example of a low-cost hardware sensor to show how room temperature and humidity can be monitored simply yet effectively.

Nagios can also monitor proprietary commercial software, as long as mechanisms are available that can be integrated into a plugin to query the states of the system. In Chapter 20, this is described using an SAP-R/3 system.

The appendix “Nagios Configuration” introduces all the parameters of the two central configuration files nagios.cfg and cgi.cfg, while “Rapidly Changing States: Flapping” and “EventHandler” are devoted to some useful but somewhat exotic features.

Further notes on the book

At the time of going to press, Nagios 2.0 is close to completion. When this book is on the market, there could well be some modifications. Relevant notes, as well as corrections, in case some errors have slipped into the book, can be found at http://linux.swobspace.net/books/nagios/.

Note of Thanks

Many people have contributed to the success of this book. My thanks go first of all to Dr. Markus Wirtz, who initiated this book with his comment, “Why don’t you write a Nagios book, then?!”, when he refused to accept my Nagios activities as an excuse for delays in writing another book. I would also like to thank the two technical editors, Steffen Waitz and Jörg Linge, for their support. A very special thanks goes to Patricia Jung, who, as the technical editor for the German language version, overhauled the manuscript and pestered me with thousands of questions, which was a good thing for the completeness of the book, and which has ultimately made it easier for the reader to understand.


From Source Code to a Running Installation
