Important Concepts

A description of the important concepts and ideas within Opsview

Overview

This page lists important concepts and ideas you should understand to make full use of Opsview.

Hosts and Services

A service is something that is important to you, that you want to know the status of. Services can be "active" (checked on a regular basis) or "passive" (waits to be given data). This document will focus only on "Active" Service Checks.

All Services (also called a Check or a Service Check) have a status, one line of output and (optionally) some performance data.

Hosts are a logical grouping for a set of services (normally associated with a single device on a network). Services have to be associated with a host - they cannot exist without a host.

Services are regularly checked based upon their check interval. Each Service is checked independently of other Services and all have their own timing schedules.

Hosts are also checked. This can be on a regular basis (if a host check interval is defined for the Host). A Host will also be checked “on-demand” by the monitoring engine whenever one of its Services has changed state.

Note: Hosts without any services will not be shown in the monitoring status pages.

See more details within the Hosts, Host Groups and Host Check Commands section.

States

Services have one of 4 possible States:

  • OK - Everything is fine
  • CRITICAL - Something is wrong
  • WARNING - Something may be wrong
  • UNKNOWN - There is some internal error with the check such as incorrect parameters, or there is a dependency failure

The last 3 states are collectively called Problem States.

Hosts have one of 3 possible States:

  • UP - Host is okay
  • DOWN - Host has a problem
  • UNREACHABLE - All parents of this host are in a failure state. This is a calculated state based on the parent/child relationship dependency of a Host

If a Host is DOWN, then the Services on the Host will be marked as CRITICAL with the summary text of "Dependency failure: Host X is DOWN". The Services will no longer be executed until the Host has returned to an UP state.

If a Host is UNREACHABLE, then it will be marked with the summary text of "Dependency failure: Host X is DOWN". The Host will not be checked again until at least one of its parents is UP. None of the Services on the Host will be checked until the Host returns to an UP state.
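The host state rules above can be sketched as a small function. This is an illustration only, not Opsview's implementation: the function name, its parameters and the list-of-strings representation of parent states are assumptions made for the example.

```python
# Host states that count as a failure for dependency purposes.
FAILURE_STATES = {"DOWN", "UNREACHABLE"}


def host_state(own_check_passed, parent_states):
    """Return UP, DOWN or UNREACHABLE for a host.

    parent_states is a list of the current states of the host's parents
    (empty if the host has no parents). Hypothetical helper for illustration.
    """
    # UNREACHABLE is calculated: every parent must be in a failure state.
    if parent_states and all(state in FAILURE_STATES for state in parent_states):
        return "UNREACHABLE"
    # Otherwise the host's own check result decides between UP and DOWN.
    return "UP" if own_check_passed else "DOWN"
```

A host with two parents stays DOWN rather than UNREACHABLE as long as one parent is still UP, which matches the rule that the Host is checked again once at least one of its parents is UP.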

Plugins

All active checks use a plugin. This plugin will have the actual logic to know how to check something to determine its Status. For example, a plugin will know how to communicate with a DNS server, or how to interrogate for free filesystem space, or how to get a web page.

The same plugin can be used many times for different services. It takes parameters to determine what to check or what the threshold levels are.

The parameters available are dependent on the plugin used.

After a plugin has run, it must return a status code to Opsview - which maps to one of the OK, WARNING, CRITICAL or UNKNOWN statuses. It must also return some summary text.

The plugin may also return some optional performance data which Opsview will record and can later be used in performance graphing.

Opsview supports Nagios compatible plugins.
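As an illustration of what a plugin returns, the following is a minimal sketch of a Nagios-compatible free-disk-space check. The exit codes 0, 1, 2 and 3 are the standard Nagios plugin codes for OK, WARNING, CRITICAL and UNKNOWN. The function names, default thresholds and the "DISK" label are assumptions chosen for this example and do not describe any plugin shipped with Opsview.

```python
import shutil

# Standard Nagios plugin exit codes and the states they map to.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(free_pct, warn_pct, crit_pct):
    """Map a free-space percentage to (exit_code, output_line)."""
    if crit_pct > warn_pct:
        # Incorrect parameters are an internal error of the check, so UNKNOWN.
        return UNKNOWN, "DISK UNKNOWN - critical threshold must not exceed warning threshold"
    if free_pct < crit_pct:
        code = CRITICAL
    elif free_pct < warn_pct:
        code = WARNING
    else:
        code = OK
    summary = f"DISK {STATE_NAMES[code]} - {free_pct:.1f}% free"
    # Optional performance data follows the "|" separator.
    perfdata = f"free_pct={free_pct:.1f}%"
    return code, f"{summary} | {perfdata}"


def check_disk(path="/", warn_pct=20.0, crit_pct=10.0):
    """Measure free space on `path` and evaluate it against the thresholds."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    return evaluate(100.0 * usage.free / usage.total, warn_pct, crit_pct)
```

A real plugin script would print the output line and then exit with the returned code, for example by calling check_disk("/"), printing the second element and passing the first to sys.exit.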

See more details within the Active Checks section.

Check Intervals and State Types

When the active check for a service runs, it is executed on a set frequency (by default 5 minutes). This is called the check interval.

Usually, services are in an OK state, showing that the service is stable. However, if a problem occurs and the service changes to a different state, we need to have confidence that this is the correct state. We use state types to highlight this confidence factor.

Services can have one of two state types:

  • Hard - when a service has been in a specific state for a number of checks
  • Soft - when a service has just switched to a different state

There are two important parameters to determine the soft and hard state types:

  • retry interval - during a soft state, the next scheduled check will be after this interval, rather than the check interval
  • maximum check attempts - this is the number of times a check has to be in the same state before it becomes a hard state type

The check attempts will be displayed as, for example, 3/5, which means this is the third check out of a maximum of five before the state becomes hard.

When a service has gone into a hard state type, then the check attempts will revert to 1.

Note: If a service changes from one problem state to another, the check attempts are reset.

This same logic also applies for an OK state.

The main reason for state types is that notifications are sent on hard states only. This avoids sending notifications for temporary problems.
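The soft and hard state rules described in this section can be sketched as a single transition function. This is a simplified model of the text above, not the monitoring engine's code: the CheckStatus type, the advance function and the treatment of a maximum of one check attempt (a new state becomes hard immediately) are assumptions.

```python
from typing import NamedTuple


class CheckStatus(NamedTuple):
    state: str    # OK, WARNING, CRITICAL or UNKNOWN
    attempt: int  # current check attempt, displayed as attempt/max
    hard: bool    # True for a hard state type, False for soft


def advance(prev, result, max_attempts):
    """Return the status after a new check result arrives."""
    if result != prev.state:
        # Any change of state (including problem to problem) resets the attempts.
        return CheckStatus(result, 1, max_attempts == 1)
    if prev.hard:
        # Already hard and unchanged: attempts have reverted to 1.
        return CheckStatus(result, 1, True)
    # Soft and unchanged: count up until the maximum is reached.
    attempt = prev.attempt + 1
    return CheckStatus(result, attempt, attempt >= max_attempts)
```

For example, starting from CheckStatus("OK", 1, True) with a maximum of 3 check attempts, three consecutive WARNING results produce attempts 1/3 (soft), 2/3 (soft) and 3/3 (hard).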

Notifications

Notifications are sent on hard state changes only. This means notifications will be sent for hosts or services only when they have been in a particular state for a "check attempts * retry interval" amount of time.

Notifications are also sent when a host/service returns to an OK hard state. This is called a hard recovery notification.

Notifications are executed in parallel.

Notifications are suppressed if:

  • the host/service is in downtime (for a planned outage)
  • the host/service is in an acknowledged state (for an unplanned outage)
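The notification rules above can be condensed into one decision function. This is a sketch of the rules as stated in this section only; the function and parameter names are assumptions, and other settings that may influence notifications are not modelled.

```python
def should_notify(is_hard, state, last_hard_state, in_downtime=False, acknowledged=False):
    """Decide whether a check result should trigger a notification.

    Hypothetical helper: it covers only hard state changes and the two
    suppression conditions listed above.
    """
    # Notifications are sent on hard state changes only (including recovery).
    hard_state_change = is_hard and state != last_hard_state
    # Downtime and acknowledgement both suppress the notification.
    return hard_state_change and not (in_downtime or acknowledged)
```

A soft CRITICAL result never notifies, a hard CRITICAL following a hard OK does, and a hard OK following a hard CRITICAL is the hard recovery notification.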

See more details within the Notifications section.

Event Handlers

An event handler is an external script that is executed when a check result is returned. There are three possible options:

  • No event handler defined
  • Event handler, with "Always execute" off - the event handler will execute after every check in a problem state, including the first state change back to OK/UP
  • Event handler with "Always execute" on - the event handler will execute for every check, regardless of state

Event handlers are executed in parallel.
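The three options can be expressed as a small decision function. This is a sketch based only on the descriptions above; the function name and parameters are assumptions.

```python
# States that are not problems, for services and hosts respectively.
HEALTHY_STATES = {"OK", "UP"}


def should_run_handler(handler_defined, always_execute, state, previous_state):
    """Decide whether the event handler runs for this check result."""
    if not handler_defined:
        return False
    if always_execute:
        # "Always execute" on: run for every check, regardless of state.
        return True
    in_problem = state not in HEALTHY_STATES
    # The first result back in OK/UP after a problem also triggers the handler.
    just_recovered = state in HEALTHY_STATES and previous_state not in HEALTHY_STATES
    return in_problem or just_recovered
```

With "Always execute" off, a second consecutive OK result does not run the handler, because the service is neither in a problem state nor has it just recovered.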

See more details within the Event Handlers section.

Lifecycle of a Service

This shows the lifecycle of a service that transitions from WARNING to CRITICAL and back to OK. It assumes the service is checked every 31s, with a retry interval of 20s and a maximum of 3 check attempts:

Time      State     Check Attempt
14:08:08  OK        1/3
14:08:39  WARNING   1/3
14:08:59  WARNING   2/3
14:09:19  WARNING   3/3
14:09:50  WARNING   1/3
14:10:21  CRITICAL  1/3
14:10:41  CRITICAL  2/3
14:11:01  CRITICAL  3/3
14:11:32  CRITICAL  1/3
14:12:03  CRITICAL  1/3
14:12:34  OK        1/3
14:12:54  OK        2/3
14:13:14  OK        3/3
14:13:45  OK        1/3
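The timestamps in this lifecycle follow from one scheduling rule: after a soft result the next check is due after the retry interval (20s), and after a hard result it is due after the check interval (31s). The sketch below recomputes the times from that rule. The soft/hard annotation on each row is inferred from the rules in the Check Intervals and State Types section and is not part of the original table; the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(seconds=31)
RETRY_INTERVAL = timedelta(seconds=20)

# (state, check attempt, state type); the state type column is an inferred annotation.
LIFECYCLE = [
    ("OK", "1/3", "hard"),
    ("WARNING", "1/3", "soft"), ("WARNING", "2/3", "soft"),
    ("WARNING", "3/3", "hard"), ("WARNING", "1/3", "hard"),
    ("CRITICAL", "1/3", "soft"), ("CRITICAL", "2/3", "soft"),
    ("CRITICAL", "3/3", "hard"), ("CRITICAL", "1/3", "hard"),
    ("CRITICAL", "1/3", "hard"),
    ("OK", "1/3", "soft"), ("OK", "2/3", "soft"),
    ("OK", "3/3", "hard"), ("OK", "1/3", "hard"),
]


def schedule(start, rows):
    """Return the check time (HH:MM:SS) for each row, starting at `start`."""
    times = [start]
    for _state, _attempt, state_type in rows[:-1]:
        # The state type of a result decides when the next check is due.
        step = RETRY_INTERVAL if state_type == "soft" else CHECK_INTERVAL
        times.append(times[-1] + step)
    return [t.strftime("%H:%M:%S") for t in times]


TIMES = schedule(datetime(2000, 1, 1, 14, 8, 8), LIFECYCLE)
for when, (state, attempt, state_type) in zip(TIMES, LIFECYCLE):
    print(when, state, attempt, state_type)
```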

Dependency Failure

When setting up Hosts and Host Services in Opsview Cloud, it is possible to set up dependency relationships such that the state of one object can affect the state of the objects below it in a dependency tree. To learn more about setting up object dependencies read Active Checks - Details Tab: Advanced.

Parent/Child relationship example

A Parent and Child relationship is useful, for example, between a Virtual Machine (VM) management server host and the VMs running under that host. In this case the VM management server host would be set as the parent host, and the VMs running on it would be the children. If the parent management host goes DOWN, the child VMs are set to UNREACHABLE, with a message in the Investigate window indicating the dependency failure and the parent host responsible. Likewise, the service checks of both parent and child hosts go into dependency failure and are set to CRITICAL.

An example of a useful Parent and Child relationship applied to service checks is one between a parent SNMP agent check and child SNMP interface checks. The parent service checks that the SNMP agent is up, while the children monitor the SNMP interfaces. Should the SNMP agent fail, there is no point in monitoring the interfaces as these won’t be up. Thus, by making the SNMP agent check the parent, we ensure that when it goes CRITICAL the monitoring of the child services is halted and they go to the UNKNOWN state.

What triggers a dependency failure?

A dependency failure is triggered on a monitoring object when its direct parent or a parent object higher up the dependency tree changes to a failure state. For a parent Host, this means going into a DOWN or UNREACHABLE state and for a parent Service, this means going into a CRITICAL state.

Dependency failure behavior

Summary Messages and State Change

A monitoring object in dependency failure will have its status information updated to display the reason the object entered this state. When the parent host goes DOWN, the child hosts go into the UNREACHABLE state and the status information in the Investigate window will show "Dependency failure: {Parent Host Name} is DOWN".

When the parent host goes DOWN or UNREACHABLE, the service checks go to CRITICAL state and the status information will say "Dependency failure: {Parent Host Name} is DOWN". Opsview tracks the dependency failure to the highest parent host that caused it, so even service checks belonging to the child hosts will reference that parent as the cause of the dependency failure.

When a parent service goes CRITICAL, the child services will go to UNKNOWN state and the status information will say "Dependency failure: {Parent Service Name} is CRITICAL".
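The way a host dependency failure is traced to the highest failed parent can be sketched as a walk up the host tree. This is a simplified model, not Opsview's implementation: it assumes each host has at most one parent and that the tree has no cycles, and the function names, the parents mapping and the example host names are all hypothetical.

```python
def root_cause(host, parents, down_hosts):
    """Return the highest DOWN host on the path from `host` upwards, or None.

    parents maps a host name to its single parent (a simplifying assumption).
    """
    cause = None
    current = host
    while current is not None:
        if current in down_hosts:
            cause = current  # keep overwriting so the highest failure wins
        current = parents.get(current)
    return cause


def service_status(host, parents, down_hosts, checked_state, checked_output):
    """Return (state, summary) for a service on `host`."""
    cause = root_cause(host, parents, down_hosts)
    if cause is None:
        # No failed host above: the service keeps its own check result.
        return checked_state, checked_output
    return "CRITICAL", f"Dependency failure: {cause} is DOWN"


# Hypothetical tree: core-switch -> vmhost -> vm1
PARENTS = {"vm1": "vmhost", "vmhost": "core-switch"}
print(service_status("vm1", PARENTS, {"core-switch"}, "OK", "HTTP OK"))
```

Here the service on vm1 reports core-switch as the cause, even though its direct parent is vmhost.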

No active checks

An object under dependency failure will not have any active checks run. This means it will stay under dependency failure until its parent(s) exit their failure states.

Triggering a manual recheck on the object will run the check a single time. When the check is next due to run, the object re-enters the dependency failure state.

Freshness Checking & Stale Service Checks

When enabled, freshness checking will ensure that results have been received recently for service checks. If a result has not been received recently enough (as determined by the freshness checking configuration options), then the service check will be marked as Stale.

For active service checks and SNMP checks, freshness checking will only occur in the time period configured for the service check.

For passive checks, freshness checking will only occur in the time period configured for the host that the service check belongs to.
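The freshness rule can be sketched as a comparison between the age of the last result and a threshold. The freshness_threshold parameter and the in_time_period flag are assumptions standing in for the actual configuration options, which are not described in this section.

```python
from datetime import datetime, timedelta


def is_stale(last_result_at, now, freshness_threshold, in_time_period):
    """Return True if a service check should be marked as Stale.

    in_time_period is True when `now` falls inside the relevant time period:
    the service check's period for active and SNMP checks, or the host's
    period for passive checks.
    """
    if not in_time_period:
        # Freshness checking only occurs inside the configured time period.
        return False
    return now - last_result_at > freshness_threshold
```

For example, with a hypothetical threshold of 15 minutes, a result received 30 minutes ago is stale inside the time period and is not evaluated outside it.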

Notifications and Event Handlers

Because a Host or Service under Dependency Failure is disabled, no new checks are scheduled for it. This means that no notifications or event handlers will run for these objects.

Recovery from dependency failure

When a host or service recovers (returns to a hard UP/OK state), its child services return to active checking and lose the “Dependency failure:” summary message. When this happens to a service, the state that it was in prior to going into dependency failure is restored.

The restored state will contain the added prefix “Restored status: ”, with the last check time matching the time when the service recovered from dependency failure.

The restored state will not include any performance data from the prior state.

However, if a passive check result comes in while the service is in dependency failure, then that state will overwrite the dependency failure and saved state, since the result is newer than both of these.

Note that if a host moves to another cluster while it has service checks in dependency failure, they will lose any stored state information and will be unable to restore in this manner until an active check next occurs, or a passive result is received.
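The recovery rules in this section can be summarised in a short sketch. The dictionary keys and the function name are hypothetical and chosen only for illustration; a saved state of None models the case where the stored state was lost, for example after a move to another cluster.

```python
def status_after_recovery(saved, recovered_at, passive_result=None):
    """Return the status a service shows once its parent has recovered.

    saved is the status stored before dependency failure (or None if lost).
    passive_result is a passive result received during dependency failure.
    """
    if passive_result is not None:
        # A newer passive result overwrites both the failure and the saved state.
        return passive_result
    if saved is None:
        # Nothing to restore until the next active or passive result arrives.
        return None
    return {
        "state": saved["state"],
        "output": "Restored status: " + saved["output"],
        "perfdata": "",              # prior performance data is not restored
        "last_check": recovered_at,  # time of recovery from dependency failure
    }
```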