Known Issues

An overview of the known issues in this release of Opsview Cloud

Overview

The following issues are known to exist in this release of Opsview Cloud:

OS specific

Ubuntu 20

Re-enabling TLS 1.0 and TLS 1.1

Ubuntu 20 ships by default with TLS 1.0 and TLS 1.1 disabled. This means you may get errors when any OpenSSL libraries try to connect to external services.

Ideally, the external service should be upgraded to support TLS 1.2, but if that is not possible, you can re-enable TLS 1.0 and TLS 1.1. Note that doing this reduces security.

To test the external service:

openssl s_client -connect SERVER:443 -tls1_2

This will fail if the external service does not support TLS 1.2.

To allow Ubuntu 20 to use TLS 1.0, edit /etc/ssl/openssl.cnf and add this at the top:

openssl_conf = openssl_configuration

Then add this at the bottom:

[openssl_configuration]
ssl_conf = ssl_configuration
[ssl_configuration]
system_default = tls_system_default
[tls_system_default]
MinProtocol = TLSv1
CipherString = DEFAULT:@SECLEVEL=1

Now check that connections to the external service work.
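
Before (or after) editing the system-wide /etc/ssl/openssl.cnf, you can sanity-check the fragment's syntax by loading it from a scratch file; the /tmp path below is arbitrary:

```shell
# Write the fragment to a scratch file and have OpenSSL load it;
# a syntax error in the fragment would be reported here.
cat > /tmp/tls-legacy-test.cnf <<'EOF'
openssl_conf = openssl_configuration

[openssl_configuration]
ssl_conf = ssl_configuration
[ssl_configuration]
system_default = tls_system_default
[tls_system_default]
MinProtocol = TLSv1
CipherString = DEFAULT:@SECLEVEL=1
EOF

# If the config parses, this prints the version banner as normal.
OPENSSL_CONF=/tmp/tls-legacy-test.cnf openssl version
```

This only checks that OpenSSL accepts the configuration; the openssl s_client test above remains the way to confirm the external service is actually reachable.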

Debian 8

During automated installation, "size of file" errors may occur, for example:

[E] apt-get: W: Size of file /var/lib/apt/lists/partial/downloads.opsview.com_opsview-commercial_6.5.4.202106072035_apt_dists_jessie_InRelease is not what the server reported 2038 138

Simply rerunning the installation command should resolve the issue and allow installation to proceed.

OS Generic

Upgrade/Installation

  • The opsview-deploy package must be upgraded before running opsview-deploy to upgrade an Opsview Cloud system.
  • Changing the flow collectors configuration in Opsview Cloud currently requires a manual restart of the flow-collector component for it to start working again.
  • At upgrade, the following are not preserved:
    • Downtime: we recommend that you cancel any downtime (either active or scheduled) before you upgrade/migrate. Scheduling new downtime will work fine.
    • Flapping status: the state from pre-upgrade/migration is not retained but if the host/service is still flapping, the next checks will set the status to a flapping status again.
    • Acknowledgements: at the end of an upgrade/migration, the first reload removes the acknowledgement state from hosts and services. Any further acknowledgement will work as usual.
  • If you use an HTTP proxy in your environment, the TimeSeries daemons may not be able to communicate. You can work around this by adding the environment variable export NO_PROXY=localhost,127.0.0.1 (note: upper case, not lower case) to the opsview user's .bashrc file.
  • Hosts and services in downtime will appear to stay in downtime even when it is cancelled. You can work around this issue by creating a new downtime, waiting until it starts, and then cancelling it, or by adding a downtime that lasts only five minutes and letting it expire naturally.
  • On rare occasions, opsview-messagequeue may fail to upgrade correctly when running opsview-deploy. See MessageQueue Troubleshooting for steps to resolve the issue.
  • The sync_monitoringscripts.yml playbook fails to execute whenever the SSH connection between the host where opsview-deploy is being run and the other instances relies on a user other than root and the private SSH key is only defined using the ansible_ssh_private_key_file property in opsview_deploy.yml. This happens because the underlying rsync command is not passed the private SSH key and thus fails to connect to the instances. To work around this issue, add matching entries to the root user's SSH configuration. Consider the following example:
# If you use ansible_ssh_private_key_file on the opsview_deploy.yml file

(...)
collector_clusters:
  cluster-A:
    collector_hosts:
      ip-172-31-9-216:
        ip: 172.31.9.216
        user: ec2-user  
        vars:
          ansible_ssh_private_key_file: /home/ec2-user/.ssh/ec2_key
      ip-172-31-5-98:
        ip: 172.31.5.98
        user: ec2-user  
        vars:
          ansible_ssh_private_key_file: /home/ec2-user/.ssh/ec2_key
(...)

# You need to add the following entries to /root/.ssh/config

Host ip-172-31-9-216 172.31.9.216
    User ec2-user
    IdentityFile /home/ec2-user/.ssh/ec2_key
Host ip-172-31-5-98 172.31.5.98
    User ec2-user
    IdentityFile /home/ec2-user/.ssh/ec2_key
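
The NO_PROXY workaround for the TimeSeries daemons listed above amounts to a single line appended to the opsview user's .bashrc. A minimal sketch, using a temporary file in place of the real .bashrc (whose location depends on the opsview user's home directory):

```shell
# Stand-in for the opsview user's .bashrc; on a real system, find the
# home directory first, e.g.: getent passwd opsview
BASHRC=$(mktemp)

# The variable must be upper case: NO_PROXY, not no_proxy.
echo 'export NO_PROXY=localhost,127.0.0.1' >> "$BASHRC"

# Show the line that was added.
grep '^export NO_PROXY=' "$BASHRC"
```

The new value takes effect the next time a shell sources the file; the TimeSeries daemons then bypass the proxy for local connections.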

Plugins

  • There is no automated mechanism in this release to synchronize scripts between the Opsview Cloud Orchestrator and Collector Clusters. A sync_monitoringscripts.yml deploy playbook is provided to fulfil this purpose but it must be run manually or from cron on a regular basis.
  • check_wmi_plus.pl may report errors relating to files within your /tmp/* directory because the ownership of these files needs to be updated to the opsview user. This is typically seen when upgrading from an earlier version of Opsview, as the nagios user previously ran this plugin.

Modules support

  • SMS Gateway is not available in this release. If you rely on this method, please contact Support.

Collectors and clusters

  • Despite the UI/API currently allowing it, you should not set parent/child relationships between the collectors themselves in any monitoring cluster; collectors do not have a dependency between each other and are considered equals.
  • When trying to Investigate a host, if you get an Opsview Web Exception error with the message "Caught exception in Opsview", this may indicate that Cluster monitoring for that host has failed and needs to be addressed.

Database changes

  • All database users created by Opsview use the mysql_native_password authentication plugin (for MySQL 8, the default is usually caching_sha2_password).
  • The nightly backups of the opsview and runtime databases are now based on the MySQL server’s preferred format, rather than a mysql40-compatible mode.
  • When using utf8mb4, the collation difference from latin1 means some rows may come back in a slightly different order (e.g., latin1 sorts check_snmp_weblogic_jmsqueuelength before check_snmp_weblogic_jsm_dests, whereas utf8mb4 sorts them the other way round).

REST API

  • REST API config/OBJECT list calls: the ordering of results when using MySQL 8 is not necessarily deterministic, so REST API calls may need to specify a subsort field. For example, for hosts, order=hostgroup.name is not sufficiently deterministic and should be order=hostgroup.name,id so that the results come back in a fixed order.
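
A full request URL with the deterministic subsort looks like this (the server name is a placeholder; the part that matters is the trailing ,id):

```shell
# Hypothetical Opsview server; only the "order" parameter is the point here.
BASE="https://opsview.example.com/rest/config/host"
QUERY="order=hostgroup.name,id"

echo "GET ${BASE}?${QUERY}"
```

Because id is unique, appending it as a subsort guarantees a total order even when several hosts share the same hostgroup.name.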

Other Issues

  • There is no option to set a new Home Page via the UI yet. For new installations, the Home Page is set to the Configuration > Navigator page.
  • Start and End Notifications for flapping states are not implemented in this release (when a Host or Service is flapping, all notifications will be suppressed).
  • Deploy cannot be used to update the database root password. Root user password changes should be made manually and the /opt/opsview/deploy/etc/user_secrets.yml file updated with the correct password.
  • When a Host has been configured with 2 or more Parents and all of them are DOWN, the Status of the Service Checks on the host is set to CRITICAL instead of UNKNOWN. Consequently, the Status Information is not accurate either.
  • If an Opsview Cloud is configured to have UDP logging enabled in rsyslog, RabbitMQ will log INFO-level messages to opsview.log and syslog at high frequency - approximately one message every 20 seconds.
  • Some components such as opsview-web and opsview-executor can log credential information when in Debug mode.
  • When running an Autodiscovery Scan via a cluster for the first time there must be at least one host already being monitored by that cluster. If the cluster does not monitor at least one host, the scan may fail with this message: "Cannot start scan because monitoring server is deactivated".
  • When running an Autodiscovery Scan for the first time after an upgrade, it may fail to begin and remain in the Pending state. To resolve this, simply restart the opsview-autodiscoverymanager component on the Opsview Master Server (orchestrator). After the component has restarted successfully, the scan will start.
  • You may get occasional errors appearing in syslog, such as:
Nov 28 16:31:50 production.opsview.com opsview-datastore[<0.6301.0>] req_err(2525593956) unknown_error : normal#012 
   [<<"chttpd:catch_error/3 L353">>,<<"chttpd:handle_req_after_auth/2 L319">>,<<"chttpd:process_request/1 L300">>,
   <<"chttpd:handle_request_int/1 L240">>,<<"mochiweb_http:headers/6 L124">>,<<"proc_lib:init_p_do_apply/3 L247">>]
  
You can ignore these as there is no operational impact.
  • In order to get SNMP Traps working in a hardened environment, the following settings need to be changed:
# Add the following lines to /etc/hosts.allow
 
snmpd:ALL
snmptrapd:ALL
 
# Add the following lines to /etc/hosts.deny
 
snmpd: ALL: allow
snmptrapd: ALL: allow
  • Using Delete All on the SNMP Traps Exceptions page may sometimes hide new exceptions as they come in. They can be viewed again by changing the 'Page Size' at the bottom of the window to a different number.
  • CPU utilization is sometimes high due to the datastore.

AutoMonitor

  • When an AutoMonitor Windows Express Scan is set with a wrong, but still reachable, Active Directory Server IP or FQDN, the scan could remain in a "pending" state until it times out (default: 1 hour). This means that no other scans can run on the same cluster during that period. This is due to PowerShell not timing out correctly.
  • Automonitor automatically creates the Host Groups used for the scan: Opsview > Automonitor > Windows Express Scan > Domain. If any of these Host Groups already exist elsewhere in Opsview Cloud, then the scan will fail. If one of the Host Groups is moved then it should be renamed to avoid this problem.
  • If you have renamed your Opsview top-level Host Group, the Automonitor scan will currently fail. You will need to rename it back, or create a new Opsview Host Group, for the scan to be successful.
  • The Automonitor application clears local storage on logout. This means that if a scan is in progress when a user logs out, they will not see that scan's progress after logging back in, even if it is still running in the background.
  • Any services already in dependency failure before upgrading to this release will not return to their previous state when leaving dependency failure, since that state will not have been saved. They will remain down until the next check occurs, as per the existing behaviour. However, any services that go into dependency failure after the upgrade has completed will follow the new recovery behaviour, as documented in Important Concepts.

Opspacks

  • Due to changes made to the Windows Active Directory Opspack, Windows hosts must now have PowerShell version 5.0 or higher.
  • Due to the same Active Directory Opspack changes, setup-opsview.yml must be re-run to import the new Opspack plugin changes.
    • A reload must also be carried out afterwards to propagate the argument changes through to the collection plan for the Scheduler(s).
  • Windows Active Directory Opspack checks may increase CPU usage on the target Windows servers when running checks
  • Windows WMI - Base Agentless - LAN Status Servicecheck: utilization values for network adapter byte send/receive rates are around 8 times lower than expected. Therefore, Warning and Critical thresholds should be adjusted accordingly as a workaround. See Plugin Change Log.
  • Cloud - AWS related Opspacks: The directory /opt/opsview/monitoringscripts/etc/plugins/cloud-aws, which is the default location for aws_credentials.cfg file, is not created automatically by Opsview. Therefore, it needs to be created manually.
  • If opsview_tls_enabled is set to false, the Cache Manager component used by Application - Kubernetes and OS - VMware vSphere Opspacks will not work correctly on distributed environments
  • 'Hardware - Cisco UCS': if migrating this Opspack from an Opsview v5.x system, it may produce the error Error while trying to read configuration file or File "./check_cisco_ucs_nagios", line 25, in <module> from UcsSdk import * ImportError: No module named UcsSdk.
    If this is seen, running the following will resolve the issue:
# as root
wget https://community.cisco.com/kxiwq67737/attachments/kxiwq67737/4354j-docs-cisco-dev-ucs-integ/862/1/UcsSdk-0.8.3.tar.gz
 
tar zxfv UcsSdk-0.8.3.tar.gz
cd UcsSdk-0.8.3
sudo python setup.py install

Place the config file 'cisco_ucs_nagios.cfg' into the plugins path /opt/opsview/monitoringscripts/plugins/.

  • Opsview - Login is critical on a rehomed system. Resolve this by adding an exception to the Servicecheck on the Host specifying /opsview/login as the destination rather than /login.

Unicode Support

  • While inputting non-UTF-8 characters into Opsview Cloud will not cause any problems, the rendering of those characters in the user interface may be altered in places such as free-text comments.

SNMP Traps

SNMPTraps daemons are started on all nodes within a cluster. At start-up a 'master SNMP trap node' is selected and is the only one in a cluster to receive and process traps. Other nodes silently drop traps.

Most SNMP trap-sending devices can send to at most two different destinations.

The current (6.3) fix is to manually pick two nodes in a given cluster to act as the SNMP trap node and standby node, then mark all other nodes within the cluster to not have the trap daemons installed, for example:

collector_clusters:
  Trap Cluster:
    collector_hosts:
      traptest-col01: { ip: 192.168.18.53,  ssh_user: centos }
      traptest-col02: { ip: 192.168.18.157, ssh_user: centos }
      traptest-col03: { ip: 192.168.18.155, ssh_user: centos, vars: { opsview_collector_enable_snmp: False } }
      traptest-col04: { ip: 192.168.18.61,  ssh_user: centos, vars: { opsview_collector_enable_snmp: False } }
      traptest-col05:
        ip: 192.168.18.61
        ssh_user: centos
        vars:
          opsview_collector_enable_snmp: False

On a fresh installation the daemons will not be installed.

On an existing installation, the trap packages must be removed and the trap daemons on the two active nodes restarted to re-elect the master trap node:

# INACTIVE NODES:
CentOS/RHEL: yum remove opsview-snmptraps-base opsview-snmptraps-collector
Ubuntu/Debian: apt-get remove opsview-snmptraps-base opsview-snmptraps-collector

# ACTIVE NODES:
/opt/opsview/watchdog/bin/opsview-monit restart opsview-snmptrapscollector
/opt/opsview/watchdog/bin/opsview-monit restart opsview-snmptraps
Undefined subroutine

Undefined subroutine &Opsview::Utils::SnmpInterfaces::Helper::pp called at 
/opt/opsview/monitoringscripts/lib/Opsview/Utils/SnmpInterfaces/Helper.pm line 270.

This error is seen if a large number of devices are being polled. The current fix is to add the line below to the /opt/opsview/monitoringscripts/lib/Opsview/Utils/SnmpInterfaces/Helper.pm file on all your Opsview servers:

use Data::Dump qw( pp );
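
One way to script that edit is shown below; the sketch operates on a stand-in copy rather than the live Helper.pm, and the insertion point immediately after the package declaration is an assumption about the file's layout:

```shell
# Stand-in for the real file at
# /opt/opsview/monitoringscripts/lib/Opsview/Utils/SnmpInterfaces/Helper.pm
PM=$(mktemp)
printf 'package Opsview::Utils::SnmpInterfaces::Helper;\nuse strict;\n' > "$PM"

# Insert the missing import right after the package line (GNU sed syntax).
sed -i '/^package Opsview::Utils::SnmpInterfaces::Helper;/a use Data::Dump qw( pp );' "$PM"

# Show the inserted line.
grep 'Data::Dump' "$PM"
```

Note that the Data::Dump module must be installed on the server for the added import, and therefore the fix, to take effect.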