See www.zabbix.com for the official Zabbix site.
Troubleshooting
Contents
- 1 Compilation
- 2 AIX
- 3 Solaris
- 4 General
- 5 Escalation doesn't seem to stop
- 6 I'm not notified, though I should be
- 7 Help, my history_uint table is 300 GB and 2.2 billion records big! (PostgreSQL)
- 8 Zabbix says "Agent unreachable", but the host is up. How do I debug that?
- 8.1 What's the trigger expression like?
- 8.2 What's the argument to nodata()?
- 8.3 Did you install the agent? Is it running?
- 8.4 What's the item type? Are you getting data for the same type?
- 8.5 Agent hostname setting matches the Zabbix host in question?
- 8.6 TCP connection to server port? Basic operation?
- 8.7 TCP connection to agent port? zabbix_get?
- 8.8 Are you using a proxy for that host?
- 8.9 Are you zabbix_getting or monitoring by a domain name?
- 8.10 Using a domain name as "Servername"?
- 8.11 Does name resolution work on the monitored host?
- 8.12 Does name resolution work on the proxy/server?
- 8.13 Any long-running items?
- 8.14 Enough workers on the server/proxy side?
- 8.15 Agent debug level 4, dump and review traffic bilaterally, strace
- 9 Unstoppable Alerts
- 10 Some checks on Windows are not working while others are
- 11 SELinux
- 12 Trend data is missing
- 13 Frontend
Compilation
Java gateway compilation may fail with messages like these:
src/com/zabbix/gateway/BinaryProtocolSpeaker.java:48: error: ZabbixException cannot be resolved to a type public String getRequest() throws IOException, ZabbixException
src/com/zabbix/gateway/ZabbixException.java:24: warning: The serializable class ZabbixException does not declare a static final serialVersionUID field of type long class ZabbixException extends Exception
It is likely that gcc is being used to compile Java, which Zabbix does not support. According to http://www.linuxfromscratch.org/blfs/view/svn/general/gcc-java.html, "since the release of OpenJDK, the development of GCC-Java has almost stopped". The latest news on https://gcc.gnu.org/java/ is from 2009. It is highly unlikely that support for gcc-java will be added to Zabbix, so using OpenJDK is suggested.
AIX
- Getting response "[12] Not enough space" on AIX host for system.hostname and system.uname; sshd is running, SSH connections fail, though a listener is present; Other items successfully receive data. system.users.num is disabled, as it expects an integer.
The system might have run out of paging space. Also see http://www.ibm.com/developerworks/forums/thread.jspa?threadID=146853
Solaris
Gettext
Solaris expects the names of the locale folders to contain the used charset, for example "en_US.UTF-8" instead of "en_US". To make gettext work create a symlink from "en_US.UTF-8" to "en_US" (for example create symlink frontends/php/locale/ja_JP.UTF-8 pointing to frontends/php/locale/ja for Japanese translations to work).
General
- Broken NFS mounts can render the agent unusable after doing FS discovery, as the agent processes are thrown into uninterruptible sleep. You'll notice that df hangs as well.
- Setting too broad a filter on the Monitoring --> Latest Data page can cause that page to stop working. To fix, browse to http://yourserver.fq.dn/zabbix/latest.php?filter_rst=1
Escalation doesn't seem to stop
Symptom: Events that went to "OK" state long ago still seem to originate notifications. The action audit log confirms this.
There seems to be a race condition causing escalations to get stuck. You can look up the situation in the database in the following tables: escalations, actions, events and triggers.
Workaround: It seems to be enough to disable and re-enable the trigger connected to the stuck event. Alternatively, deleting the relevant records from the escalations table works too.
Ticket: ZBX-8175
I'm not notified, though I should be
- For pre-3.0, check Administration -> Audit and select Actions in the drop-down in the upper right corner of the screen
- For 3.0, check Reports -> Action log
- Read this wiki page on what can go wrong
- Consider patching in the action simulator, available for 2.0, 2.2 and 3.0
If you don't find anything in the audit log, the problem is within your configuration. Zabbix 2.4 and later show more useful information there than previous releases, namely the responsible action and user details in addition to the media details. Patches are available for 2.0 and 2.2, though. Alternatively, you can query that information from the database. The actionid is just another field in the alerts table.
The action simulator will help you with your configuration, but it will not send actual messages, so test your media scripts separately. It also has no sense of actual maintenance states.
Help, my history_uint table is 300 GB and 2.2 billion records big! (PostgreSQL)
Partitioning is the solution, but it's extremely difficult to partition a table that large after the fact, particularly while Zabbix is still running on it. I gave up trying and came up with something very simple that works for me:
- Stop the Zabbix server
- Rename the history_uint table to history_uint_historic; Rename the indices too
- Create the original table and indices
- Set up partitioning
- Let history_uint_historic inherit from history_uint
- Start the Zabbix server
- Check that new data is going to the right partition
This way all the data remains available while new data is partitioned. Just drop history_uint_historic at some point.
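The rename and inherit steps above can be sketched as a small Python helper that only prints PostgreSQL statements for review; it does not touch the database. The index name history_uint_1 and the _historic suffix are assumptions for illustration, so check the real index names (for example with \d history_uint in psql) first. Recreating the table and setting up partitioning depend on your Zabbix version and partitioning method, so they appear only as a comment line.

```python
def historic_swap_sql(table="history_uint",
                      indexes=("history_uint_1",),   # assumed index name, verify it
                      suffix="_historic"):
    """Return SQL statements that park a huge table as an inherited child."""
    old = table + suffix
    statements = [f"ALTER TABLE {table} RENAME TO {old};"]
    for index in indexes:
        statements.append(f"ALTER INDEX {index} RENAME TO {index}{suffix};")
    # The fresh table must exist (with matching columns) before INHERIT works.
    statements.append(
        f"-- recreate {table} and its indices from the Zabbix schema, "
        "then set up partitioning"
    )
    statements.append(f"ALTER TABLE {old} INHERIT {table};")
    return statements


# Print the statements; run them manually while the Zabbix server is stopped.
for statement in historic_swap_sql():
    print(statement)
```

Once the old data has aged out, history_uint_historic can be dropped as described above.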
If you were unfortunate enough to originally upgrade from a 1.8 DB, your tables were created with OIDs. Getting rid of them requires rewriting the tables.
Zabbix says "Agent unreachable", but the host is up. How do I debug that?
File:Zabbix agent troubleshoot.dot
Please read the comments below on how to answer the questions in this decision tree! Follow the second tree if you end up on "Something else is wrong here"! Note: This work is not finished yet.
What's the trigger expression like?
Go to "Configuration/Hosts", navigate to the host in question and find the trigger. Alternatively, use the search to reach this place.
What's the argument to nodata()?
nodata() and other time-related functions are only evaluated every 30 seconds or when new data arrives. If you poll the item every two minutes, nodata(1m) doesn't make any sense. Even nodata(2m) is dubious, because a slight delay or a single missed value would make the trigger fire far too easily. See the documentation for further information!
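As a rough sanity check, the reasoning above can be turned into a rule of thumb. The sketch below is not a Zabbix rule: the formula (tolerate one missed value, then add the 30-second evaluation step) and the function name are assumptions chosen to match the two examples in the text.

```python
def nodata_period_is_safe(update_interval_s, nodata_period_s,
                          missed_values_tolerated=1, evaluation_step_s=30):
    """Return True if the nodata() period leaves room for late or missed values.

    Assumed rule of thumb: the period should cover the polls we are willing
    to miss plus one more interval, plus one evaluation step of the
    time-based trigger functions.
    """
    minimum = update_interval_s * (missed_values_tolerated + 1) + evaluation_step_s
    return nodata_period_s >= minimum


# Item polled every two minutes:
print(nodata_period_is_safe(120, 60))    # nodata(1m)
print(nodata_period_is_safe(120, 120))   # nodata(2m)
print(nodata_period_is_safe(120, 300))   # nodata(5m)
```

With a two-minute interval, this flags nodata(1m) and nodata(2m) as too short and accepts nodata(5m).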
Did you install the agent? Is it running?
There are a lot of ways to install or run the agent, depending on the operating system and/or distribution. At the end of the day, a passive agent needs a listener on a TCP port on some reachable interface. The following is intended for UNIX-like systems: Check with netstat -lntp! If the listener is not there but you can see the agent running in your process table, you probably configured it that way. Check the agent log for start-up details. If there is no agent log, there's a chance the PID file couldn't be created.
What's the item type? Are you getting data for the same type?
Investigate the item used in the trigger the same way you examined the trigger. Agent items are either "Zabbix agent" (=passive) or "Zabbix agent (active)" (=active). The difference is in who's creating the connection between server and agent. Make sure active/passive workers are actually present!
Agent hostname setting matches the Zabbix host in question?
Active agents rely on knowing the technical host name as defined in the frontend. It is not tied to the domain name. Also, don't confuse it with the "visible name"! This is only relevant for active agent items.
TCP connection to server port? Basic operation?
Use nc or telnet to connect to the server's or proxy's listening port, commonly TCP 10051.
If no connection is made and the agent is really running, you've got a networking problem to solve: routing, host or network firewalls, possibly security suites.
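If neither nc nor telnet is installed on the host, the same connection test can be done with Python's standard library. This is a minimal sketch; the function name is made up, and the host name in the usage comment is a placeholder for your own server or proxy.

```python
import socket
import time


def check_tcp_port(host, port, timeout=3.0):
    """Try one TCP connection; return (reachable, seconds_elapsed, error_text)."""
    started = time.monotonic()
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started, ""
    except OSError as exc:  # refused, timed out, unroutable, name not resolved
        return False, time.monotonic() - started, str(exc)


# Usage (placeholder host name):
#   print(check_tcp_port("zabbix.example.com", 10051))
```

A quick "connection refused" points at a missing listener or a rejecting firewall, while a failure after the full timeout usually means packets are being dropped silently.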
TCP connection to agent port? zabbix_get?
Use nc or telnet to connect to the agent's listening port, commonly TCP 10050. If the handshake succeeds but the connection is immediately reset, the agent doesn't like the IP address the connection is coming from. Review the Server setting in the agent configuration file. If you have IPv6 enabled, be aware that "localhost" does not necessarily mean 127.0.0.1! If you're using a domain name there, name resolution must be working on the monitored host. If the connection succeeds, it will stay open for as many seconds as defined by Timeout in the agent configuration, 3 seconds by default. If you're quick, you can type in a key and will get a response. Make sure you're actually using the source IP address you think you're using!
zabbix_get is a more comfortable way to achieve the same. See the man page! Take a look at how long it takes until you get your response. Different reasons may lead to delays or misses. zabbix_get invocations for agent.ping should only take a couple of milliseconds.
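To measure the response time without zabbix_get, a passive check can be imitated with a few lines of Python. This sketch assumes the agent accepts the plain-text request form (item key followed by a newline) and closes the connection after replying; newer agents may expect a protocol header on the request as well. The function name is made up for illustration.

```python
import socket
import time


def timed_passive_check(host, key, port=10050, timeout=3.0):
    """Send one item key to a passive agent; return (raw_reply, seconds_elapsed)."""
    started = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # Assumption: plain-text request, key terminated by a newline.
        conn.sendall(key.encode("utf-8") + b"\n")
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:          # agent closed the connection: reply complete
                break
            chunks.append(chunk)
    return b"".join(chunks), time.monotonic() - started


# Usage (run from the server or proxy, so the source IP matches "Server"):
#   raw, seconds = timed_passive_check("192.0.2.10", "agent.ping")
#   print(raw, f"{seconds * 1000:.1f} ms")
```

An empty reply means the agent closed the connection without answering, which is what happens when the source address is not allowed by the Server setting.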
Are you using a proxy for that host?
Go to "Configuration/Hosts" and click the host name. That opens a form with a drop-down for "Monitored by proxy". If it says anything other than "no proxy", your host is monitored by a proxy and not by the server directly. Make sure the proxy is set up properly and working. Also keep in mind that configuration changes aren't propagated to the proxy immediately. The mode and interval of configuration updates differ between active and passive proxies.
Are you zabbix_getting or monitoring by a domain name?
Check the interfaces of your host on "Configuration/Hosts" to see if you're using a name or an IP address. Items can also use names or IP addresses in their keys! zabbix_get allows the use of either an IP address or a host name.
Using a domain name as "Servername"?
Check your agent configuration file whether you're using a domain name on "Server" (or ServerActive).
Does name resolution work on the monitored host?
As IP has no concept of names, the agent has to perform a pointer look-up in order to decide whether to keep the connection or drop it. If name resolution fails, so does the item processing. The agent only runs glibc's res_init() once, at start-up. That means that if you change your nsswitch.conf or resolv.conf, the agent won't notice. Round-robin won't save you from primary name server problems either, because the default name resolution timeout is too high for common Zabbix timeout settings, and the look-up happens on every single request. Try running a loop with zabbix_get, host or dig!
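Such a loop can also be written in Python, which makes it easy to time each look-up against the Zabbix timeout. This is a sketch only: the function names are made up, the 3-second limit mirrors the default agent Timeout mentioned earlier, and socket.getaddrinfo goes through the system resolver, so results may differ from what dig reports.

```python
import socket
import time


def time_lookups(name, attempts=5, resolver=socket.getaddrinfo):
    """Resolve a name repeatedly; return a list of (succeeded, seconds) tuples."""
    results = []
    for _ in range(attempts):
        started = time.monotonic()
        try:
            resolver(name, None)
            succeeded = True
        except OSError:            # socket.gaierror is a subclass of OSError
            succeeded = False
        results.append((succeeded, time.monotonic() - started))
    return results


def slow_or_failed(results, limit_s=3.0):
    """Keep only the look-ups that failed or took longer than the limit."""
    return [r for r in results if not r[0] or r[1] > limit_s]


# Usage (placeholder name):
#   print(slow_or_failed(time_lookups("agent-host.example.com", attempts=20)))
```

Run it on the monitored host for the server's name and on the server or proxy for the agent's name; any entry in the filtered list is a look-up that could have cost you a value.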
Does name resolution work on the proxy/server?
Proxy and server behave just as explained above for the agent when it comes to name resolution initialisation. That said, it might be worth running some sort of local DNS cache to relieve the name servers. Depending on what you expect, this is surprisingly difficult. nscd and dnsmasq only cache results for particular DNS record types and name resolution functions. A local caching BIND server is the most complete solution.
Any long-running items?
If you're keeping your few agent workers busy, they may not find the time to process all requests. While you can crank up Timeout settings on server/proxy and agent side, carefully consider what you're doing there! Server/proxy settings are valid for all hosts they monitor, so that's particularly critical. Instead, don't create long-running UserParameters in the first place. If something needs time for processing, either pre-process it and just parse the results later or even use zabbix_sender to bulk-submit the data.
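For the bulk-submit approach, a long-running job can write its results into a file that zabbix_sender reads with -i. The sketch below assumes the simple whitespace-delimited "host key value" line format and deliberately rejects fields containing whitespace or quotes, because the quoting rules are not covered here; check the zabbix_sender man page for those. The host and key names are placeholders.

```python
def sender_line(host, key, value):
    """Format one 'host key value' line for a zabbix_sender input file."""
    fields = [str(host), str(key), str(value)]
    for field in fields:
        if not field or '"' in field or any(ch.isspace() for ch in field):
            raise ValueError(f"field needs quoting, not handled here: {field!r}")
    return " ".join(fields)


def sender_input(host, values):
    """Build the whole input file content from a {key: value} mapping."""
    return "".join(sender_line(host, k, v) + "\n" for k, v in values.items())


# Usage (placeholder names); the items must exist as trapper items:
#   with open("/tmp/batch.txt", "w") as handle:
#       handle.write(sender_input("web01", {"app.queue.length": 12}))
#   then: zabbix_sender -z <server> -i /tmp/batch.txt
print(sender_input("web01", {"app.queue.length": 12, "app.errors": 0}), end="")
```

This way the expensive work runs on its own schedule and no agent worker is blocked while it computes.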
Enough workers on the server/proxy side?
Depending on the item type, pollers or trappers are responsible for requesting/receiving data. Some items have specialized poller and trapper processes. Some are not enabled by default. Review your server or proxy configuration if in doubt. For the server side, there are so-called internal items.
Zabbix 2.2 introduces these metrics for proxies as well.
https://www.zabbix.com/documentation/2.2/manual/config/items/itemtypes/internal
No metrics for the agent are available yet.
Agent debug level 4, dump and review traffic bilaterally, strace
Men at work
Debug level 4
Most of the time, this should be enough to eventually recognize what's troubling you:
- Timeouts
- Permission issues
- Name resolution issues
- Misconfiguration
- Connectivity issues
TODO: Routing/pointer record
How can I debug network-level problems?
To better understand how Zabbix communication works, first review the protocols.
The table below illustrates what communication usually looks like. This should give you a good foundation to recognize deviating behaviour. If the below is all gibberish to you, I recommend reading up on TCP/IP. I enjoyed Douglas E. Comer's Internetworking with TCP/IP, Vol. 1.
The states in the table are those at the moment each TCP segment goes on the wire. We also assume that all segments actually reach their respective targets. You can only be sure of that if you dump the traffic on both sides of the connection at the same time. Use tcpdump, copy the dump files and review them on your desktop with Wireshark later. Use pcap filters with tcpdump to keep dumps small. See man pcap-filter!
Passive agent communication usually goes like this:
Number | Connection state agent | Connection state server | Direction | TCP flags | Purpose of TCP segment |
---|---|---|---|---|---|
1 | LISTEN | SYN_SENT | Agent<-Server | SYN | Initiate a TCP connection -- first step of a TCP handshake |
2 | SYN_RECVD | SYN_SENT | Agent->Server | SYN, ACK | Accept the connection |
3 | SYN_RECVD | ESTABLISHED | Agent<-Server | ACK | Connection is established |
4 | ESTABLISHED | ESTABLISHED | Agent<-Server | PSH, ACK | Zabbix item key is submitted |
5 | ESTABLISHED | ESTABLISHED | Agent->Server | ACK | Agent acknowledges the receipt |
6 | ESTABLISHED | ESTABLISHED | Agent->Server | PSH, ACK | The result is sent |
7 | FIN_WAIT_1 | ESTABLISHED | Agent->Server | FIN, PSH, ACK | The agent tears down the connection as it has no further data to send. |
8 | FIN_WAIT_1 | CLOSE_WAIT | Agent<-Server | ACK | |
9 | FIN_WAIT_2 | LAST_ACK | Agent<-Server | FIN, ACK | |
10 | TIME_WAIT | LAST_ACK | Agent->Server | ACK | The connection is now completely closed |
11 | CLOSED | CLOSED | - | - | Eventually the connection state is CLOSED on both sides. See footnote! |
- 1: TCP connections are unique by a tuple of address:port--address:port
- 2: If the SYN/ACK doesn't follow, the SYN is re-transmitted. Depending on TCP stack timeout settings and Zabbix configuration, it will be too late by then. The same principle applies for other datagrams too. That means, with default timeout settings, if you lose a TCP segment, odds are the connection is torn down (RST) and the attempt to gather data will be a failure.
- 3: At this point the connection is full-duplex and working.
- 4: The Push flag is set to transmit the datagram immediately, even though not all of its capacity is used. There just is nothing more to transmit and we want it now.
- 7: Nothing else to do; closing time! The fact that the agent is closing the connection leaves its side in the TIME_WAIT state afterwards. Take a look at the TCP finite state machine for details! Contrary to the initial handshake, 4 segments instead of 3 are involved when properly closing a TCP connection.
- 8: The datagram with the FIN flag is immediately acknowledged. Just then the application (Zabbix server in this case) is informed about the closing.
- 9: As soon as the application has handled this event, it sends its FIN segment too.
- 10: And #9 is immediately acknowledged too. This is the end of a usable connection.
- 11: There's no 11th datagram. This row is there to illustrate the eventual state changes. While the server side will immediately go to CLOSED after it has processed the datagram from row 10, the agent side will hang in TIME_WAIT for about 2 minutes. That means this particular connection (the same address:port pair on both ends) is not usable for that period.
The payload of this communication is only something like:
system.cpu.load[percpu,avg5] ZBXD.........0.022917
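The dots in the reply above are non-printable header bytes. Assuming the usual layout of the Zabbix header (the four characters ZBXD, one flag byte, then the data length as an 8-byte little-endian integer), a captured reply can be decoded as sketched below. The function names are made up, and compressed or otherwise flagged packets from newer versions are not handled.

```python
import struct

MAGIC = b"ZBXD"


def encode_reply(value, flags=0x01):
    """Build a reply frame: magic, flag byte, 8-byte length, then the data."""
    data = value.encode("utf-8")
    return MAGIC + bytes([flags]) + struct.pack("<Q", len(data)) + data


def decode_reply(raw):
    """Split a reply frame into (flags, value); raise ValueError if malformed."""
    if len(raw) < 13 or raw[:4] != MAGIC:
        raise ValueError("no ZBXD header found")
    flags = raw[4]
    (length,) = struct.unpack("<Q", raw[5:13])
    data = raw[13:13 + length]
    if len(data) != length:
        raise ValueError("reply is shorter than its header claims")
    return flags, data.decode("utf-8")


frame = encode_reply("0.022917")
print(len(frame) - len("0.022917"))   # size of the header in bytes
print(decode_reply(frame))
```

A reply that is cut off in a traffic dump fails the length check, which is a quick way to tell a truncated transfer from a complete one.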
TODO: What happens if the timeout appears while the connection is already torn down?
Real-world network problems encountered
Tale of the randomly missing datagrams
As mentioned above, a single missing datagram is often enough to cause the connection to be torn down, due to a Zabbix timeout. If you tcpdump on the agent and server side at the same time, you've got all the data you need. Did one side really send the datagram and send it in a timely manner? Did the other side receive it and when? If you see any re-transmitted datagrams, something went wrong. It's not like your network is broken just because a datagram goes missing or arrives late or in a different order than it was sent in. That's exactly why we have TCP: TCP is there to create a reliable connection across the non-reliable and connection-less network an IP network is by its nature. If you encounter loss once in a while, it's no reason to worry. Of course the number to expect relies on the type and quality of the underlying network. However, be alarmed if loss rates are high. If it happens in a very random fashion, it can be really hard to spot.
- Symptom: A fair rate of lost datagrams on connections to various hosts
- Cause: A hardware firewall with either buggy firmware or some fault
Ephemeral issues
To be done
Unstoppable Alerts
Occasionally, Zabbix may start sending alerts uncontrollably due to a misconfiguration or other alerting issue. This results in an alerter process that is 100% busy while the alerts are going out.
This has occurred when a log file spanning years was suddenly re-read and millions of alerts were generated. After a while, these alerts did not show up anywhere except as spam at the email address they were destined for.
There are a few ways to stop this when it occurs, but they all require that the alerts be sent through an external script:
- chmod -x the external script to stop the script from running
- Rename or move the script altogether to stop the script from running
Some other possibilities follow; however, these may not work if the alert has already been queued at the alerter process:
- Disabling triggers
- Disabling actions
- Disabling media types
There is no obvious way of stopping this when the emails are being sent through the internal mail handler.
Silencing generated alerts
If Zabbix has generated a lot of alerts to send and is sending out old alerts for a long time, you can check how many alerts it is expecting to send (ones already saved in the database):
select count(*) from alerts where status=0 and alerttype=0;
Then set all the unsent alerts to failed:
update alerts set status=2 where status=0 and alerttype=0;
This marks all the unsent alerts (status 0) as failed (status 2).
Some checks on Windows are not working while others are
Symptoms, any of:
- Windows box with Zabbix agent (1.8 or higher) not returning appropriate value whether from active/passive checks or zabbix_get.
- I'm getting "ZBX_NOTSUPPORTED: Cannot obtain system information." on a host that accepts the proper item check.
- zabbix agent log: check_counter_path(): cannot make counterpath
- zabbix agent log: A required argument is missing or not correct.... active check "perf_counter[....]" is not supported
Reason: Some of the performance monitor counters have gone corrupt.
Solution:
Change into the C:\WINDOWS\System32 directory by typing CD\ and then CD Windows\System32, and run the following command:
lodctr /r
Restart the zabbix agent and check again using zabbix_get.
SELinux
RHEL 6.6
zabbix22 from EPEL
Symptom: Proxy or server doesn't start
Reason: Proxy and Server now log to /var/log/zabbixsrv
Solution:
semanage fcontext -a -t zabbix_log_t '/var/log/zabbixsrv(/.*)?'
restorecon -r /var/log/zabbixsrv
Trend data is missing
With this hacky script you can rebuild trend data (only for trends_uint) on MySQL:
<?php
$itemId = 25590; // itemid to rebuild

// Connect to the database; stop on failure so nothing is deleted by accident.
$mysqli = new mysqli("localhost", "username", "password", "database");
if ($mysqli->connect_errno) {
    echo "Failed to connect to MySQL: (" . $mysqli->connect_errno . ") " . $mysqli->connect_error;
    exit(1);
}

// Delete the existing trend rows for this item.
$mysqli->query("DELETE FROM trends_uint WHERE itemid = " . (int)$itemId);

// Aggregate history into hourly buckets. Each trend row is keyed by the
// start of its hour (epoch seconds) and stores the number of values in it.
$query = "SELECT clock - (clock % 3600) AS hour_clock, COUNT(*) AS num,"
       . " MIN(value) AS value_min, FLOOR(AVG(value)) AS value_avg, MAX(value) AS value_max"
       . " FROM history_uint WHERE itemid = " . (int)$itemId
       . " GROUP BY hour_clock";
if ($result = $mysqli->query($query)) {
    while ($row = $result->fetch_assoc()) {
        $insert = "INSERT INTO trends_uint (itemid, clock, num, value_min, value_avg, value_max) VALUES ("
                . (int)$itemId . ", " . $row['hour_clock'] . ", " . $row['num'] . ", "
                . $row['value_min'] . ", " . $row['value_avg'] . ", " . $row['value_max'] . ")";
        $mysqli->query($insert);
    }
    $result->free();
}
$mysqli->close();
Frontend
Frontend failing with "Error in query" might be caused by problems with the sessions table, like the table missing completely.
MySQL case sensitivity issues
If searches are case insensitive, or renaming an object with only case changes fails, the database has most likely not been created properly: the collation was not set to the correct value. To fix this, the following can be used:
ALTER DATABASE zabbix DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
for table in $(echo "show tables;" | mysql -N zabbix); do
echo "ALTER TABLE $table CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;" | mysql -N zabbix
done
Stuck at "You are not logged in"
It is possible to get stuck trying to log in to the Zabbix web UI with a "You are not logged in" message saying "You must log in to view this page". If you are unable to reach the Zabbix web log-in form to enter your user name and password and you are using php-fpm and/or nginx, the cause could be the PATH_INFO setting. You may want to try commenting out (by prepending #) the lines in your web server configuration that set PATH_INFO, such as those in /etc/nginx/snippets/fastcgi-php.conf, for example:
#set $path_info $fastcgi_path_info;
#fastcgi_param PATH_INFO $path_info;
You will subsequently need to reload or restart your web server software in order for any configuration change to take effect. However, do take steps to ensure that this does not negatively impact other websites that you may be hosting.