Hello all,
We have a strange issue: some graphs of interface traffic statistics intermittently show larger or smaller gaps. Our first observations suggested it is related to graphs or data sources that are part of an aggregate. We have some aggregates that were originally built from Ethernet bundle interfaces across several devices. We noticed the problem because the aggregates suddenly showed huge traffic drops. The drops did not go down to zero, so it looked like traffic from some devices was missing.
Traffic statistics are collected for both the bundle interfaces and their member interfaces. During the investigation we found that whenever a traffic drop occurs in the aggregate, the data for the bundle interfaces on some devices is missing. At the same time, the member interfaces showed no issues. So in general there is no problem collecting data from the devices (SNMP timeouts or the like), and we suspected an issue with the statistics of the bundle interfaces. We changed the aggregate to use the member interfaces instead of the bundle interfaces. At that point the situation flipped: data collection for the bundle interfaces was fine with no more gaps, but the member interfaces started showing the gaps.
So the situation for the aggregates is currently unchanged: there are larger or smaller traffic drops that correspond to gaps on individual interfaces of the devices that are part of the aggregate. In general, it looks worse during business hours. Outside business hours and on weekends the graphs usually look good.
Further investigation shows that whenever we see gaps in individual graphs (or traffic drops in an aggregate), the RRD files for those graphs are not being updated. Updates should happen about every 5 minutes, but sometimes they stop. This is clearly why we see the gaps and traffic drops. The question now is: why are the RRD files not being updated?
We investigated further and found that the problem is somehow related to whether somebody has the aggregates open in their web browser, perhaps keeping the browser open (possibly in the background) with the graph page refreshing automatically every five minutes. I also saw the issue once on a single interface that was not part of an aggregate. The common factor was that I had the graph open in my browser (watching the traffic of that particular interface during a maintenance window, and doing the same the next day). So far, this has only been seen once.
Yesterday we also found that we sometimes see error messages in /var/log/messages generated by httpd saying "opening '<filename.rrd>': Permission denied". This does not happen all the time, but for example we see the message every 5 minutes when somebody has the graphs open in a background tab. Coincidentally (?), the RRD files named in the "Permission denied" messages are the ones not getting updated. About 5 minutes after logging out and closing the browser window, updating of the RRD files seems to start again (judging by the file date in the file system). Looking at the graphs again after that, we can see data again. The permissions of the files seem to be OK. The webserver is allowed to read the RRD files, and we can see the graphs with data which should be coming from those files.
We are talking about around 500 devices, 65000 graphs, and 70000 data sources. We have not checked all of them, but most of them look fine. Those described above are only a small part.
Has anybody seen a similar situation and been able to fix it?
Br
mwg
Gaps in some graphs, maybe related to aggregates
Re: Gaps in some graphs, maybe related to aggregates
That's a lot of text without some key information. What Cacti version? Are you using Boost? What permissions on the RRD? What user is the webserver running as? What user is the poller running as?
What I expect is happening is that you have Boost enabled and it is trying to write to the RRD when a person views it, but the webserver doesn't have permission to write to the file.
Re: Gaps in some graphs, maybe related to aggregates
Thanks. That pointed in the right direction. It looks like the file permissions are a little bit screwed up. After setting them correctly, it looks like it is working. Yes, Boost is enabled. Do I understand this correctly: if on-demand RRD updating is enabled and somebody opens a graph and just leaves it open, the RRD is no longer updated by the poller until the user leaves the graph? In other words, in this case the update is done only on demand (by a manual or automatic page refresh).
Re: Gaps in some graphs, maybe related to aggregates
With Boost, when a graph image is loaded, before it is created, Boost pulls the data from boost tables and inserts it into the RRD, then the graph is generated. If it can't write to the RRD, that data is lost because it will fail to write (actually we should probably check that first).
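To illustrate the "check first" idea mentioned above, here is a minimal sketch in Python. It is not Cacti or Boost code (Cacti is written in PHP); the function name `flush_pending` and the `write_update` callback are hypothetical names used only for this example. The point is simply that buffered updates are kept when the target file is not writable, instead of being dropped.

```python
import os


def flush_pending(rrd_path, pending, write_update):
    """Try to write buffered updates to rrd_path.

    Hypothetical helper for illustration only, not part of Cacti.
    Returns the list of updates that are still pending afterwards.
    """
    # Check writability before touching the buffer, so nothing is lost
    # when the current user (e.g. the webserver) may only read the file.
    if not os.access(rrd_path, os.W_OK):
        return list(pending)

    remaining = []
    for update in pending:
        try:
            write_update(rrd_path, update)
        except OSError:
            # Keep anything that failed so a later run can retry it.
            remaining.append(update)
    return remaining
```

With a check like this, a graph view by a user without write access would leave the buffered data in place for a later run by a user that can write.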
Re: Gaps in some graphs, maybe related to aggregates
This explains the situation. Thanks a lot again. It seems this really fixed the issue. So for me this is solved.
However, I think there should really be a check whether the data can be written before it is lost.
The situation on my site stems from, let's say, a historic setup. The main issue is that there is a user called cacti on our system while the webserver runs as a different user (apache), which in my eyes is not a bad setup if you have several applications on the webserver. Although cacti is in the apache group and apache is in the cacti group, this does not fix the issue by itself. Both users have low user IDs, below 200, and this is CentOS, which uses umask 022 for IDs lower than 200. So files created by the poller are not writable by apache, and files created by apache are not writable by the cacti user. In the directory I found files owned by cacti:cacti with rw-rw-r--, others owned by cacti:cacti with rw-r--r--, and finally some owned by apache:apache with rw-rw-r-- (the last ones exist because a colleague was already heading in the right direction and set group rw manually via a cron job). So if the poller initially created a file with rw-r--r--, on-demand updating will break, because the webserver on my site does not run as the cacti user. If for some reason the apache user created the file, the poller will not be able to write to it.
Looking at the documentation, nothing is really said about which user context should be used. The "Installing on CentOS 7" guide only hints at the apache user in the cron job to be created. The "Installing Cacti 1.x in Ubuntu/Debian LAMP stack" guide says "For systemd unit's file install, you will need to modify the included units file to following your install location and desired user and group's to run the Cacti poller as." Following this guide tells me I can use a different user for the poller, but it does not consider the webserver. And honestly, looking at other people's reports about gaps, I believe they may be having a similar issue.
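For anyone who wants to check their own installation for the same pattern, here is a small sketch in Python that lists RRD files lacking group write permission (the rw-r--r-- case described above). The function name `find_not_group_writable` is made up for this example, and the directory you pass in is whatever directory holds your RRD files; it assumes, as on my system, that the poller and webserver users share a group.

```python
import os
import stat


def find_not_group_writable(rra_dir):
    """Return (path, mode string) for every .rrd file without group write.

    Hypothetical helper for illustration; rra_dir is the directory that
    contains the RRD files on your system.
    """
    problems = []
    for root, _dirs, files in os.walk(rra_dir):
        for name in files:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(root, name)
            mode = os.stat(path).st_mode
            # S_IWGRP is the group write bit; rw-r--r-- does not have it.
            if not mode & stat.S_IWGRP:
                problems.append((path, stat.filemode(mode)))
    return sorted(problems)
```

Files reported by this check are candidates for the gaps: whichever of the two users did not create the file cannot write to it.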