NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
-
- Posts: 6
- Joined: Sun Oct 11, 2020 3:37 pm
NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
There is a great project, https://sourceforge.net/projects/nvgpu-smi-snmp/, that pulls NVIDIA GPU data from nvgpu-smi and makes it available to SNMPD. It is running and good to go:
I can grab all the NVIDIA OIDs via snmpwalk, both locally and remotely:
craig@pop-os:~/Documents/nvidia/nvgpu-smi-snmp-1.1$ snmpwalk -c public -v2c localhost 1.3.6.1.4.1.2021.13.42.2
iso.3.6.1.4.1.2021.13.42.2.1.1.0 = INTEGER: 0
iso.3.6.1.4.1.2021.13.42.2.1.2.0 = STRING: "GeForce GTX 1060 3GB"
iso.3.6.1.4.1.2021.13.42.2.1.3.0 = STRING: "86.06.59.00.69"
iso.3.6.1.4.1.2021.13.42.2.1.4.0 = STRING: "418.152.00"
iso.3.6.1.4.1.2021.13.42.2.1.8.0 = INTEGER: 3016
iso.3.6.1.4.1.2021.13.42.2.1.10.0 = INTEGER: 54
iso.3.6.1.4.1.2021.13.42.2.1.13.0 = INTEGER: 102
iso.3.6.1.4.1.2021.13.42.2.1.24.0 = INTEGER: 253
iso.3.6.1.4.1.2021.13.42.2.1.25.0 = INTEGER: 405
iso.3.6.1.4.1.2021.13.42.2.1.26.0 = STRING: "GPU-ae167601-2ba0-890c-7aca-aa14ba4deb47"
iso.3.6.1.4.1.2021.13.42.2.1.27.0 = INTEGER: 8
iso.3.6.1.4.1.2021.13.42.2.1.28.0 = INTEGER: 0
iso.3.6.1.4.1.2021.13.42.2.1.29.0 = INTEGER: 0
iso.3.6.1.4.1.2021.13.42.2.1.30.0 = INTEGER: 1815
iso.3.6.1.4.1.2021.13.42.2.1.31.0 = INTEGER: 12
iso.3.6.1.4.1.2021.13.42.2.1.32.0 = INTEGER: 120
Everything is configured in Cacti and it looks like it should be creating graphs, but the graphs are blank.
Running Debug on the graphs gets the dreaded "Valid Data X".
Looking deeper: "Data Source returned Bad Results for snmp_oid". But, but, but... the data is good! I don't get why the result would be bad:
craig@pop-os:~$ snmpwalk -c public -v2c localhost .1.3.6.1.4.1.2021.13.42.2.1.10
iso.3.6.1.4.1.2021.13.42.2.1.10.0 = INTEGER: 54
craig@pop-os:~$ snmpwalk -c public -v2c localhost NV-CTRL-MIB::nvCtrlGPUCoreTemp
NV-CTRL-MIB::nvCtrlGPUCoreTemp.0 = INTEGER: 56
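One thing worth ruling out here, offered as a guess rather than a confirmed diagnosis: snmpwalk issues GETNEXT requests, so walking .1.3.6.1.4.1.2021.13.42.2.1.10 succeeds even though the value lives at the instance OID ending in .0, whereas a plain SNMP GET needs that full instance OID. The Python sketch below parses walk output like the above and reports whether a given OID is an exact instance. The names parse_walk, normalize_oid and is_exact_instance are made up for illustration and are not part of Cacti or Net-SNMP.

```python
import re

# Matches Net-SNMP walk lines such as:
#   iso.3.6.1.4.1.2021.13.42.2.1.10.0 = INTEGER: 54
WALK_LINE = re.compile(r'^(?P<oid>\S+) = (?P<type>[A-Za-z0-9-]+): (?P<value>.*)$')


def normalize_oid(oid):
    """Turn 'iso.3.6...' or '.1.3.6...' into a bare '1.3.6...' form."""
    oid = oid.strip()
    if oid.startswith('iso.'):
        oid = '1.' + oid[len('iso.'):]
    return oid.lstrip('.')


def parse_walk(text):
    """Return {normalized_oid: (type, value)} for every parsable walk line."""
    results = {}
    for line in text.splitlines():
        match = WALK_LINE.match(line.strip())
        if match:
            value = match.group('value').strip().strip('"')
            results[normalize_oid(match.group('oid'))] = (match.group('type'), value)
    return results


def is_exact_instance(oid, walked):
    """True only if the OID appears in the walk exactly (including any .0)."""
    return normalize_oid(oid) in walked


# Two lines copied from the walk output in this post.
sample = """\
iso.3.6.1.4.1.2021.13.42.2.1.2.0 = STRING: "GeForce GTX 1060 3GB"
iso.3.6.1.4.1.2021.13.42.2.1.10.0 = INTEGER: 54
"""
walked = parse_walk(sample)
print(is_exact_instance('.1.3.6.1.4.1.2021.13.42.2.1.10', walked))    # False
print(is_exact_instance('.1.3.6.1.4.1.2021.13.42.2.1.10.0', walked))  # True
```

If the OID entered in the Cacti data source turns out to be the column OID without the trailing .0, that would be one possible explanation for a bad result, but nothing in this thread confirms it.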
More details:
Running the debug produced this in cacti.log:
10/11/2020 14:40:03 - DSDEBUG Bad Data Found for Data Source ID 6
10/11/2020 14:40:03 - DSDEBUG Bad Data Found for Data Source ID 7
10/11/2020 14:40:03 - DSDEBUG Bad Data Found for Data Source ID 10
10/11/2020 14:40:03 - DSDEBUG Bad Data Found for Data Source ID 8
10/11/2020 14:40:03 - DSDEBUG Bad Data Found for Data Source ID 9
During normal polling, this error shows up over and over in cacti.log:
10/11/2020 15:20:01 - POLLER: Poller[1] WARNING: Invalid Response(s), Errors[5] Device[1] Thread[1] DS[6, 7, 8, 9, 10]
10/11/2020 15:20:02 - SYSTEM STATS: Time:1.2265 Method:cmd.php Processes:1 Threads:0 Hosts:1 HostsPerProcess:1 DataSources:10 RRDsProcessed:10
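To keep track of which data sources the poller is complaining about, the warning lines can be summarized with a few lines of Python. This is only a sketch that assumes the log format is exactly as pasted above; failing_data_sources is a hypothetical helper name, not something Cacti provides.

```python
import re

# Pattern built from the "Invalid Response(s)" warning line shown above.
WARNING = re.compile(
    r'Poller\[(?P<poller>\d+)\] WARNING: Invalid Response\(s\), '
    r'Errors\[(?P<errors>\d+)\] Device\[(?P<device>\d+)\] '
    r'Thread\[(?P<thread>\d+)\] DS\[(?P<ds>[\d, ]+)\]'
)


def failing_data_sources(log_text):
    """Map device ID -> sorted list of data source IDs flagged as invalid."""
    failing = {}
    for line in log_text.splitlines():
        match = WARNING.search(line)
        if match:
            ids = {int(part) for part in match.group('ds').split(',') if part.strip()}
            failing.setdefault(int(match.group('device')), set()).update(ids)
    return {device: sorted(ids) for device, ids in failing.items()}


# The two log lines quoted in this post.
log = """\
10/11/2020 15:20:01 - POLLER: Poller[1] WARNING: Invalid Response(s), Errors[5] Device[1] Thread[1] DS[6, 7, 8, 9, 10]
10/11/2020 15:20:02 - SYSTEM STATS: Time:1.2265 Method:cmd.php Processes:1 Threads:0 Hosts:1 HostsPerProcess:1 DataSources:10 RRDsProcessed:10
"""
print(failing_data_sources(log))  # {1: [6, 7, 8, 9, 10]}
```

Here the five failing IDs (6 through 10) match the five "Bad Data Found" debug lines, while the SYSTEM STATS line reports 10 data sources in total.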
This is all running on the same machine. Very cool to get graphs of GPU utilization - Let's Rock This, Fix IT, and Share with the Cacti-Universe!
Thanks!
Craig
All About Me http://radiantcreators.com/about
Re: NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
Help! : )
Re: NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
What Cacti version? Everything looks correct. Can you post what your Data Template looks like (instead of the Data Source)?
Re: NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
Version 1.2.10
Not using Spine.
Thanks for the reply. I hope this is the info you asked for.
All the GPU* graphs I created use the "SNMP - Generic OID Template".
The "SNMP - Generic OID Template" is left at its defaults.
Re: NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
From the logs:
10/12/2020 23:50:02 - POLLER: Poller[1] WARNING: Invalid Response(s), Errors[5] Device[1] Thread[1] DS[6, 7, 8, 9, 10]
10/12/2020 23:50:02 - SYSTEM STATS: Time:1.2417 Method:cmd.php Processes:1 Threads:0 Hosts:1 HostsPerProcess:1 DataSources:10 RRDsProcessed:10
I wish there was a way to "SEE" what the "Invalid Response" actually is.
An snmpwalk seems to return a valid response:
craig@pop-os:~$ snmpwalk -c public -v1 localhost .1.3.6.1.4.1.2021.13.42.2.1.10
iso.3.6.1.4.1.2021.13.42.2.1.10.0 = INTEGER: 48
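One way to "see" what a single-OID request returns, as opposed to a walk, is to run Net-SNMP's snmpget against the exact OID configured in the data source. The Python sketch below builds that command and classifies its output. It is an illustration only: build_snmpget, classify and probe are hypothetical helper names, and the "No Such Instance" / "No Such Object" strings are what snmpget normally prints for SNMPv2c, which is treated here as an assumption. It does not reproduce what Cacti's poller does internally.

```python
import subprocess


def build_snmpget(host, community, oid, version='2c'):
    """Command line for a single SNMP GET; -On prints OIDs numerically."""
    return ['snmpget', '-v', version, '-c', community, '-On', host, oid]


def classify(output):
    """Rough classification of snmpget output (assumed v2c message texts)."""
    if 'No Such Instance' in output:
        return 'no-such-instance'
    if 'No Such Object' in output:
        return 'no-such-object'
    if ' = ' in output:
        return 'ok'
    return 'unknown'


def probe(host, community, oid):
    """Run snmpget (must be installed) and classify what comes back."""
    done = subprocess.run(build_snmpget(host, community, oid),
                          capture_output=True, text=True, timeout=10)
    return classify(done.stdout + done.stderr)


# Classification demonstrated on canned strings, so nothing is executed here.
print(classify('.1.3.6.1.4.1.2021.13.42.2.1.10.0 = INTEGER: 48'))
print(classify('.1.3.6.1.4.1.2021.13.42.2.1.10 = No Such Instance currently exists at this OID'))
```

Calling probe('localhost', 'public', '.1.3.6.1.4.1.2021.13.42.2.1.10.0') on the Cacti host, and then again without the trailing .0, would show whether the two forms behave differently for a GET.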
Re: NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
I seem to recall an issue when using "SNMP - Generic OID Template" but don't remember what version it was in.
What I would do instead is create a new "Data Source Template" using "Get SNMP Data" as the method and put the OID in there. Then change the Graph Template to use that Data Template instead. Delete your old graphs and recreate them. I just did it this way for some Tripp Lite PDU graphs without issue on 1.2.14 a few days ago.
Do you have your Logging set to MEDIUM? You should be getting the actual results in the logs if so.
Re: NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
Thanks, hacking at it now. Will reply when I get something working.
cigamit wrote: ↑Tue Oct 13, 2020 1:30 am
I seem to recall an issue when using "SNMP - Generic OID Template" but don't remember what version it was in.
What I would do instead, is create a new "Data Source Template" using "Get SNMP Data" as the method and put the OID in there. Then change the Graph Template to use that Data Template instead. Delete your old graphs and recreate them. I just did it this way for some Tripplite PDU graphs without issue on 1.2.14 a few days ago.
Do you have your Logging set to MEDIUM? You should be getting the actual results in the logs if so.
Re: NVIDIA GPU Graphs ALMOST Working (Empty Graphs)
Thanks so much for the help. I have been hacking at this but am not quite getting it working. Could I trouble you to list the steps to make graphs this way?
THANKS!
Craig