| World Community Grid Forums
Thread Status: Active Total posts in this thread: 9
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Not sure where this belongs, as it really isn't a problem in the BOINC client, the web site, or any individual science. This seems like the best place for now, and the admins can move it as necessary. I started getting segmentation violations in HSTB and ARP1, along with this error in FAH2:

%IMPACT-E: Non-valid values generated from rrespa. This is probably because of bad initial geometry. Please run minimization process for some steps before running MD

At first, I thought it might have been due to running ClimatePrediction and WCG together on the same system. Prior to upgrading to Ubuntu 19.10, I ran 128 concurrent FAH2 tasks for months without problems. I'm only using 60% of the 256GB of memory, so memory isn't stressed. I thought it might also be related to some incompatibility with the AMD EPYC processor. Interestingly, I upgraded a second server, this one with an Intel processor, to 19.10 and started getting the same errors on that system too. It's looking more and more like something related to the 19.10 upgrade (Linux kernel is 5.3). Systems not upgraded are running the same mix of work with no problems. I may try to upgrade one more server with a different configuration and see if the errors occur a third time. This is more of an FYI in case anyone is interested...
----------------------------------------
[Edit 1 times, last edit by Doneske at Nov 12, 2019 3:16:21 PM] |
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
OK, I will give you something to chew on. I noticed a few segmentation violations on CPDN on one machine with 64 GB of memory (four 16 GB modules). I would link to them, but CPDN is down at the moment. I didn't see them on the other machines, which have only two 16 GB modules.

That leads me to believe that the use of four modules is occasionally unstable. I run them at the "rated" speed of 2666 MHz, but that is really defined for only two modules and is not guaranteed for four. Therefore, even though you are seeing the segmentation errors on WCG and not CPDN, it could be that the tasks are being pushed into the higher memory regions, and that is what is causing the problem. You might want to reduce the speed or the number of modules.

On the other hand, I think I have seen an occasional segmentation error here on WCG too, but only on MIP or MCM thus far. They are infrequent, and probably not due to memory on those machines.

EDIT: The 64 GB of memory was on an i7-8700. I have no idea about EPYCs, but can imagine that they are picky about memory. My Ryzens are.

EDIT2: All my machines are on Ubuntu 18.04.3. I wish WCG would show the details.
----------------------------------------
[Edit 4 times, last edit by Jim1348 at Nov 13, 2019 4:40:34 PM] |
||
|
|
halldor.usa
Advanced Cruncher USA Joined: Nov 24, 2006 Post Count: 115 Status: Offline Project Badges:
|
I'm running Clear Linux on 2 machines with Linux kernel 5.3, and I am also seeing segmentation violations.

I have seen that issue only with HST1. I've seen the error on every HST1 work unit (4 over the last 2 weeks). I'm not sure when it started. I'm running ARP1 and MIP1 without errors.

The error is:

Result Name: HST1_ 305777_ 000099_ MC0018_ T400_ F00019_ S00022_ 1--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
SIGSEGV: segmentation violation
</stderr_txt>
]]>
----------------------------------------
[Edit 9 times, last edit by halldor.usa at Nov 13, 2019 8:18:00 PM] |
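For anyone who wants to scan saved result logs for this kind of failure, here is a minimal Python sketch. It assumes the log text looks like the sample quoted in the post above; the function names `find_signal` and `is_segfault` are made up for illustration and are not part of BOINC or WCG.

```python
import re

# Sample log text, copied from the result quoted in the post above.
SAMPLE = """<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
SIGSEGV: segmentation violation
</stderr_txt>
]]>"""

# Matches the "process got signal N" message seen in the sample.
SIGNAL_RE = re.compile(r"process got signal (\d+)")


def find_signal(result_log):
    """Return the signal number reported in a result log, or None."""
    match = SIGNAL_RE.search(result_log)
    return int(match.group(1)) if match else None


def is_segfault(result_log):
    """True if the log reports signal 11 (SIGSEGV on Linux) or names SIGSEGV."""
    return find_signal(result_log) == 11 or "SIGSEGV" in result_log
```

Calling `is_segfault(SAMPLE)` on the text above returns `True`. Logs from other projects may word the failure differently, so the pattern would likely need adjusting.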
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The EPYC server has 16 modules due to having 8 memory channels per processor. I'm only using the first slot in each channel, so I'm getting the maximum speed. This machine was built to my workload specifications and tested at Dell at full load prior to shipping. I also ran 128 concurrent FAH2 tasks for months on this server prior to doing the upgrade, with zero WU errors. Having said that, I originally thought it was probably a hardware error, but now I'm not so sure. It passes all memory tests. Running CPDN and FAH2 does drive more memory utilization, but it never gets over 75%. I also thought it might be related to some sort of cache pollution, but I have not been able to monitor that. Then I upgraded another server to 19.10 and got the same errors as on the previous machine without any CPDN running at all, just WCG. I'm leaning towards something in the 5.3 kernel or a mismatch in some library code. The errors are quite sporadic, and I have only seen 6 between the 2 machines in the last 3 or 4 days.
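As a rough way to check a utilization figure like the 75% mentioned above, the following Python sketch computes memory in use from `/proc/meminfo`-style text. It assumes a Linux kernel that reports both `MemTotal` and `MemAvailable`; the helper names are hypothetical, and the poster may well be measuring utilization some other way.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name: <number> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(text):
    """Percent of RAM in use: (MemTotal - MemAvailable) / MemTotal * 100."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total
```

On a Linux host this could be used as `used_percent(open("/proc/meminfo").read())`. It only gives a point-in-time reading, so it would have to be sampled repeatedly to catch peaks.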
|
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
It looks like you are on top of it. I will stay with 18.04.3.
|
||
|
|
halldor.usa
Advanced Cruncher USA Joined: Nov 24, 2006 Post Count: 115 Status: Offline Project Badges:
|
Doneske,
Depending on when you can next restart one of your machines, you might try booting Ubuntu with an earlier kernel. I would try that myself, but I just don't get enough HST1 work units. |
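When comparing hosts before and after such a test, it may help to record which kernel each one is actually running. This is a small Python sketch under the assumption that the release string starts with `major.minor`; the function names are hypothetical, and the release strings in the usage note are illustrative, not taken from any machine in this thread.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_suspect_kernel(release, suspect=(5, 3)):
    """True if the release belongs to the suspect major.minor series (5.3 here)."""
    return kernel_version(release) == suspect
```

For example, `is_suspect_kernel("5.3.0-19-generic")` is true and `is_suspect_kernel("4.15.0-70-generic")` is false. The running kernel's release string can be obtained with `platform.release()` from the Python standard library.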
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I really can't reboot for a while, as I have about 40 CPDN jobs running and they are very sensitive to interruptions. It will probably be January before I can restart either machine.
|
||
|
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges:
|
Here is my i7-8700 with the CPDN segmentation violations.

https://www.cpdn.org/results.php?hostid=1492331

Note that it shows 32 GB, but that is the memory now. It was 64 GB (four modules) at the time of the errors. Whether the reduced memory fixes it is not clear yet; I will have to run more.

And I see it occasionally on my Ryzens and i7-9700 too; they may have memory problems also.

i7-9700: https://www.cpdn.org/results.php?hostid=1492821

I will have to look into it more. Maybe it is just a CPDN bug?

If there is a lesson, it may be that Ubuntu 19.10 is no worse than the others overall. It may just happen to have the problems on certain WCG projects.

But I have never found that a memtest picks up instability with multiple modules. Maybe it tests each module individually? At any rate, they always show up as good.

The last lesson is that running WCG or anything else with CPDN is probably asking for trouble. Their work started out on mainframes, and trying to adapt it to PCs has always been problematic.
----------------------------------------
[Edit 6 times, last edit by Jim1348 at Nov 14, 2019 1:38:03 PM] |
||
|
|
halldor.usa
Advanced Cruncher USA Joined: Nov 24, 2006 Post Count: 115 Status: Offline Project Badges:
|
@Doneske,
Yes, I understand. Good luck!
----------------------------------------
[Edit 1 times, last edit by halldor.usa at Nov 14, 2019 1:41:58 PM] |
||
|
|
|