| World Community Grid Forums
Thread Status: Active Total posts in this thread: 86
rcthardcore
Cruncher United States Joined: Jan 29, 2009 Post Count: 13 Status: Offline Project Badges:
|
No problems on any of my other BOINC projects. No problems with any of my games. My OS has never crashed. No faulty hardware. Everything verified as working as intended. Process of elimination in action. MCM has some faulty code somewhere. It doesn't happen on every work unit. I have troubleshot every electronic part on my motherboard, memory chips, CPU, GPU, etc. No problems whatsoever. No overheating problems. Case has 9 fans. CPU and GPU are liquid cooled. I do not get paid to do QA work for the project. It is not my job to do that. Not my responsibility.
----------------------------------------
AMD Ryzen 9 5950x
NVIDIA RTX 3090 FE 128 GB DDR4-3200 Windows 10 64-bit 21H1 |
||
|
|
Grumpy Swede
Master Cruncher Svíþjóð Joined: Apr 10, 2020 Post Count: 2498 Status: Offline Project Badges:
|
Thousands upon thousands of people, with all kinds of computers and OSes, are running MCM without any stuck units. Then you complain about stuck units on your computer, and all of a sudden your conclusion is that there's faulty code in MCM.
Yeah, that's likely....NOT!! Don't you think that others would have complained "en masse" about that by now? So far we've returned 1,640,407,729 MCM results, running the MCM code for 729019 years, 229 days, 11 hours, 57 minutes, 22 seconds, and not many people have complained about any stuck units. Logical conclusion: It's your computer that can't handle the strain of the MCM code, not the MCM code itself.
----------------------------------------
[Edit 4 times, last edit by Grumpy Swede at Sep 30, 2021 9:40:33 PM] |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
@rcthardcore
Actually, whilst you may well be right (though I suspect any issue is more likely to be in the support wrapper than the actual MCM1 application...) you haven't eliminated the possibility that on your system the combination of work you're running happens to find an issue in either the BOINC support code or somewhere in Windows that only adversely affects MCM1 in an obvious way.
I am not sure that it's just MCM1 at WCG that sees the "stuck task" syndrome - I've seen it reported at Einstein and MilkyWay on occasion (restarting BOINC or rebooting seem to be the recommended cures there too!) Hence my suggestion that it could be something in the BOINC support code that gets confused every now and then...
"I do not get paid to do QA work for the project. It is not my job to do that. Not my responsibility."
"Not my circus, not my monkey." Except that as a community we try to help one another, which includes some of us trying to establish the causes of problems that didn't show up at Beta Test time (there is no such thing as bug-free software, I fear!)
In your case I'd be interested to know several things:
I hope you find some resolution to your problem - I wonder if some system workload tuning may be of help! Cheers - Al. |
||
|
|
rcthardcore
Cruncher United States Joined: Jan 29, 2009 Post Count: 13 Status: Offline Project Badges:
|
BOINC version 7.16.11
I run 12 MCM work units at one time.
Sometimes I run either PrimeGrid GPU or Collatz GPU work units at the same time I am running MCM.
I do not run any other WCG projects at the same time as MCM.
I suspend all work units and projects when I shut BOINC down.
I do not run BOINC when I am gaming.
I suspect that this could possibly be a BOINC issue and not specifically related to MCM or any other projects. I will continue to run WCG projects regardless.
----------------------------------------
AMD Ryzen 9 5950x
NVIDIA RTX 3090 FE 128 GB DDR4-3200 Windows 10 64-bit 21H1 |
||
|
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges:
|
"AMD Ryzen 9 5950x
NVIDIA RTX 3090 FE 128 GB DDR4-3600 Windows 10 64-bit 21H1"
128 GB is a lot of memory; is some of your DDR4 overheating or faulty? More RAM = more heat. Faster RAM = more heat. Try running the Prime95 torture test and FurMark at the same time for maximum heat. Any Prime95 errors or computer crashes?
Is your DDR4 all unbuffered ECC (Error Correcting Code) on a compatible motherboard, or is it non-ECC, with its risk of memory corruption? As I said before, my non-ECC RAM had some issues with hung MCM1 tasks, ARP1 invalids, and filesystem corruption. Since the switch to DDR4 unbuffered ECC it works fine for me, with no more hung tasks and no more corruption, so it looks like my old non-ECC RAM was going bad.
My AMD FX-4100 (Linux) and Ryzen 2700x (Win10) both now have an amazing uptime of 112 days so far. Both use ECC unbuffered RAM. |
||
|
|
rcthardcore
Cruncher United States Joined: Jan 29, 2009 Post Count: 13 Status: Offline Project Badges:
|
No RAM overheating issues whatsoever. I suspect MCM has the same memory problem that Open Pandemics has.
----------------------------------------
AMD Ryzen 9 5950x
NVIDIA RTX 3090 FE 128 GB DDR4-3200 Windows 10 64-bit 21H1 |
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1317 Status: Offline Project Badges:
|
@rcthardcore
Thanks for answering my list of points - I just thought it might be worth showing the checklist in case someone found it of use(!), but your reply didn't have anything one could point a finger at!
"I suspect that this could possibly be a BOINC issue and not specifically related to MCM or any other projects."
I sometimes wonder if tasks hang up because the thread synchronization loses track of something somewhere (which may just be down to circumstances, explaining why it's not a frequently reported issue.) Over at MilkyWay they use OpenMP for a multi-threaded application and occasionally someone will post about a task that is spinning its wheels... But, again, the number of posts on that topic is far fewer than the number of tasks being completed, suggesting that it is not a very frequent issue (or that lots of people running that application never notice their system has problems?)
"...I suspect MCM has the same memory problem that Open Pandemics has."
If MCM1 is leaking memory, it is nowhere near as bad as OPN1; I just looked at my MCM1 tasks and all of them have working set and virtual memory size comfortably below 100MB! That's on Linux in my case, but the OPN1 memory leak was happening on Linux as well as on Windows, so I think the observation is valid. So if it is a memory use issue it's unlikely to be the same one as OPN1... Not much help, I'm afraid :-( (But I'll be interested to see where that leak turns out to be...)
All that said, if suspending (with LAIM off) and resuming a task gets it going again, at least there's a work-around for the issue...
Cheers - Al. |
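For anyone on Linux who wants to repeat the working set / virtual memory check described above, here is a rough sketch rather than an official tool. It reads the `VmRSS` and `VmSize` fields from `/proc/<pid>/status`. The process-name prefix `wcgrid_mcm1` is an assumption taken from the executable name quoted elsewhere in this thread, and the function names are made up for illustration.

```python
import os

def parse_status(text):
    """Pull the process name plus VmRSS/VmSize (converted to MB) out of
    the text of a Linux /proc/<pid>/status file."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            info["name"] = value.strip()
        elif key in ("VmRSS", "VmSize"):
            # The kernel reports these as e.g. "   51200 kB"
            info[key] = int(value.split()[0]) / 1024
    return info

def mcm_memory(prefix="wcgrid_mcm1", proc="/proc"):
    """Return (pid, info) for every process whose name starts with prefix.
    The prefix is an assumption; adjust it to match your task names."""
    results = []
    if not os.path.isdir(proc):
        return results  # not Linux, or /proc is unavailable
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                info = parse_status(fh.read())
        except OSError:
            continue  # process exited or is not readable
        if info.get("name", "").startswith(prefix):
            results.append((int(entry), info))
    return results
```

Calling `mcm_memory()` while tasks are running and printing each entry gives a quick way to see whether any task is far above the roughly 100MB mentioned above.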
||
|
|
sam6861
Advanced Cruncher Joined: Mar 31, 2020 Post Count: 107 Status: Offline Project Badges:
|
I don't see problems with MCM1.
The software issue could possibly be an anti-virus false positive or bug, game optimizer software that can freeze BOINC tasks, or malware/viruses such as stealthy mining malware stealing CPU/GPU usage. If you switch to a fresh Linux/Windows installation and the issues do not occur there, it could well be a software issue.
If the CPU is overclocked, it can become unstable if pushed too high.
I may still point at something about DDR4. The Ryzen 5950x memory controller is rated for DDR4-3200. Running at 3600 with all 4 memory slots used is possibly pushing too far into instability. Try again with the DDR4 speed reduced to 3200: do the random task hangs stop?
Ryzen 3900x memory controller rated DDR4-3200, and I use DDR4-3200 ECC UDIMM 1.2v.
- 2x16GB at 3466 1.2volt: works, so I may be taking a small risk but it works thanks to ECC.
-- Works fine at 100%, 16 MCM1 + 8 ARP1 + OPNG works.
Ryzen 2700x memory controller rated DDR4-2933, and I use DDR4-3200 ECC UDIMM 1.2v.
- 4x16GB at 2933: Works.
- 4x16GB at 3200: Corrected errors everywhere in memtest86 and a random reboot.
- 2x16GB at 3200: Works, 119 days uptime, using this for fastest speed. 100% BOINC tasks all work. |
||
|
|
Felix Kaeufer
Cruncher Joined: Feb 3, 2012 Post Count: 29 Status: Offline Project Badges:
|
Probably not related to the aforementioned issue, I have four MCM tasks stuck initializing. Two have likely started before upgrading to macOS 12 (.0.1) and are stuck at a moderate time. The other two have just started and won't make it past a few seconds, before the timer resets.
stderr.txt repeatedly states:
22:30:35] Initializing Commandline = wcgrid_mcm1_map_7.44_x86_64-apple-darwin -SettingsFile MCM1_0183579_7923.txt -DatabaseFile dataset-sarc1.txt
I'm pretty certain it's related to the OS upgrade. Is there a known workaround? |
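A task that keeps restarting like this can be spotted by counting how often the "Initializing" line shows up in its stderr.txt. The snippet below is only a sketch of that idea: the function names and the threshold of 3 are assumptions for illustration, not anything defined by BOINC or WCG.

```python
import re

# Matches the "Initializing" marker seen in the stderr.txt excerpt above.
INIT_RE = re.compile(r"\bInitializing\b")

def count_initializations(stderr_text):
    """Count the lines in stderr.txt that contain the Initializing marker."""
    return sum(1 for line in stderr_text.splitlines() if INIT_RE.search(line))

def looks_stuck(stderr_text, threshold=3):
    """Heuristic: a task that has re-initialized at least `threshold` times
    is probably restarting in a loop. The threshold is an arbitrary guess."""
    return count_initializations(stderr_text) >= threshold
```

To use it, read the task's stderr.txt from its BOINC slot directory into a string and pass that string to `looks_stuck`. A task that restarts legitimately after a suspend will also log the marker more than once, so treat a positive result as a hint and not as proof.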
||
|
|
Falconet
Master Cruncher Portugal Joined: Mar 9, 2009 Post Count: 3315 Status: Offline Project Badges:
|
"Probably not related to the aforementioned issue, I have four MCM tasks stuck initializing. Two have likely started before upgrading to macOS 12 (.0.1) and are stuck at a moderate time. The other two have just started and won't make it past a few seconds, before the timer resets. stderr.txt repeatedly states: 22:30:35] Initializing Commandline = wcgrid_mcm1_map_7.44_x86_64-apple-darwin -SettingsFile MCM1_0183579_7923.txt -DatabaseFile dataset-sarc1.txt I'm pretty certain it's related to the OS upgrade. Is there a known workaround?"
Maybe it's best to just abort them.
----------------------------------------
- AMD Ryzen 5 1600AF 6C/12T 3.2 GHz - 85W
- AMD Ryzen 5 2500U 4C/8T 2.0 GHz - 28W
- AMD Ryzen 7 7730U 8C/16T 3.0 GHz |
||
|
|
|