World Community Grid Forums
Category: Completed Research Forum: Microbiome Immunity Project Thread: Lots of MIP1 WUs error out |
Thread Status: Active Total posts in this thread: 70
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
"There seem to be a lot of MIP1 jobs with one long structure at present, and I notice progress for these seeming to stop at 40% for 10 to 15 minutes before going up to 60% and sticking at that value until the job finishes (at which point it shows 100% as expected)."
Thanks, I had never noticed it, because I run my machines 24/7 and don't look at the progress very often. But my Ryzen 3600 (Ubuntu 18.04.5) runs four MIP tasks at a time, and over the past seven days they have averaged 50 minutes each. The longest one took 2 hours 56 minutes. Right now I see one of them "stuck" at 7.500% and two others at 30.000%. I never paid much attention to it before. (No errors, though.)
[Edit 1 times, last edit by Jim1348 at Sep 12, 2020 5:00:46 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
app_config.xml
----------------------------------------
</app>
<app>
  <name>mip1</name>
  <max_concurrent>2</max_concurrent>
  <fraction_done_exact/>
</app>
I am not paying much attention either, but the fraction_done_exact flag is supposed to force progress to be computed from the FPOPS estimate in the header versus the actual computing time so far. At the moment I have one that, after 40 minutes, is at 40.000% without a logged checkpoint. PS: I have never had an error for MIP1, except occasionally after booting.
[Edit 1 times, last edit by Former Member at Sep 12, 2020 5:20:10 PM] |
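For anyone who wants to try the setting quoted above, here is a minimal sketch in Python that wraps that fragment in a complete app_config.xml and checks that the result is well-formed. The element names come from the post and from BOINC's app_config format; the helper name `build_app_config` and the idea of generating the file from a script are assumptions for illustration, and the post itself showed only part of a larger file.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text that limits one app's concurrent tasks.

    Hypothetical helper for illustration; it is not part of BOINC.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # Empty flag element, as in the fragment quoted in the post.
    ET.SubElement(app, "fraction_done_exact")
    # Pretty-print when the running Python supports it (3.9 and later).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


xml_text = build_app_config("mip1", 2)
# Parsing the text back confirms it is well-formed XML.
parsed = ET.fromstring(xml_text)
print(xml_text)
```

The generated text would normally be saved as app_config.xml in the project's folder under the BOINC data directory and picked up when the client re-reads its config files or restarts.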
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline Project Badges: |
"There seem to be a lot of MIP1 jobs with one long structure at present, and I notice progress for these seeming to stop at 40% for 10 to 15 minutes before going up to 60% and sticking at that value until the job finishes (at which point it shows 100% as expected)."
This isn't what I'm seeing. When I turn MIP1 back on (like this morning), I still get segfaults for every single MIP1 (and only MIP1) WU that tries to run. They *do* last report progress at 60%, but they are not completing. They do not upload; they do not appear in the joblog; they do show up in coredumpctl. These jobs are still segfaulting, 100% of the time.
It's also not a memory issue. Not as in memory capacity, anyway; there's plenty of that to go around on the machines in question. Also, Rosetta@Home jobs are running on the same machines and are fine.
@uplinger I have a core file if that would help.
[Edit 1 times, last edit by mdxi at Sep 21, 2020 1:01:15 AM] |
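Others who want to check whether their own hosts are recording the same kind of crash could list the entries that coredumpctl holds. The following is a rough sketch, assuming a Linux host with systemd-coredump; the function name and the match string are hypothetical, since the thread does not give the name of the MIP1 executable.

```python
import shutil
import subprocess
from typing import List


def list_core_dumps(match: str) -> List[str]:
    """Return the `coredumpctl list` lines for a match.

    `match` may be a PID, a command name, or an executable path.
    Returns an empty list when coredumpctl is missing or nothing matches.
    """
    if shutil.which("coredumpctl") is None:
        return []  # host does not use systemd-coredump
    result = subprocess.run(
        ["coredumpctl", "list", "--no-pager", "--no-legend", match],
        capture_output=True,
        text=True,
        check=False,
    )
    # coredumpctl exits non-zero when no core dumps match.
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


# "hypothetical_mip1_binary" is a placeholder, not the real executable name.
for entry in list_core_dumps("hypothetical_mip1_binary"):
    print(entry)
```

Once an entry is found, `coredumpctl info` with the same match prints the signal and other details recorded for that crash.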
||
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 325 Status: Offline Project Badges: |
The failure of MIP1 work units when starting up a PC is a long-standing problem. A way of avoiding it is to hibernate the PC instead of shutting it down overnight. Hibernating even avoids losing the work done since the previous checkpoint.
|
||
|
julemand101
Cruncher Denmark Joined: Feb 28, 2020 Post Count: 7 Status: Offline Project Badges: |
Any new status on this issue? My Arch Linux server is still failing with the same error. If you need any technical details or help with testing some ideas, please let me know.
|
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Recently Active Project Badges: |
One thing you could try is to shut down or snooze BOINC before shutting down the machine. Once you restart, simply unsnooze BOINC (if you have it set to auto start when your machine resumes) or restart BOINC (if your machine does not autostart BOINC when it restarts).
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
mdxi
Advanced Cruncher Joined: Dec 6, 2017 Post Count: 109 Status: Offline Project Badges: |
My machines run 24/7, excepting kernel upgrades (which I do roughly biweekly), which require a full OS reboot. There is no hibernation or daily startup. Also, as previously stated:
----------------------------------------
|
||
|
julemand101
Cruncher Denmark Joined: Feb 28, 2020 Post Count: 7 Status: Offline Project Badges: |
My server is running 24/7, so I don't have any scenario where I am powering off BOINC. And since ALL my MIP1 tasks (multiple every day) are failing on this Linux server with the same error message, I don't think the problem comes from powering off the BOINC process badly.
mdxi: I see the same on my server.
----------------------------------------
[Edit 1 times, last edit by julemand101 at Sep 29, 2020 5:57:35 PM] |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7579 Status: Recently Active Project Badges: |
"My server is running 24/7, so I don't have any scenario where I am powering off BOINC. And since ALL my MIP1 tasks (multiple every day) are failing on this Linux server with the same error message, I don't think the problem comes from powering off the BOINC process badly. mdxi: I see the same on my server."
OK, not the best of suggestions. I am going to change one of my Linux machines to MIP and see if I can replicate your problems.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My machines run 24/7, excepting kernel upgrades (which I do roughly biweekly), which require a full OS reboot. There is no hibernation or daily startup. Also, as previously stated:
Might it be due to an upgrade to something like glibc that is not compatible with the work unit? It seems plausible that if you are experiencing 100% failure and the work itself were the problem, everyone else would have the failure too. Have you tried booting into a previous version of the kernel environment to see if the errors still occur? |
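One way to follow up on that suggestion is to record the kernel and C library versions on a failing host and on a working one, then compare them. Here is a minimal sketch using only Python's standard library; the function name is made up for illustration, and the output only reports versions, it does not show which one (if any) is responsible.

```python
import platform
from typing import Dict


def environment_summary() -> Dict[str, str]:
    """Collect the kernel release and C library version of this host."""
    # libc_ver() inspects the running Python binary; on glibc systems it
    # returns something like ("glibc", "<version>"), otherwise empty strings.
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": platform.release(),
        "libc": f"{libc_name} {libc_version}".strip(),
        "machine": platform.machine(),
    }


summary = environment_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```

Running it on each machine and posting the output side by side would show whether the failing hosts share a library or kernel version that the working ones lack.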
||
|
|