Thread Status: Active
Total posts in this thread: 70
This topic has been viewed 24041 times and has 69 replies.
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Re: Lots of MIP1 WUs error out

There seem to be a lot of MIP1 jobs with one long structure at present, and I notice progress for these seeming to stop at 40% for 10 to 15 minutes before going up to 60% and sticking at that value until the job finishes (at which point it shows 100% as expected).

Thanks, I have never noticed it. That is because I run my machines 24/7 and don't look at the progress very often.

But my Ryzen 3600 (Ubuntu 18.04.5) runs four MIP1 tasks at a time, and over the past seven days they have averaged 50 minutes each. The longest one took 2 hours 56 minutes. But I now see one of them "stuck" at 7.500%, and two others at 30.000%. I never paid much attention to it before.

(No errors though.)
----------------------------------------
[Edit 1 times, last edit by Jim1348 at Sep 12, 2020 5:00:46 PM]
[Sep 12, 2020 4:56:10 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Lots of MIP1 WUs error out

app_config.xml

<app_config>
  <!-- entries for other apps omitted -->
  <app>
    <name>mip1</name>
    <max_concurrent>2</max_concurrent>
    <fraction_done_exact/>
  </app>
</app_config>
I am not paying much attention either, but the fraction_done_exact flag is supposed to force progress to be computed from the FPOPS estimate in the header versus the actual computing time so far. At the moment I have one that has been at 40.000% for 40 minutes without a logged checkpoint.
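
As a rough illustration of what a stalled percentage does to a run-time estimate, here is a minimal sketch. It assumes (my assumption, not something confirmed in this thread) that the total run time is extrapolated linearly as elapsed time divided by the fraction done reported by the app. The function name and the `fallback_s` argument are hypothetical and are not BOINC code.

```python
def estimated_total_runtime(elapsed_s: float, fraction_done: float,
                            fallback_s: float) -> float:
    """Extrapolate total run time from the fraction done reported by the app.

    Assumption: a pure linear extrapolation, elapsed / fraction_done.
    fallback_s stands in for whatever estimate is used before the app
    has reported any progress.
    """
    if fraction_done >= 1.0:
        return elapsed_s
    if fraction_done <= 0.0:
        return fallback_s
    return elapsed_s / fraction_done


# A task stuck at 40% keeps pushing the estimate up as time passes.
for minutes in (20, 40, 60):
    total = estimated_total_runtime(minutes * 60, 0.40, fallback_s=3000)
    print(f"{minutes} min elapsed at 40%: estimated total {total / 60:.0f} min")
```

Under that assumption, a task that stays at 40.000% for 40 minutes would show an estimated total of about 100 minutes, and the estimate would keep growing until the percentage moves again.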

PS. I never get an error for MIP1 except occasionally after booting.
----------------------------------------
[Edit 1 times, last edit by Former Member at Sep 12, 2020 5:20:10 PM]
[Sep 12, 2020 5:19:08 PM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Re: Lots of MIP1 WUs error out


There seem to be a lot of MIP1 jobs with one long structure at present, and I notice progress for these seeming to stop at 40% for 10 to 15 minutes before going up to 60% and sticking at that value until the job finishes (at which point it shows 100% as expected).

This isn't what I'm seeing. When I turn MIP1 back on (like this morning), I still get segfaults for every single MIP1 (and only MIP1) WU that tries to run. They *do* last report progress at 60%, but they are not completing. They do not upload; they do not appear in the joblog; they do show up in coredumpctl. These jobs are still segfaulting, 100% of the time.

It's also not a memory issue. Not as in memory capacity, anyway; there's plenty of that to go around on the machines in question. Also, Rosetta@Home jobs are running on the same machines and are fine.

@uplinger I have a core file if that would help.
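
For anyone who wants to run the same joblog check, here is a rough sketch. It assumes each line of the client's job log is a timestamp followed by key/value pairs, with `nm` holding the task name and `es` the exit status. The sample lines and task names are made up for illustration, and the `MIP1_` prefix is an assumption about how the tasks are named.

```python
def parse_job_log_line(line: str) -> dict:
    """Split one job log line into a dict.

    Assumed layout: "<timestamp> key value key value ...".
    """
    fields = line.split()
    record = {"timestamp": int(fields[0])}
    # Remaining fields alternate between a short key and its value.
    for key, value in zip(fields[1::2], fields[2::2]):
        record[key] = value
    return record


def tasks_with_prefix(lines, prefix):
    """Return parsed records whose task name starts with prefix."""
    records = [parse_job_log_line(l) for l in lines if l.strip()]
    return [r for r in records if r.get("nm", "").startswith(prefix)]


# Hypothetical sample data, not taken from a real log.
sample = [
    "1600650000 ue 3000.0 ct 2900.5 fe 1.5e13 nm MIP1_example_0001_0 et 3010.2 es 0",
    "1600653000 ue 4000.0 ct 3800.1 fe 2.0e13 nm OTHER_example_0002_1 et 3900.7 es 0",
]
print(len(tasks_with_prefix(sample, "MIP1_")))
```

Running this against the real log lines instead of `sample` would show whether any MIP1 task ever got as far as being logged.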
----------------------------------------
[Edit 1 times, last edit by mdxi at Sep 21, 2020 1:01:15 AM]
[Sep 21, 2020 1:00:59 AM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 325
Re: Lots of MIP1 WUs error out

The failure of MIP1 work units when starting up a PC is a long-standing problem. One way to avoid it is to hibernate the PC instead of shutting it down overnight. Hibernating even avoids losing the work done since the last checkpoint.
[Sep 21, 2020 7:12:25 AM]
julemand101
Cruncher
Denmark
Joined: Feb 28, 2020
Post Count: 7
Re: Lots of MIP1 WUs error out

Is there any new status on this issue? My Arch Linux server is still failing with the same error. If you need any technical details or help with testing some ideas, please let me know.
[Sep 29, 2020 8:53:11 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7579
Re: Lots of MIP1 WUs error out

One thing that may be tried is to shut down or snooze BOINC before shutting down the machine. Once you restart, simply unsnooze BOINC (if you have it set to start automatically when your machine resumes) or restart BOINC (if your machine does not autostart BOINC).
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 29, 2020 2:43:38 PM]
mdxi
Advanced Cruncher
Joined: Dec 6, 2017
Post Count: 109
Re: Lots of MIP1 WUs error out

My machines run 24/7, excepting kernel upgrades (which I do roughly biweekly), which require a full OS reboot. There is no hibernation or daily startup. Also, as previously stated:

  • These machines were successfully crunching MIP workunits for two years before this issue arose
  • These aren't Windows machines
  • There is no change in hardware or usage pattern; all indications point to software
  • It's not "some" WUs failing under "some" circumstances; it is 100% failure, all in the exact same way

[Sep 29, 2020 2:52:43 PM]
julemand101
Cruncher
Denmark
Joined: Feb 28, 2020
Post Count: 7
Re: Lots of MIP1 WUs error out

My server is running 24/7, so I don't have any scenario where I am powering off BOINC. And since ALL my MIP1 tasks (multiple every day) are failing on this Linux server with the same error message, I don't think the problem comes from powering off the BOINC process badly.

mdxi: I see the same on my server.
----------------------------------------
[Edit 1 times, last edit by julemand101 at Sep 29, 2020 5:57:35 PM]
[Sep 29, 2020 5:56:51 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7579
Re: Lots of MIP1 WUs error out

My server is running 24/7, so I don't have any scenario where I am powering off BOINC. And since ALL my MIP1 tasks (multiple every day) are failing on this Linux server with the same error message, I don't think the problem comes from powering off the BOINC process badly.

mdxi: I see the same on my server.

OK, not the best of suggestions. I am going to change one of my Linux machines to MIP and see if I can replicate your problems.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Sep 29, 2020 8:06:09 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: Lots of MIP1 WUs error out

My machines run 24/7, excepting kernel upgrades (which I do roughly biweekly), which require a full OS reboot. There is no hibernation or daily startup. Also, as previously stated:

  • These machines were successfully crunching MIP workunits for two years before this issue arose
  • These aren't Windows machines
  • There is no change in hardware or usage pattern; all indications point to software
  • It's not "some" WUs failing under "some" circumstances; it is 100% failure, all in the exact same way

Might it be due to an upgrade of something like glibc that is no longer compatible with the work unit application? It seems plausible that if you are seeing 100% failure and the work units themselves were the problem, everyone else would see the failures too. Have you tried booting into a previous kernel version to see if the errors still occur?
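
A quick way to record the versions worth comparing before and after an upgrade is a small sketch like this one, using Python's standard `platform` module. It only reports what is installed on the host. It does not test whether the MIP1 binary is compatible with it, and the function name is just for illustration.

```python
import platform


def runtime_environment() -> dict:
    """Collect the kernel release and C library version of this host."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": platform.release(),
        "libc": libc_name,            # typically "glibc" on Linux; may be empty elsewhere
        "libc_version": libc_version,
    }


print(runtime_environment())
```

Saving this output from a machine where MIP1 works and from one where it segfaults would give a concrete pair of versions to compare.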
[Sep 30, 2020 2:23:19 AM]