Thread Status: Active
Total posts in this thread: 22
This topic has been viewed 5413 times and has 21 replies
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline
Re: let's restart all of the running work units for the fun of it

BobbyB,
Thanks for the reply. I think a graphic is worth a million words.
But I cannot figure out how to insert a graphic into this space.
So I will try to explain what happens. When I rebooted my computer, just a few minutes ago, 16 work units varied from 5 percent to 70 percent completed. After rebooting ALL of the work units are starting at 0 percent. The work units have the same names so the work units did not delete, upload or whatever, they just started over at the beginning.
This started in the last week or so.
dondee


don't reboot the computer until they are finished.
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Apr 16, 2021 2:57:25 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 1320
Status: Offline
Re: let's restart all of the running work units for the fun of it

Falconet,
Thanks for the reply. This has started in the last week or so.
I am running only mips at this time.
dondee

As we don't know what your hardware is, we're all speculating(!) so some of the below may be irrelevant. However, on the assumption that you aren't running an extremely powerful system...

MIP1 tasks do a lot of widely-scattered memory access, which results in a lot of level 3 (L3) cache misses; running multiple MIP1 tasks at the same time can make your machine memory-bound as a result, causing the tasks to take much longer than they would if fewer were run at the same time. Combine that with a recent run of "no checkpoints" jobs Falconet referenced and it could well be that none of your MIP1 jobs checkpoint in the time between re-boots.

The rule of thumb for MIP1 is to only run one task for each 4 or 5 MB of L3 cache your machine has. It really does make a difference... (There were discussions about this in the MIP1 forum quite a while ago...) Running a few MCM1 tasks (or SCC1 if it ever comes back) alongside MIP1 gives a good mix as MCM1 is compute-intensive and easy on memory!
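As a worked example of that rule of thumb, here is a minimal sketch. The function name and the sample cache sizes are made up for illustration, and the 4 to 5 MB figure is only the rule of thumb quoted above, not a measured limit.

```python
def max_mip1_tasks(l3_cache_mb: float, mb_per_task: float = 4.0) -> int:
    """Rough upper bound on concurrent MIP1 tasks for a given L3 cache size.

    mb_per_task is the rule-of-thumb budget (4 to 5 MB of L3 per task).
    Always returns at least 1 so that a small cache still runs one task.
    """
    if l3_cache_mb <= 0 or mb_per_task <= 0:
        raise ValueError("cache size and per-task budget must be positive")
    return max(1, int(l3_cache_mb // mb_per_task))


# Hypothetical L3 sizes in MB; check your own CPU's specification.
for l3 in (8, 16, 32):
    cautious = max_mip1_tasks(l3, 5.0)   # 5 MB per task
    generous = max_mip1_tasks(l3, 4.0)   # 4 MB per task
    print(f"{l3} MB L3: run about {cautious} to {generous} MIP1 tasks")
```

The remaining threads can then go to something lighter on memory, such as MCM1.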

The point raised about reboots is also valid to an extent - if you can persuade your machine to hibernate when powering it off, rather than shutting it down and rebooting, you'll be far less likely to lose work done. However, not all machines play nice with hibernation, but it's worth the possible hassle if you can get it to work... (Note - hibernate, not suspend!) However, if you're rebooting because you have a dual-boot system and want to switch Operating Systems, the hibernate probably won't work anyway!

Hope this helps - Al.

[Edited to add dual-boot comment...]
----------------------------------------
[Edit 1 times, last edit by alanb1951 at Apr 16, 2021 4:28:58 PM]
[Apr 16, 2021 4:25:39 PM]
BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 638
Status: Offline
Re: let's restart all of the running work units for the fun of it

don't reboot the computer until they are finished.
That's not going to work since as one finishes another starts so it will never get to a point where they are all finished unless you exhaust the supply of WUs from WCG.

Then it must be this:
I believe there used to be MIP tasks that only had 1 structure to resolve and thus never even checkpointed. Others had 2 structures to resolve and only checkpointed once after finishing the first structure.

Even this does not make complete sense since I see my MIPs checkpoint every 5-6 minutes. I'll check again to make very sure.

One way to test would be if both of us are doing the same MIP WU. I think this may be hard to perform.

I could ask you to check the CPU time since checkpoint, but since they start at zero they obviously must not be checkpointing. Check anyway, to be sure, the next time you reboot, especially the ones that are well along, like 50%+.
----------------------------------------
[Edit 2 times, last edit by BobbyB at Apr 16, 2021 4:39:06 PM]
[Apr 16, 2021 4:37:31 PM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline
Re: let's restart all of the running work units for the fun of it

don't reboot the computer until they are finished.
That's not going to work since as one finishes another starts so it will never get to a point where they are all finished unless you exhaust the supply of WUs from WCG.

keep a very small cache. set NNT (No New Tasks) shortly before you want to reboot. allow tasks to finish. reboot/shut down.
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
----------------------------------------
[Edit 1 times, last edit by Ian-n-Steve C. at Apr 16, 2021 4:48:14 PM]
[Apr 16, 2021 4:46:50 PM]
BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 638
Status: Offline
Re: let's restart all of the running work units for the fun of it

OK so I checked and I have MIPs which checkpoint at 5-6 minutes and some which do not. Across all 4 machines there are more which do not checkpoint. I don't have tons of MIPs.

Since you only do MIP then you must have those which do not checkpoint when you reboot. The test is to find one which does checkpoint and reboot to see if it starts at zero or not.

How one can tell in advance which MIP does what, I don't know.

As noted above you may need to schedule your reboots so as to not lose work done.

I only have one thing to add to the above post: set No New Tasks in the Projects tab, then suspend all tasks waiting to start in the Tasks tab, then let all active tasks finish, and then reboot.
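The same sequence can also be scripted with boinccmd, BOINC's command-line tool. This is a minimal sketch, assuming boinccmd is on the PATH and that the project URL matches what `boinccmd --get_project_status` reports on your machine. The URL and task name below are placeholders, and the helper names are made up for illustration. By default it only prints the commands rather than running them.

```python
import subprocess
from typing import List

# Assumption: replace with the exact project URL your client reports.
PROJECT_URL = "https://www.worldcommunitygrid.org/"


def no_new_tasks_cmd(project_url: str) -> List[str]:
    """Command that sets No New Tasks for one project."""
    return ["boinccmd", "--project", project_url, "nomorework"]


def allow_new_tasks_cmd(project_url: str) -> List[str]:
    """Command that lets the project fetch work again (after the reboot)."""
    return ["boinccmd", "--project", project_url, "allowmorework"]


def suspend_task_cmd(project_url: str, task_name: str) -> List[str]:
    """Command that suspends one task that has not started yet."""
    return ["boinccmd", "--task", project_url, task_name, "suspend"]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Before the reboot: stop fetching work, then park a waiting task.
run(no_new_tasks_cmd(PROJECT_URL))
run(suspend_task_cmd(PROJECT_URL, "EXAMPLE_TASK_NAME"))  # hypothetical task name
# After the reboot: resume fetching work.
run(allow_new_tasks_cmd(PROJECT_URL))
```

Suspended tasks have to be resumed afterwards, either in the Tasks tab or with the `resume` operation in place of `suspend`.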

What I learnt from this thread is to also check MIPs along with ARPs on those rare occasions when I want to reboot.
----------------------------------------
[Edit 4 times, last edit by BobbyB at Apr 16, 2021 5:51:47 PM]
[Apr 16, 2021 5:26:39 PM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 328
Status: Offline
Re: let's restart all of the running work units for the fun of it

From memory the number of checkpoints in the job can be deduced once the work unit begins to run.
If you look in the stderr file in the slots directory for the executing work unit, there is a parameter 'nstruct=nn' which I believe is the number of structures (and hence checkpoints) to be run.
Unfortunately, I do not have any MIP work units available to confirm the above.
[Apr 16, 2021 7:22:41 PM]
Brian Nixon
Cruncher
United Kingdom
Joined: Oct 27, 2020
Post Count: 9
Status: Offline
Re: let's restart all of the running work units for the fun of it

parameter 'nstruct=nn'
That’s also visible even before a task starts in its <command_line> element in BOINC’s client_state.xml
[Apr 16, 2021 9:09:35 PM]
dondee
Advanced Cruncher
Joined: Jan 16, 2006
Post Count: 100
Status: Offline
Re: let's restart all of the running work units for the fun of it

Hello All,
Thanks for all of the responses.
I will try to address each one as best I can.

My machines are two Ryzen 1700s with 16 GB of RAM. I tried to run MIPs before and had to reduce the number of threads from 16 to 8 because of work unit errors. This time I seem to have fewer errors, so I decided to stay with it a while.

One of my computers is a dual boot system and I have to access the other drive on occasion. Also, the kernel has to be updated by rebooting when a new one is downloaded to finish installation.

Checking the properties for three MIPs revealed checkpoints for each. In the near future I will go through each one, note the checkpoint time, and reboot to see what happens.
CPU time
05:20:14
CPU time since checkpoint
00:35:42
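To put those two figures in perspective, a minimal sketch of how much work a reboot at that moment would throw away, assuming the task resumes from its last checkpoint. The helper name is made up for illustration.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


cpu_time = hms_to_seconds("05:20:14")          # total CPU time so far
since_checkpoint = hms_to_seconds("00:35:42")  # CPU time since last checkpoint

# Work done after the last checkpoint is what a restart would lose.
lost_fraction = since_checkpoint / cpu_time
print(f"{since_checkpoint} s of {cpu_time} s at risk ({lost_fraction:.1%})")
```

A task that has never checkpointed would show the two values equal, and a restart would lose all of it.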

A search on my machine for nstruct=nn finds nothing. Is this for a Linux-based computer?
dondee
[Apr 17, 2021 12:12:57 AM]
Brian Nixon
Cruncher
United Kingdom
Joined: Oct 27, 2020
Post Count: 9
Status: Offline
Re: let's restart all of the running work units for the fun of it

It’s just nstruct; there’s no ‘=’, and ‘nn’ is a number (which I’ve seen in the range 1–30). I’m on Windows, but I don’t know any reason it would be different on Linux.

Sample excerpt from client_state.xml; I’ve snipped a lot of irrelevant detail, but note the -nstruct 2:
<workunit>
    <name>MIP1_00332248_8134</name>
    <command_line>
-in::file::zip MIP1_databasev2.zip @./MIP1_00332248.flags -out::file::silent result_silent.out -run:jran 1150681919 -nstruct 2 -out::level 100 -run::no_scorefile true
    </command_line>
</workunit>
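
Rather than reading the file by eye, the value can be pulled out with a short script. This is a minimal sketch, assuming the file parses as ordinary XML and that each <workunit> has <name> and <command_line> children as in the excerpt. The function name is made up for illustration, and the sample text is a cut-down stand-in for a real client_state.xml.

```python
import re
import xml.etree.ElementTree as ET

# Matches "-nstruct" followed by whitespace and an integer.
NSTRUCT_RE = re.compile(r"-nstruct\s+(\d+)")


def nstruct_by_workunit(xml_text: str) -> dict:
    """Map each work unit name to the number given after -nstruct."""
    root = ET.fromstring(xml_text)
    found = {}
    for workunit in root.iter("workunit"):
        name = (workunit.findtext("name") or "").strip()
        command_line = workunit.findtext("command_line") or ""
        match = NSTRUCT_RE.search(command_line)
        if name and match:
            found[name] = int(match.group(1))
    return found


# Stand-in for the file contents, built from the excerpt above.
SAMPLE = """<client_state>
<workunit>
    <name>MIP1_00332248_8134</name>
    <command_line>
-in::file::zip MIP1_databasev2.zip @./MIP1_00332248.flags -out::file::silent result_silent.out -run:jran 1150681919 -nstruct 2 -out::level 100 -run::no_scorefile true
    </command_line>
</workunit>
</client_state>"""

print(nstruct_by_workunit(SAMPLE))  # {'MIP1_00332248_8134': 2}
```

To run it against a real file, read client_state.xml from the BOINC data directory and pass its text to the function. Work units without an -nstruct flag (other projects) are simply skipped.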

----------------------------------------
[Edit 1 times, last edit by Brian Nixon at Apr 17, 2021 7:53:13 AM]
[Apr 17, 2021 7:52:02 AM]
BobbyB
Veteran Cruncher
Canada
Joined: Apr 25, 2020
Post Count: 638
Status: Offline
Re: let's restart all of the running work units for the fun of it

had to reduce the number of threads from 16 to 8 because of the work unit errors
Why not run other WUs at the same time, like MCM or OPN? I understand no ARP.
Or limit MIPs to 8 in your profile and let the rest go to other projects (no ARP).

Also, the kernel has to be updated by rebooting when a new one is downloaded to finish installation.
I turned off the automatic updates on my Ubuntu machines. They just do WCG, so what am I updating? Software here and there which is not being used. In this situation I also see no reason to even update the kernel; it works for my needs. If I need to shut down or reboot one, then I manually start the update process.
----------------------------------------
[Edit 3 times, last edit by BobbyB at Apr 17, 2021 3:52:50 PM]
[Apr 17, 2021 3:44:07 PM]