| World Community Grid Forums
Thread Status: Active Total posts in this thread: 22
Ian-n-Steve C.
Senior Cruncher United States Joined: May 15, 2020 Post Count: 180 Status: Offline Project Badges:
|
Quote: "BobbyB, Thanks for the reply. I think a graphic is worth a million words, but I cannot figure out how to insert a graphic into this space, so I will try to explain what happens. When I rebooted my computer just a few minutes ago, 16 work units were between 5 percent and 70 percent complete. After rebooting, ALL of the work units restarted at 0 percent. The work units have the same names, so they were not deleted or uploaded; they just started over at the beginning. This started in the last week or so. dondee"

Don't reboot the computer until they are finished.

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
||
|
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 1320 Status: Offline Project Badges:
|
Quote: "Falconet, Thanks for the reply. This has started in the last week or so. I am running only mips at this time. dondee"

As we don't know what your hardware is, we're all speculating(!), so some of the below may be irrelevant. However, on the assumption that you aren't running an extremely powerful system...

MIP1 tasks do a lot of widely scattered memory access, which results in a lot of level 3 (L3) cache misses; running multiple MIP1 tasks at the same time can make your machine memory-bound as a result, causing the tasks to take much longer than they would if fewer were run at the same time. Combine that with the recent run of "no checkpoints" jobs Falconet referenced, and it could well be that none of your MIP1 jobs checkpoint in the time between reboots.

The rule of thumb for MIP1 is to run only one task for each 4 or 5 MB of L3 cache your machine has. It really does make a difference... (There were discussions about this in the MIP1 forum quite a while ago...) Running a few MCM1 tasks (or SCC1 if it ever comes back) alongside MIP1 gives a good mix, as MCM1 is compute-intensive and easy on memory!

The point raised about reboots is also valid to an extent: if you can persuade your machine to hibernate when powering it off, rather than shutting it down and rebooting, you'll be far less likely to lose work done. Not all machines play nice with hibernation, but it's worth the possible hassle if you can get it to work... (Note: hibernate, not suspend!) However, if you're rebooting because you have a dual-boot system and want to switch operating systems, hibernation probably won't work anyway!

Hope this helps - Al.

[Edited to add dual-boot comment...]
[Edit 1 times, last edit by alanb1951 at Apr 16, 2021 4:28:58 PM]
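The L3 rule of thumb above is easy to turn into a number. The following is only a rough sketch: the function name is made up for illustration, and the 16 MB figure is an example value, not a measurement of anyone's machine in this thread.

```python
from typing import Tuple


def mip1_task_range(l3_cache_mb: float) -> Tuple[int, int]:
    """Return (cautious, generous) MIP1 task counts for a given L3 cache size.

    Applies the rule of thumb of one MIP1 task per 4 to 5 MB of L3 cache:
    the cautious figure assumes 5 MB per task, the generous one 4 MB per task.
    """
    cautious = int(l3_cache_mb // 5)
    generous = int(l3_cache_mb // 4)
    return cautious, generous


# Example value only: a CPU with 16 MB of L3 cache in total.
low, high = mip1_task_range(16)
print(f"Run roughly {low} to {high} MIP1 tasks at once")
```

Any remaining threads could then be given to a lighter project such as MCM1, as suggested in the post.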
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
Quote: "don't reboot the computer until they are finished."

That's not going to work, since as one finishes another starts, so it will never get to a point where they are all finished unless you exhaust the supply of WUs from WCG. Then it must be this:

Quote: "I believe there used to be MIP tasks that only had 1 structure to resolve and thus never even checkpointed. Others had 2 structures to resolve and only checkpointed once after finishing the first structure."

Even this does not make complete sense, since I see my MIPs checkpoint every 5-6 minutes. I'll check again to make very sure. One way to test would be if both of us were doing the same MIP WU, but I think that may be hard to arrange. I could ask that you check the "CPU time since checkpoint", but since they start at zero then obviously they must not checkpoint. But check anyway, to be sure, the next time you reboot, especially the ones that are well into it, like 50%+.

[Edit 2 times, last edit by BobbyB at Apr 16, 2021 4:39:06 PM]
||
|
|
Ian-n-Steve C.
Senior Cruncher United States Joined: May 15, 2020 Post Count: 180 Status: Offline Project Badges:
|
Quote: "don't reboot the computer until they are finished."

Quote: "That's not going to work since as one finishes another starts so it will never get to a point where they are all finished unless you exhaust the supply of WUs from WCG."

Keep a very small cache. Set NNT (No New Tasks) shortly before you want to reboot. Allow the running tasks to finish. Then reboot or shut down.

[Edit 1 times, last edit by Ian-n-Steve C. at Apr 16, 2021 4:48:14 PM]
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
OK, so I checked: I have MIPs which checkpoint every 5-6 minutes and some which do not. Across all 4 machines there are more which do not checkpoint. I don't have tons of MIPs.

Since you only do MIP, the ones you have running when you reboot must be those which do not checkpoint. The test is to find one which does checkpoint and reboot to see whether it starts at zero or not. How to tell which MIP does which is unknown to me.

As noted above, you may need to schedule your reboots so as not to lose work done. I only have one thing to add to the above post: set No New Tasks in the Projects tab, then suspend all tasks waiting to start in the Tasks tab, then let all active tasks finish, and then reboot.

What I learnt from this thread is to also check MIPs along with ARPs on those rare occasions when I want to reboot.

[Edit 4 times, last edit by BobbyB at Apr 16, 2021 5:51:47 PM]
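For anyone who prefers the command line to the Manager tabs, the "No New Tasks, let running tasks finish, then reboot" routine can be sketched around `boinccmd`. This is a sketch under assumptions, not a tested procedure: the helper names are made up, the project URL must be copied from your own client (for example from `boinccmd --get_project_status`), and the check assumes that `boinccmd --get_tasks` prints a line reading `active_task_state: EXECUTING` for each running task.

```python
import subprocess
from typing import List


def no_new_tasks_cmd(project_url: str) -> List[str]:
    """Command that sets No New Tasks for one project."""
    return ["boinccmd", "--project", project_url, "nomorework"]


def allow_new_tasks_cmd(project_url: str) -> List[str]:
    """Command that re-enables work fetch after the reboot."""
    return ["boinccmd", "--project", project_url, "allowmorework"]


def count_executing(get_tasks_output: str) -> int:
    """Count running tasks in the text printed by `boinccmd --get_tasks`.

    Assumption: each running task is reported with a line that reads
    'active_task_state: EXECUTING'.
    """
    return sum(
        1
        for line in get_tasks_output.splitlines()
        if line.strip() == "active_task_state: EXECUTING"
    )


def safe_to_reboot(get_tasks_output: str) -> bool:
    """True once no task is executing any more."""
    return count_executing(get_tasks_output) == 0


def run(cmd: List[str]) -> str:
    """Run a boinccmd command and return its standard output (not called here)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

In use, one would call `run(no_new_tasks_cmd(url))`, poll `safe_to_reboot(run(["boinccmd", "--get_tasks"]))` until it returns True, reboot, and then call `run(allow_new_tasks_cmd(url))`. Tasks that are downloaded but not yet started would still need to be suspended separately, as described in the post.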
||
|
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 328 Status: Offline Project Badges:
|
From memory, the number of checkpoints in the job can be deduced once the work unit begins to run.
If you look in the stderr file in the slots directory for the executing work unit, there is a parameter 'nstruct=nn' which I believe is the number of structures (and hence checkpoints) to be run. Unfortunately, I do not have any MIP work units available to confirm the above.
||
|
|
Brian Nixon
Cruncher United Kingdom Joined: Oct 27, 2020 Post Count: 9 Status: Offline Project Badges:
|
Quote: "parameter 'nstruct=nn'"

That’s also visible, even before a task starts, in its <command_line> element in BOINC’s client_state.xml.
||
|
|
dondee
Advanced Cruncher Joined: Jan 16, 2006 Post Count: 100 Status: Offline Project Badges:
|
Hello All,

Thanks for all of the responses. I will try to address each one as best I can.

My machines are two Ryzen 1700s with 16 GB of RAM. I tried to run MIPs before and had to reduce the number of threads from 16 to 8 because of the work unit errors. This time I seem to have fewer errors, so I decided to stay with it a while.

One of my computers is a dual-boot system and I have to access the other drive on occasion. Also, the kernel has to be updated by rebooting when a new one is downloaded to finish installation.

Checking the properties for three MIPs revealed checkpoints for each. In the near future I will go through each one, note the checkpoint time, and reboot to see what happens. For example:

CPU time 05:20:14
CPU time since checkpoint 00:35:42

The search on my machine for nstruct=nn reveals nothing. Is this for a Linux-based computer?

dondee
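The two times quoted in the post give a quick estimate of what a reboot would cost if the task resumes from its last checkpoint. A small sketch of the arithmetic (the helper name is made up for illustration):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string, as shown in task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


cpu_time = hms_to_seconds("05:20:14")          # total CPU time so far
since_checkpoint = hms_to_seconds("00:35:42")  # work done after the last checkpoint

# Share of the CPU time that has not been saved by a checkpoint yet.
lost_fraction = since_checkpoint / cpu_time
print(f"{since_checkpoint} s of {cpu_time} s at risk ({lost_fraction:.1%})")
```

If the task really restarts from the checkpoint, about 36 minutes of the 5 hours 20 minutes would be redone; a restart from 0 percent would instead lose the whole 5 hours 20 minutes, which is what the planned reboot test should distinguish.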
||
|
|
Brian Nixon
Cruncher United Kingdom Joined: Oct 27, 2020 Post Count: 9 Status: Offline Project Badges:
|
It’s just nstruct; there’s no ‘=’, and ‘nn’ is a number (which I’ve seen in the range 1–30). I’m on Windows, but I don’t know any reason it would be different on Linux.

Sample excerpt from client_state.xml; I’ve snipped a lot of irrelevant detail, but note the -nstruct 2:

<workunit>

[Edit 1 times, last edit by Brian Nixon at Apr 17, 2021 7:53:13 AM]
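Reading the -nstruct value for every work unit can also be scripted. The following is a minimal sketch, assuming that client_state.xml parses as ordinary XML and that each <workunit> holds a <name> next to its <command_line>; the sample document and work unit names below are invented for illustration and are not taken from a real client.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Invented sample in the assumed shape of client_state.xml.
SAMPLE_STATE = """<client_state>
  <workunit>
    <name>EXAMPLE_WU_A</name>
    <command_line>-nstruct 2</command_line>
  </workunit>
  <workunit>
    <name>EXAMPLE_WU_B</name>
    <command_line>-nstruct 1</command_line>
  </workunit>
  <workunit>
    <name>EXAMPLE_WU_C</name>
    <command_line></command_line>
  </workunit>
</client_state>"""


def nstruct_by_workunit(xml_text: str) -> Dict[str, int]:
    """Map work unit name to the number that follows -nstruct, where present."""
    result: Dict[str, int] = {}
    root = ET.fromstring(xml_text)
    for workunit in root.iter("workunit"):
        name = (workunit.findtext("name") or "").strip()
        tokens = (workunit.findtext("command_line") or "").split()
        if "-nstruct" in tokens:
            position = tokens.index("-nstruct")
            # Only accept a plain number directly after the flag.
            if position + 1 < len(tokens) and tokens[position + 1].isdigit():
                result[name] = int(tokens[position + 1])
    return result


for wu_name, nstruct in nstruct_by_workunit(SAMPLE_STATE).items():
    print(wu_name, nstruct)
```

Going by the earlier posts, a value of 1 would suggest a task that never checkpoints, so those are the ones most worth letting finish before a reboot. On a real machine the text would come from the client_state.xml file in the BOINC data directory, whose location varies by installation.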
||
|
|
BobbyB
Veteran Cruncher Canada Joined: Apr 25, 2020 Post Count: 638 Status: Offline Project Badges:
|
Quote: "had to reduce the number of threads from 16 to 8 because of the work unit errors"

Why not run other WUs at the same time, like MCM or OPN? I understand no ARP. Or limit MIPs to 8 in your profile and let the rest go (no ARP).

Quote: "Also, the kernel has to be updated by rebooting when a new one is downloaded to finish installation."

I turned off the automatic updates on my Ubuntu machines. They just do WCG, so what am I updating? Software here and there which is not being used. In this situation I also see no reason to even update the kernel. It works for my needs. If I need to shut down or reboot one, then I manually start the update process.

[Edit 3 times, last edit by BobbyB at Apr 17, 2021 3:52:50 PM]