| World Community Grid Forums
|
|
Thread Status: Active Total posts in this thread: 8
|
|
| Author |
|
|
NixChix
Veteran Cruncher United States Joined: Apr 29, 2007 Post Count: 1187 Status: Offline Project Badges:
|
I just checked on my computer and found it in panic mode, even though my queue is set to only 1.5 days. I found 14 jobs had been suspended, with perhaps 24 hours of uncheckpointed work.
I tried to intervene manually so that the jobs with large amounts of uncheckpointed work and low time-to-completion could run, only to see the elapsed time revert to the last checkpoint. The suspended jobs had been sitting in memory.
----------------------------------------
Cheers |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Having many tasks preempted in panic state is not supposed to happen with client version 7.
|
||
|
|
NixChix
Veteran Cruncher United States Joined: Apr 29, 2007 Post Count: 1187 Status: Offline Project Badges:
|
I only recently noticed that WCG has moved up to 7.2.47 from 6.10.58, and I haven't upgraded yet. I'll have to do that very soon. What I was concerned about, Rob, was the loss of all the elapsed time when the tasks were resumed, even though they were still in memory. I watched about 24 hours of work vanish.
----------------------------------------
Cheers |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I ignored the loss part as it was rather late for me, and gave the key fix for the retired version you would have had to be running. Why BOINC unloaded those 14 even though LAIM (Leave Applications In Memory) was on would be speculation, 6.10.58 now belonging to the ancients. Jobs in panic state get preempted, as you noticed, without waiting for a checkpoint, and are even forced to stay in memory without LAIM if they haven't checkpointed at all. Maybe it was a memory shortage, but even in that case the client is supposed to park cores with 'waiting for memory'.
There are a few more reports of OET losing progress, as if LAIM were not on, and I'm seeing it on Android myself. I killed a 333 long one because it had run 12+ hours and was heading for 100 hours of total runtime; it was at 10 percent, a sure sign it had not passed the first checkpoint at 12.5%, yet when I looked again it was back to 0%. [Going to NativeBOINC on Android soon for multiple control/configuration reasons.] Otherwise, on the 2 computers I've seen nothing of the sort, having done > 3000 of these, but then there was no panic, as I keep a 1 day buffer with the 7.4 releases. 7.2.47 has several advantages that prevent this issue from even developing. Most prominently, it ignores the client DCF (duration correction factor) function, so runtime projections are an average of what the server knows and are not driven by an incidental long task that can blow the buffer totally out of whack, giving you that experience. Also, v7 stops trying to find shorter tasks in panic state. I think [but am unsure] the max is equal to the number of cores, so a quad would not see more than 4. |
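To make the DCF point concrete, here is a minimal sketch of how a client-side correction factor scales the projected runtime of everything in the queue. The task counts, hour estimates and the DCF value of 30 are hypothetical numbers chosen for illustration, and the function names are made up for this example; this is not the BOINC client's actual code.

```python
def estimated_runtime_hours(server_estimate_hours: float, dcf: float) -> float:
    """Client-side projection: the server's estimate scaled by the DCF."""
    return server_estimate_hours * dcf


def queue_days(task_estimates_hours, dcf, cores):
    """Projected days of queued work when spread evenly over the cores."""
    total = sum(estimated_runtime_hours(h, dcf) for h in task_estimates_hours)
    return total / cores / 24.0


# Hypothetical queue: 12 tasks the server estimates at 4 hours each, on a quad core.
tasks = [4.0] * 12
normal = queue_days(tasks, dcf=1.0, cores=4)     # DCF held at 1.0
inflated = queue_days(tasks, dcf=30.0, cores=4)  # DCF inflated by one incidental long task

print(f"DCF 1.0:  {normal:.2f} days of work")
print(f"DCF 30.0: {inflated:.2f} days of work")
```

With the DCF held at 1.0 the same queue fits comfortably in a 1 to 1.5 day buffer; with an inflated DCF the client believes it holds many times more work than it can finish, which is the kind of situation that sends it into high priority mode.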
||
|
|
NixChix
Veteran Cruncher United States Joined: Apr 29, 2007 Post Count: 1187 Status: Offline Project Badges:
|
I just updated to BOINC 7.2.47 this morning. Since the last post, the client was still acting strangely, as it still thought that the small queue of jobs it had downloaded were going to take 5 days each. I had to abort many jobs to calm the client down. The jobs ended up taking just a few hours each. I'm glad that I now have a version that doesn't have the panic mode problem, which has happened to me many times over the past few years. I'll keep a lookout to see if any of the other oddities return.
----------------------------------------
Cheers |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Since you knew the work was very much shorter, a quick fix would have been to stop BOINC, open client_state.xml and search for 'World'. Below the start of the WCG section in that file there's a line as below. Change the value to what WCG enforces under version 7, a fixed 1.0:
<duration_correction_factor>1.000000</duration_correction_factor>
This project disables the DCF functionality -for this grid only- with <dont_use_dcf/>. To be clear, 'Panic State', really 'High Priority', where a task or tasks under deadline threat are advanced, still occurs, but the v7 client won't go into a never-ending 'try to see if this one is quicker'. This could continue up to a max equal to the number of cores, but then it simply goes to EDF (Earliest Deadline First). It's unlikely to happen again, but I see the occasional repair job getting the VIT treatment when it has sat too long in the queue. They will be started no matter what when getting within 1 day of the deadline, with a few more exceptions such as when the buffer is more than half of the remaining deadline time. This means if you have a 1.5 day buffer and a repair job arrives with 3.5 days, it will be started earlier than in the normal FIFO cycle in which BOINC runs by default [per attached project]. |
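As a toy illustration of the ordering described above (tasks under deadline threat run first, earliest deadline first, and everything else stays in FIFO order), here is a small sketch. The `Task` class, the `run_order` function and the sample queue are hypothetical, only the 1 day threshold is taken from the post, and the real BOINC scheduler weighs more factors than this.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    arrival_order: int       # position in the normal FIFO queue
    days_to_deadline: float  # time left before the deadline


def run_order(tasks, threat_days=1.0):
    """Tasks inside the threat window go first, earliest deadline first;
    all other tasks keep their FIFO position."""
    urgent = sorted(
        (t for t in tasks if t.days_to_deadline <= threat_days),
        key=lambda t: t.days_to_deadline,
    )
    normal = sorted(
        (t for t in tasks if t.days_to_deadline > threat_days),
        key=lambda t: t.arrival_order,
    )
    return [t.name for t in urgent + normal]


# Hypothetical queue: two tasks are within 1 day of their deadline.
queue = [
    Task("a", 0, 6.0),
    Task("b", 1, 0.8),
    Task("repair", 2, 0.5),
    Task("c", 3, 4.0),
]
print(run_order(queue))
```

In this sketch the repair job jumps ahead of everything because it has the nearest deadline, followed by the other threatened task, and only then does the regular FIFO order resume.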
||
|
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline Project Badges:
|
Since you knew the work was very much shorter, a quick fix would have been to stop BOINC, open the client_state.xml and search for 'World'. Below the start of the WCG section in that file there's a line as below. Change the value as what it is enforced by WCG to be under version 7, fixed 1.0
<duration_correction_factor>1.000000</duration_correction_factor>
Searching for "World" brings me to rnaworld.
Changing the value only has an effect as long as the BOINC client doesn't make contact with the WCG server. After the first contact the value will change to 1 again. |
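For older clients where this manual fix applies, a plain text search for 'World' can land in the wrong project section, as noted above. A minimal sketch of a more targeted edit follows. It assumes each `<project>` block in client_state.xml carries a `<project_name>` element next to `<duration_correction_factor>`; the sample XML, URLs, values and the `reset_dcf` helper are made up for illustration. It works on a string so it can be tried safely, and any real edit should be made on a copy with BOINC stopped.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
SAMPLE = """<client_state>
  <project>
    <master_url>http://example.invalid/rnaworld/</master_url>
    <project_name>RNA World</project_name>
    <duration_correction_factor>2.500000</duration_correction_factor>
  </project>
  <project>
    <master_url>http://example.invalid/wcg/</master_url>
    <project_name>World Community Grid</project_name>
    <duration_correction_factor>30.000000</duration_correction_factor>
  </project>
</client_state>"""


def reset_dcf(xml_text, project_name, value=1.0):
    """Set the DCF only in projects whose name matches exactly.

    Returns the modified XML text and the number of projects changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for project in root.iter("project"):
        name = project.findtext("project_name", default="").strip()
        if name == project_name:
            dcf = project.find("duration_correction_factor")
            if dcf is not None:
                dcf.text = f"{value:.6f}"
                changed += 1
    return ET.tostring(root, encoding="unicode"), changed


new_xml, count = reset_dcf(SAMPLE, "World Community Grid")
print(count)
```

Matching on the exact project name leaves the other project's factor untouched, which avoids the rnaworld mix-up that a substring search runs into.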
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hmmm, this is for the 6.12 client and prior, as a temporary quick fix! Of course, if you enter 1.000000 -with- v7, there's no need to change anything, as 1.000000 remains, yes you guessed it, 1.000000.
|
||
|
|
|