World Community Grid - View Thread

World Community Grid Forums

Category: Completed Research

Forum: Outsmart Ebola Together

Thread: Checkpoint Problem

Thread Status: Active
Total posts in this thread: 8

This topic has been viewed 2047 times and has 7 replies

NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Status: Offline

Checkpoint Problem

I just checked up on my computer and found it in panic mode, even though my queue is only set to 1.5 days. I found 14 jobs had been suspended, with perhaps 24 hours of uncheckpointed work.

I tried to intervene manually so that the jobs with large amounts of uncheckpointed work and low time-to-completion could run, only to see the elapsed time revert to the last checkpoint. The suspended jobs had been sitting in memory.

Cheers

----------------------------------------

[Feb 16, 2015 11:49:06 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Checkpoint Problem

The symptom of many tasks being preempted in panic state is not supposed to happen with client version 7.

[Feb 17, 2015 12:11:53 AM]

NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Status: Offline

Re: Checkpoint Problem

I only recently noticed that WCG has moved up to 7.2.47 from 6.10.58, and I haven't upgraded yet. I'll have to do that very soon. What I was concerned about, Rob, was the loss of all the time when the tasks were resumed, even though they were still in memory. I watched about 24 hours of work vanish.

Cheers

----------------------------------------

[Feb 17, 2015 2:39:53 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Checkpoint Problem

I ignored the loss part, as it was rather late for me, and gave the key fix for the retired version you would have had to be on. Why BOINC unloaded those 14 even though LAIM (Leave Applications In Memory) was on would be speculation, 6.10.58 now belonging to the ancients. Jobs in panic state get preempted, as you noticed, without waiting for a checkpoint, and are even forced to stay in memory without LAIM if they haven't checkpointed at all. Maybe it was a memory shortage, but even in that case the client is supposed to park cores with 'waiting for memory'.

There are a few more reports of OET losing progress as if LAIM were not on, and I'm seeing it on Android myself. I killed a 333 long one because it had run 12+ hours and was aiming for 100 hours total runtime. It was at 10 percent, a sure sign it had not passed the first checkpoint at 12.5%, yet when I looked again it was back to 0%. [Going to NativeBOINC on Android soon for multiple control/configuration reasons.] Otherwise, on the 2 comps I've seen nothing of the sort, having done > 3000 of these, but then there was no panic, as I keep a 1 day buffer with the 7.4 releases.

7.2.47 has several advantages that prevent this issue from even developing. Most prominently, it ignores the client DCF (duration correction factor) function, so runtime projections are an average of what the server knows, rather than being driven by an incidental long task that can blow the buffer totally out of whack and give you that experience. Also, v7 stops trying to find shorter tasks in panic state. I think [but am unsure] the max is equal to the number of cores, so a quad would not see more than 4.
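To illustrate how an inflated DCF can push a small queue into panic, here is a minimal sketch. It assumes the projected runtime is simply the server's estimate multiplied by the DCF; the task count, the 4 hour estimate, the DCF of 30 and the function names are all hypothetical numbers chosen for illustration, not values taken from any client.

```python
def projected_hours(server_estimate_hours: float, dcf: float) -> float:
    """Projection under the assumed model: server estimate scaled by the DCF."""
    return server_estimate_hours * dcf


def queue_days(task_estimates_hours, dcf: float, cores: int) -> float:
    """Rough days of queued work if tasks are spread evenly over the cores."""
    total_hours = sum(projected_hours(h, dcf) for h in task_estimates_hours)
    return total_hours / cores / 24.0


# Hypothetical queue: 14 tasks the server expects to take 4 hours each, on a quad core.
tasks = [4.0] * 14
normal = queue_days(tasks, dcf=1.0, cores=4)
inflated = queue_days(tasks, dcf=30.0, cores=4)

print(f"DCF 1.0:  {normal:.2f} days of work queued")
print(f"DCF 30.0: {inflated:.2f} days of work queued")
```

With the DCF pinned at 1.0 the same queue fits comfortably inside a 1.5 day buffer, while the inflated factor makes every task look like a multi-day job.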

[Feb 17, 2015 8:20:57 AM]

NixChix
Veteran Cruncher
United States
Joined: Apr 29, 2007
Post Count: 1187
Status: Offline

Re: Checkpoint Problem

I just updated to BOINC 7.2.47 this morning. Since the last post, the client was still acting strange, as it still thought that the small queue of jobs it had downloaded was going to take 5 days each. I had to abort many jobs to calm the client down. The jobs ended up taking just a few hours each. I'm glad that I now have a version that doesn't have the panic mode problem. I've had the panic mode problem happen to me many times over the past few years. I'll keep a lookout to see if any of the other oddities return.

Cheers

----------------------------------------

[Feb 22, 2015 5:44:53 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Checkpoint Problem

Since you knew the work was very much shorter, a quick fix would have been to stop BOINC, open client_state.xml and search for 'World'. Below the start of the WCG section in that file there's a line as below. Change the value to what WCG enforces under version 7, a fixed 1.0:

<duration_correction_factor>1.000000</duration_correction_factor>

This project disables the DCF functionality -for this grid only- with <dont_use_dcf/>.
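As a rough sketch of that manual edit, the snippet below resets the factor in a small sample document. The sample XML is hypothetical and heavily trimmed (including both URLs), the helper name `reset_dcf` is made up, and it assumes each `<project>` block carries a `<master_url>` next to its `<duration_correction_factor>`. It matches on the project URL rather than the bare word 'World' so that only the intended project is touched. Stop BOINC and back up the real file before trying anything like this on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
SAMPLE = """<client_state>
  <project>
    <master_url>http://example.org/rnaworld/</master_url>
    <duration_correction_factor>1.250000</duration_correction_factor>
  </project>
  <project>
    <master_url>http://www.worldcommunitygrid.org/</master_url>
    <duration_correction_factor>30.000000</duration_correction_factor>
  </project>
</client_state>"""


def reset_dcf(xml_text: str, url_fragment: str) -> str:
    """Set the DCF to 1.000000 for every project whose master_url contains url_fragment."""
    root = ET.fromstring(xml_text)
    for project in root.iter("project"):
        url = project.findtext("master_url", default="")
        if url_fragment in url:
            dcf = project.find("duration_correction_factor")
            if dcf is not None:
                dcf.text = "1.000000"
    return ET.tostring(root, encoding="unicode")


patched = reset_dcf(SAMPLE, "worldcommunitygrid.org")
print(patched)
```

Note that ElementTree may re-serialize the file with different whitespace, so for a one-off fix a text editor is probably the safer tool.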

To be clear, 'Panic State', really 'High Priority', where tasks under deadline threat are advanced, still occurs, but the v7 client won't go into a never-ending 'try to see if this one is quicker'. This could continue up to a max equal to the number of cores, but then it simply goes to EDF (Earliest Deadline First). It's unlikely to happen again, but I do see the occasional repair job getting the VIT treatment when it has sat too long in the queue. They will be started no matter what once they get within 1 day of the deadline, with a few more exceptions, such as when the buffer is more than half of the remaining deadline time. This means if you have a 1.5 day buffer and a repair job arrives with 3.5 days, it will be started earlier than in the normal FIFO cycle that BOINC runs as standard [per attached project].
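The difference between the normal FIFO cycle and EDF can be sketched as two sort orders over the same queue. This is only an illustration of the two orderings, not the BOINC scheduler itself; the `Task` class, the task names and the deadlines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    arrival: int             # position in which the task was downloaded
    days_to_deadline: float  # time left until the task's deadline


def fifo_order(tasks):
    """Normal cycle: run tasks in the order they arrived."""
    return sorted(tasks, key=lambda t: t.arrival)


def edf_order(tasks):
    """Earliest Deadline First: run the task closest to its deadline first."""
    return sorted(tasks, key=lambda t: t.days_to_deadline)


# Hypothetical queue: a repair job arrives last but with the shortest deadline.
queue = [
    Task("regular_a", arrival=1, days_to_deadline=10.0),
    Task("regular_b", arrival=2, days_to_deadline=7.0),
    Task("repair_job", arrival=3, days_to_deadline=3.5),
]

print("FIFO:", [t.name for t in fifo_order(queue)])
print("EDF: ", [t.name for t in edf_order(queue)])
```

Under FIFO the repair job waits behind both regular tasks, while under EDF it jumps to the front of the queue.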

[Feb 22, 2015 9:22:04 AM]

Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1403
Status: Offline

Re: Checkpoint Problem

Searching for "World" brings me to rnaworld :P

Changing the value only has an effect as long as the BOINC client doesn't make contact with the WCG server.
After the first contact the value will change to 1 again.

[Feb 22, 2015 1:43:58 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Checkpoint Problem

Hmmm, this is for the 6.12 client and prior, as a temporary quick fix! Of course, if you enter 1.000000 -with- v7, there's no need to change anything, as 1.000000 remains, yes you guessed it, 1.000000 ;)

[Feb 22, 2015 1:51:44 PM]
