This topic has been viewed 1767 times and has 1 reply
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2346
Status: Offline
Tasks that were once running have changed to UNINITIALIZED

I have this curious case with boinccmd --get_tasks. About 10 hours ago, 4 ARP1 tasks and some others were running when a bunch of FAH2 tasks arrived. The ARP1 tasks were at 87.5%, 59.2%, 53.8% and 10.2% and were paused as soon as the FAH2 tasks started. So far, so good.

So now, after 10 hours, there are still enough FAH2 tasks in my queue to keep the CPUs busy and 24 FAH2 tasks have been returned in the meantime.

What I'm seeing now is that all four ARP1 tasks are Waiting to run in Boinc Manager, but when I look with boinccmd --get_tasks, two of them are SUSPENDED and the other two are UNINITIALIZED. The latter is curious. I have never seen this behaviour before, tasks that have been running for hours going from SUSPENDED to UNINITIALIZED.

***

The two UNINITIALIZED ARP1 tasks are sitting at the top of my queue, numbered 1 and 2 in boinccmd --get_tasks.
To find out whether this UNINITIALIZED state is a problem, I first suspended all uninitialized FAH2 tasks and then one running FAH2 task, to see which 'Waiting to run' ARP1 task would start. Neither of the UNINITIALIZED ARP1 tasks resumed; one of the SUSPENDED ARP1 tasks did continue.

This is the state of the 4 ARP1 tasks in boinccmd --get_tasks:
======== Tasks ========
1) -----------
name: ARP1_0022898_013_1
WU name: ARP1_0022898_013
project URL: http://www.worldcommunitygrid.org/
received: Sat May 30 12:57:37 2020
report deadline: Sat Jun 6 12:57:36 2020
ready to report: no
state: downloaded
scheduler state: preempted
active_task_state: UNINITIALIZED
app version num: 727
resources: 1 CPU
estimated CPU time remaining: 8077.052964
slot: 7
PID: 327624
CPU time at last checkpoint: 56283.830000
current CPU time: 56288.750000
fraction done: 0.875000
swap size: 816 MB
working set size: 697 MB
2) -----------
name: ARP1_0025860_012_0
WU name: ARP1_0025860_012
project URL: http://www.worldcommunitygrid.org/
received: Sat May 30 12:59:45 2020
report deadline: Sat Jun 6 12:59:44 2020
ready to report: no
state: downloaded
scheduler state: preempted
active_task_state: UNINITIALIZED
app version num: 727
resources: 1 CPU
estimated CPU time remaining: 26344.652263
slot: 6
PID: 327718
CPU time at last checkpoint: 30399.810000
current CPU time: 34984.430000
fraction done: 0.592292
swap size: 816 MB
working set size: 697 MB
3) -----------
name: ARP1_0030492_013_0
WU name: ARP1_0030492_013
project URL: http://www.worldcommunitygrid.org/
received: Sun May 31 02:55:16 2020
report deadline: Sun Jun 7 02:55:15 2020
ready to report: no
state: downloaded
scheduler state: preempted
active_task_state: SUSPENDED
app version num: 727
resources: 1 CPU
estimated CPU time remaining: 29579.308140
slot: 2
PID: 327857
CPU time at last checkpoint: 30048.760000
current CPU time: 31902.590000
fraction done: 0.537500
swap size: 816 MB
working set size: 675 MB
4) -----------
name: ARP1_0028420_013_1
WU name: ARP1_0028420_013
project URL: http://www.worldcommunitygrid.org/
received: Sun May 31 02:57:40 2020
report deadline: Sun Jun 7 02:57:39 2020
ready to report: no
state: downloaded
scheduler state: preempted
active_task_state: SUSPENDED
app version num: 727
resources: 1 CPU
estimated CPU time remaining: 57373.196450
slot: 9
PID: 327887
CPU time at last checkpoint: 0.000000
current CPU time: 6098.635000
fraction done: 0.102917
swap size: 811 MB
working set size: 689 MB

This is what an executing FAH2 task and an uninitialized FAH2 task look like in boinccmd --get_tasks:
78) -----------
name: FAH2_002722_zinc00221636_000002_000076_159_0
WU name: FAH2_002722_zinc00221636_000002_000076_159
project URL: http://www.worldcommunitygrid.org/
received: Tue Jun 2 01:52:04 2020
report deadline: Wed Jun 3 01:52:04 2020
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 730
resources: 1 CPU
estimated CPU time remaining: 2335.942888
slot: 23
PID: 343061
CPU time at last checkpoint: 14055.610000
current CPU time: 14191.540000
fraction done: 0.858200
swap size: 1779 MB
working set size: 533 MB
79) -----------
name: FAH2_002721_zinc00068575_000001_000048_151_0
WU name: FAH2_002721_zinc00068575_000001_000048_151
project URL: http://www.worldcommunitygrid.org/
received: Tue Jun 2 02:48:47 2020
report deadline: Wed Jun 3 02:48:47 2020
ready to report: no
state: downloaded
scheduler state: uninitialized
active_task_state: UNINITIALIZED
app version num: 730
resources: 1 CPU
estimated CPU time remaining: 16473.504150
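
The odd combination in the listings above (scheduler state "preempted" together with active_task_state "UNINITIALIZED") can be picked out automatically. Below is a minimal Python sketch, assuming the output keeps the `key: value` layout shown above. The function names `parse_tasks` and `preempted_but_uninitialized` are made up for illustration, and `SAMPLE` is an abbreviated copy of tasks 1, 3 and 79.

```python
import re

# Abbreviated copy of three tasks from the listings above.
SAMPLE = """\
======== Tasks ========
1) -----------
name: ARP1_0022898_013_1
project URL: http://www.worldcommunitygrid.org/
scheduler state: preempted
active_task_state: UNINITIALIZED
fraction done: 0.875000
3) -----------
name: ARP1_0030492_013_0
project URL: http://www.worldcommunitygrid.org/
scheduler state: preempted
active_task_state: SUSPENDED
fraction done: 0.537500
79) -----------
name: FAH2_002721_zinc00068575_000001_000048_151_0
project URL: http://www.worldcommunitygrid.org/
scheduler state: uninitialized
active_task_state: UNINITIALIZED
"""


def parse_tasks(text):
    """Split the listing into one dict of key/value strings per task."""
    tasks = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if re.fullmatch(r"\d+\) -+", line):  # task header such as "1) -----------"
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return tasks


def preempted_but_uninitialized(tasks):
    """Names of tasks that ran before (preempted) yet report UNINITIALIZED."""
    return [
        t["name"]
        for t in tasks
        if t.get("scheduler state") == "preempted"
        and t.get("active_task_state") == "UNINITIALIZED"
    ]


print(preempted_but_uninitialized(parse_tasks(SAMPLE)))
```

On a live system the text could come from `subprocess.run(["boinccmd", "--get_tasks"], capture_output=True, text=True).stdout` instead of `SAMPLE`. The filter requires the scheduler state to be "preempted", so FAH2 tasks that have simply never started are not reported.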

My fear is that the UNINITIALIZED ARP1 tasks will be stuck, or worse, after the FAH2 tasks have left.

***

In client_state.xml, the two UNINITIALIZED ARP1 tasks already have a final time:
#1:
    <name>ARP1_0022898_013_1</name>
<final_cpu_time>56288.750000</final_cpu_time>
<final_elapsed_time>59820.089335</final_elapsed_time>
<exit_status>0</exit_status>
<state>2</state>
<platform>x86_64-pc-linux-gnu</platform>
<version_num>727</version_num>
<final_peak_working_set_size>741015552</final_peak_working_set_size>
<final_peak_swap_size>855384064</final_peak_swap_size>
<final_peak_disk_usage>739699473</final_peak_disk_usage>
<wu_name>ARP1_0022898_013</wu_name>
<report_deadline>1591441056.000000</report_deadline>
<received_time>1590836257.242492</received_time>

#2:
    <name>ARP1_0025860_012_0</name>
<final_cpu_time>34984.430000</final_cpu_time>
<final_elapsed_time>37215.518088</final_elapsed_time>
<exit_status>0</exit_status>
<state>2</state>
<platform>x86_64-pc-linux-gnu</platform>
<version_num>727</version_num>
<final_peak_working_set_size>905129984</final_peak_working_set_size>
<final_peak_swap_size>1076940800</final_peak_swap_size>
<final_peak_disk_usage>703054751</final_peak_disk_usage>
<wu_name>ARP1_0025860_012</wu_name>
<report_deadline>1591441184.000000</report_deadline>
<received_time>1590836385.238859</received_time>

… while the other two (SUSPENDED) ARP1 tasks don't have a final time (the values are still zero):
#3:
    <name>ARP1_0030492_013_0</name>
<final_cpu_time>0.000000</final_cpu_time>
<final_elapsed_time>0.000000</final_elapsed_time>
<exit_status>0</exit_status>
<state>2</state>
<platform>x86_64-pc-linux-gnu</platform>
<version_num>727</version_num>
<wu_name>ARP1_0030492_013</wu_name>
<report_deadline>1591491315.000000</report_deadline>
<received_time>1590886516.214492</received_time>

#4:
    <name>ARP1_0028420_013_1</name>
<final_cpu_time>0.000000</final_cpu_time>
<final_elapsed_time>0.000000</final_elapsed_time>
<exit_status>0</exit_status>
<state>2</state>
<platform>x86_64-pc-linux-gnu</platform>
<version_num>727</version_num>
<wu_name>ARP1_0028420_013</wu_name>
<report_deadline>1591491459.000000</report_deadline>
<received_time>1590886660.135495</received_time>

Isn't this curious?
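
The same check can be scripted. This is a minimal Python sketch, assuming that client_state.xml parses as well-formed XML and that each task is a `<result>` element containing the child elements shown above. The function name `unfinished_with_final_time` and the `<client_state>` wrapper in the sample are assumptions for illustration. In the fragments above, `<state>2</state>` goes together with "state: downloaded" in boinccmd for all four tasks.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of results #1 and #3 from above, wrapped in an assumed root element.
SAMPLE = """\
<client_state>
<result>
    <name>ARP1_0022898_013_1</name>
    <final_cpu_time>56288.750000</final_cpu_time>
    <final_elapsed_time>59820.089335</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
</result>
<result>
    <name>ARP1_0030492_013_0</name>
    <final_cpu_time>0.000000</final_cpu_time>
    <final_elapsed_time>0.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>2</state>
</result>
</client_state>
"""


def unfinished_with_final_time(xml_text):
    """Return (name, final_cpu_time) for results in state 2 with a nonzero final time."""
    root = ET.fromstring(xml_text)
    flagged = []
    for result in root.iter("result"):
        name = result.findtext("name", default="").strip()
        state = result.findtext("state", default="").strip()
        final_cpu = float(result.findtext("final_cpu_time", default="0"))
        if state == "2" and final_cpu > 0:
            flagged.append((name, final_cpu))
    return flagged


print(unfinished_with_final_time(SAMPLE))
```

The sketch only reads the text it is given, so it is safest to run it on a copy of client_state.xml rather than on the file the client is using.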

So, what I'm planning to do, maybe tomorrow, maybe tonight, is to remove the final time elements from the <result> entries of the UNINITIALIZED tasks in client_state.xml to see whether that remedies the problem.
----------------------------------------
[Edit 2 times, last edit by adriverhoef at Jun 2, 2020 10:40:54 AM]
[Jun 2, 2020 10:08:16 AM]
adriverhoef
Re: Tasks that were once running have changed to UNINITIALIZED

Well, I was a bit curious (of course) about what would happen if I paused all non-running tasks except for one ARP1 task (the one that was sitting there, UNINITIALIZED and Waiting to run at 59.2%, #2 in the queue), and then paused just one running FAH2 task, so that this ARP1 task could resume. Would it run at all? Would it resume at 59.2%?
$ wcgresults -NCLOSED1P | head -5
Deadline---------------- CPUtime LastChkpnt Estimated Percentage Status slOt Name----------------
Sat Jun 6 12:57:36 2020 15:38:08 0:00:04 2:14:01 87.5000% (-) 7 ARP1_0022898_013_1
Sat Jun 6 12:59:44 2020 9:43:04 1:16:24 6:41:21 59.2292% (-) 6 ARP1_0025860_012_0
Sun Jun 7 02:55:15 2020 8:52:17 0:31:28 7:36:52 53.8125% (W) 2 ARP1_0030492_013_0
Sun Jun 7 02:57:39 2020 1:41:47 1:41:47 14:47:14 10.2917% (W) 9 ARP1_0028420_013_1

***

Well, it did resume, but it jumped back to its last checkpoint: 50%.
In any case, that's better than returning to 0% or erroring out completely.
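
How much work such a rollback costs follows from the boinccmd figures posted earlier: it is the difference between "current CPU time" and "CPU time at last checkpoint". Here is a small Python sketch of that arithmetic. The helper names `hms` and `rollback_seconds` are made up for illustration, and the numbers are the ones from the first post.

```python
def hms(seconds):
    """Format a number of seconds as h:mm:ss, dropping the fractional part."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def rollback_seconds(current_cpu, checkpoint_cpu):
    """CPU time spent since the last checkpoint, lost if the task restarts from it."""
    return current_cpu - checkpoint_cpu


# ARP1_0025860_012_0 (the task at 59.2%), values from boinccmd --get_tasks
print(hms(34984.43))                               # 9:43:04, the CPUtime column
print(hms(rollback_seconds(34984.43, 30399.81)))   # 1:16:24, the LastChkpnt column

# ARP1_0022898_013_1 (the task at 87.5%)
print(hms(rollback_seconds(56288.75, 56283.83)))   # 0:00:04
```

So the task at 59.2% had about 1 hour 16 minutes of CPU time to lose, while the one at 87.5% would lose only a few seconds if it restarts the same way.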

I still have the other UNINITIALIZED ARP1 task at 87.5%, four seconds after its last checkpoint.
$ wcgresults -NSO_CL1PPED | head -5
Deadline---------------- CPUtime LastChkpnt Estimated Percentage Status slOt Name----------------
Sat Jun 6 12:57:36 2020 15:38:08 0:00:04 2:14:01 87.5000% (-) 7 ARP1_0022898_013_1
Sat Jun 6 12:59:44 2020 8:27:32 0:00:52 8:25:25 50.1042% (W) 6 ARP1_0025860_012_0
Sun Jun 7 02:55:15 2020 8:52:17 0:31:28 7:36:52 53.8125% (W) 2 ARP1_0030492_013_0
Sun Jun 7 02:57:39 2020 1:41:47 1:41:47 14:47:14 10.2917% (W) 9 ARP1_0028420_013_1
Not afraid.

(Download wcgresults for Linux here.)
----------------------------------------
[Edit 1 times, last edit by adriverhoef at Jun 2, 2020 1:28:35 PM]
[Jun 2, 2020 1:14:14 PM]