Total posts in this thread: 8
This topic has been viewed 515 times and has 7 replies
spRocket
Senior Cruncher
Joined: Mar 25, 2020
Post Count: 273
Two "Too Late" units

I had two work units fail in an interesting way:
ARP1_0012386_139_0
ARP1_0031760_139_0

On both of these WUs, my cruncher turned in a completed result, only for all of the other wingmen to have errored out before mine finished. According to the logs, both of my tasks ran to completion.

Could these units be triggering some sort of CPU bug, or was I just extremely unlucky that these units hit a bunch of hosts with other problems (out of disk space, memory issues, etc.)?
----------------------------------------
[Edit 1 times, last edit by spRocket at Nov 5, 2024 2:20:00 PM]
[Nov 5, 2024 2:18:36 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2145
Re: Two "Too Late" units

Your wingmen failed to download some input file(s), spRocket; that's why they errored out.
The same thing happened to two tasks on my devices:

workunit 626535846
ARP1_0030653_139_0  Linux Fedora  2Late  2024-11-04T08:21:25  2024-11-04T18:53:00  7.33/7.39  503.3/0.0
ARP1_0030653_139_1  Linux Ubuntu  Error  2024-11-04T08:21:50  2024-11-05T00:20:39  0.00/0.00    0.0/0.0
ARP1_0030653_139_2  Linux Mageia  Error  2024-11-05T01:10:22  2024-11-05T02:18:40  0.00/0.00  490.0/0.0
ARP1_0030653_139_3  Linux Ubuntu  Error  2024-11-05T02:29:32  2024-11-05T03:56:43  0.00/0.00    0.0/0.0
ARP1_0030653_139_4  Linux Ubuntu  Error  2024-11-05T04:03:14  2024-11-05T04:06:37  0.00/0.00    0.0/0.0
ARP1_0030653_139_5  Linux Ubuntu  Error  2024-11-05T04:13:05  2024-11-05T05:52:41  0.00/0.00    0.0/0.0
Details: ---------------------------------------------------------------------------------------------------------------------------------------
ARP1_0030653_139_0  Linux Fedora  2Late  2024-11-04T08:21:25  2024-11-04T18:53:00    7.33/7.39     503.3/0.0
Logfile:
<core_client_version>7.20.2</core_client_version>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[13:21:09] INFO: Checkpoint taken at 2019-04-05_06:00:00
[14:27:20] INFO: Checkpoint taken at 2019-04-05_12:00:00
[15:24:00] INFO: Checkpoint taken at 2019-04-05_18:00:00
[16:06:00] INFO: Checkpoint taken at 2019-04-06_00:00:00
[16:53:03] INFO: Checkpoint taken at 2019-04-06_06:00:00
[18:01:03] INFO: Checkpoint taken at 2019-04-06_12:00:00
[19:00:27] INFO: Checkpoint taken at 2019-04-06_18:00:00
[19:43:09] INFO: Checkpoint taken at 2019-04-07_00:00:00
INFO: Simulation complete compressing output.
19:44:16 (1234681): called boinc_finish(0)

</stderr_txt>

The following one is not quite the same (yet?), as it still has one wingman In Progress:

workunit 626514226
ARP1_0014490_126_0  Fedora Linux  2Late  2024-11-04T07:23:47  2024-11-05T03:10:13  14.39/15.69  632.1/0.0
ARP1_0014490_126_1  Linux Ubuntu  Error  2024-11-04T07:23:52  2024-11-04T23:22:00    0.00/0.00    0.0/0.0
ARP1_0014490_126_2  Linux Debian  InPrg  2024-11-04T07:23:55  2024-11-09T19:23:55    0.00/0.00    0.0/0.0
ARP1_0014490_126_3  Linuxmint     Error  2024-11-04T23:36:39  2024-11-04T23:41:51    0.00/0.00    0.0/0.0
ARP1_0014490_126_4  Linuxmint     Error  2024-11-04T23:50:28  2024-11-05T00:00:44    0.00/0.00  490.0/0.0
ARP1_0014490_126_5  Linux Ubuntu  Error  2024-11-05T00:09:45  2024-11-05T01:15:45    0.00/0.00    0.0/0.0
ARP1_0014490_126_6  Linux Debian  Error  2024-11-05T01:25:55  2024-11-05T01:32:51    0.00/0.00  490.0/0.0

Adri
----------------------------------------
[Edit 1 times, last edit by adriverhoef at Nov 5, 2024 2:54:14 PM]
[Nov 5, 2024 2:52:46 PM]
gj82854
Advanced Cruncher
Joined: Sep 26, 2022
Post Count: 96
Re: Two "Too Late" units

I had two WUs that were also listed as being too late but were returned well within the deadline. When I went back about 10 minutes later to check some more details, neither was listed as too late anymore. Since I didn't note the WU names the first time, I could not determine their ultimate status, but as of now none of my roughly 156 returned results is listed as too late. Without further evidence, I suspect it is a temporary status.
[Nov 5, 2024 3:45:52 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2145
Re: Two "Too Late" units

The workunit that I reported earlier (as being 'Too Late' to validate) finally ended up in 'Error', even though two of the devices appear to have returned a valid result within their projected deadlines:

workunit 626514226
App: Africa Rainfall Project
Workunit: ARP1_0014490_126
Created: 2024-11-04T07:23:44
Quorum: 2
Replication: 3

ARP1_0014490_126_0  Fedora Linux  Error  2024-11-04T07:23:47  2024-11-05T03:10:13  14.39/15.69  632.1/0.0
ARP1_0014490_126_1  Linux Ubuntu  Error  2024-11-04T07:23:52  2024-11-04T23:22:00    0.00/0.00    0.0/0.0
ARP1_0014490_126_2  Linux Debian  2Late  2024-11-04T07:23:55  2024-11-06T04:57:14  29.92/29.92  687.6/0.0
ARP1_0014490_126_3  Linuxmint     Error  2024-11-04T23:36:39  2024-11-04T23:41:51    0.00/0.00    0.0/0.0
ARP1_0014490_126_4  Linuxmint     Error  2024-11-04T23:50:28  2024-11-05T00:00:44    0.00/0.00  490.0/0.0
ARP1_0014490_126_5  Linux Ubuntu  Error  2024-11-05T00:09:45  2024-11-05T01:15:45    0.00/0.00    0.0/0.0
ARP1_0014490_126_6  Linux Debian  Error  2024-11-05T01:25:55  2024-11-05T01:32:51    0.00/0.00  490.0/0.0
Details: ---------------------------------------------------------------------------------------------------------------------------------------
ARP1_0014490_126_0  Fedora Linux  Error  2024-11-04T07:23:47  2024-11-05T03:10:13   14.39/15.69    632.1/0.0
Sent Time: 2024-11-04T07:23:47+0000
Due Time: 2024-11-09T19:23:47+0000
Returned: 2024-11-05T03:10:13+0000
Result-ID: 1145012650
Logfile:
<core_client_version>7.20.2</core_client_version>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[10:47:52] INFO: Checkpoint taken at 2019-03-10_06:00:00
[13:14:22] INFO: Checkpoint taken at 2019-03-10_12:00:00
[15:24:09] INFO: Checkpoint taken at 2019-03-10_18:00:00
[16:54:31] INFO: Checkpoint taken at 2019-03-11_00:00:00
[18:39:56] INFO: Checkpoint taken at 2019-03-11_06:00:00
[21:02:27] INFO: Checkpoint taken at 2019-03-11_12:00:00
[23:10:08] INFO: Checkpoint taken at 2019-03-11_18:00:00
[00:40:39] INFO: Checkpoint taken at 2019-03-12_00:00:00
INFO: Simulation complete compressing output.
00:42:29 (1143452): called boinc_finish(0)

</stderr_txt>
ARP1_0014490_126_1 Linux Ubuntu Error 2024-11-04T07:23:52 2024-11-04T23:22:00 0.00/0.00 0.0/0.0
Sent Time: 2024-11-04T07:23:52+0000
Due Time: 2024-11-09T19:23:52+0000
Returned: 2024-11-04T23:22:00+0000
Result-ID: 1145012651
ARP1_0014490_126_2 Linux Debian 2Late 2024-11-04T07:23:55 2024-11-06T04:57:14 29.92/29.92 687.6/0.0
Sent Time: 2024-11-04T07:23:55+0000
Due Time: 2024-11-09T19:23:55+0000
Returned: 2024-11-06T04:57:14+0000
Result-ID: 1145012652
Logfile:
<core_client_version>7.14.2</core_client_version>
<stderr_txt>
INFO: Initializing
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[03:20:00] INFO: Checkpoint taken at 2019-03-10_06:00:00
[07:48:22] INFO: Checkpoint taken at 2019-03-10_12:00:00
[12:02:15] INFO: Checkpoint taken at 2019-03-10_18:00:00
[15:04:09] INFO: Checkpoint taken at 2019-03-11_00:00:00
[18:07:13] INFO: Checkpoint taken at 2019-03-11_06:00:00
[22:43:53] INFO: Checkpoint taken at 2019-03-11_12:00:00
[02:50:59] INFO: Checkpoint taken at 2019-03-11_18:00:00
[05:36:33] INFO: Checkpoint taken at 2019-03-12_00:00:00
INFO: Simulation complete compressing output.
05:39:35 (23763): called boinc_finish(0)

</stderr_txt>
ARP1_0014490_126_3 Linuxmint Error 2024-11-04T23:36:39 2024-11-04T23:41:51 0.00/0.00 0.0/0.0
Sent Time: 2024-11-04T23:36:39+0000
Due Time: 2024-11-06T11:36:39+0000
Returned: 2024-11-04T23:41:51+0000
Result-ID: 1145826308
ARP1_0014490_126_4 Linuxmint Error 2024-11-04T23:50:28 2024-11-05T00:00:44 0.00/0.00 490.0/0.0
Sent Time: 2024-11-04T23:50:28+0000
Due Time: 2024-11-06T11:50:28+0000
Returned: 2024-11-05T00:00:44+0000
Result-ID: 1145844797
ARP1_0014490_126_5 Linux Ubuntu Error 2024-11-05T00:09:45 2024-11-05T01:15:45 0.00/0.00 0.0/0.0
Sent Time: 2024-11-05T00:09:45+0000
Due Time: 2024-11-06T12:09:45+0000
Returned: 2024-11-05T01:15:45+0000
Result-ID: 1145865187
ARP1_0014490_126_6 Linux Debian Error 2024-11-05T01:25:55 2024-11-05T01:32:51 0.00/0.00 490.0/0.0
Sent Time: 2024-11-05T01:25:55+0000
Due Time: 2024-11-06T13:25:55+0000
Returned: 2024-11-05T01:32:51+0000
Result-ID: 1145934576
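
To put a number on "within the projected deadline(s)", the margin can be worked out from the Due Time and Returned lines above. This is only a small Python sketch using the timestamps listed for results _0 and _2; the names FMT, TIMES, margin and MARGINS are made up for the illustration and are not part of any WCG or BOINC tool.

```python
from datetime import datetime

# Timestamp layout used in the listing, e.g. 2024-11-09T19:23:47+0000
FMT = "%Y-%m-%dT%H:%M:%S%z"

# (due time, returned time) copied from the listing for the two results
# that ran to completion. The dictionary name is illustrative only.
TIMES = {
    "ARP1_0014490_126_0": ("2024-11-09T19:23:47+0000", "2024-11-05T03:10:13+0000"),
    "ARP1_0014490_126_2": ("2024-11-09T19:23:55+0000", "2024-11-06T04:57:14+0000"),
}


def margin(due, returned):
    """Return how much time was left before the deadline at return time."""
    return datetime.strptime(due, FMT) - datetime.strptime(returned, FMT)


MARGINS = {name: margin(due, ret) for name, (due, ret) in TIMES.items()}

for name, left in MARGINS.items():
    print(name, "was returned", left, "before its deadline")
```

Both margins come out positive by several days, so neither result was late with respect to its own deadline.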

Adri
[Nov 6, 2024 9:48:40 AM]
rilian
Veteran Cruncher
Ukraine - we rule!
Joined: Jun 17, 2007
Post Count: 1453
Re: Two "Too Late" units

Same here: I have one task that crunched for 15 hours, was returned in less than 2 days, and is marked "Too Late".


ARP1_0001790_139_1 Linux Ubuntu
status Too Late
sent 2024-11-04 07:46:37 UTC
returned 2024-11-05 23:07:24 UTC
----------------------------------------
[Nov 6, 2024 5:13:57 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12310
Re: Two "Too Late" units

The problem here is that 5 copies were returned with errors for whatever reason before the valid one was returned, but by that time the unit had been automatically Errored Out.
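
As a rough illustration of that sequence, here is a Python sketch that replays the return times adriverhoef listed for workunit 626514226 in the order they arrived. It is not the actual server logic: the limit of 5 errors is an assumption taken from the description above, the "completed" label just means the log showed the task running to the end, and the names RESULTS, MAX_ERRORS and classify are invented for the example.

```python
from datetime import datetime

# Assumption: the workunit is given up after this many error results.
# The value 5 comes from the description in this thread, not from a
# confirmed server setting.
MAX_ERRORS = 5

# (result name, outcome, return time) from the listing for workunit 626514226.
# "completed" means the task ran to the end according to its log.
RESULTS = [
    ("ARP1_0014490_126_0", "completed", "2024-11-05T03:10:13"),
    ("ARP1_0014490_126_1", "error", "2024-11-04T23:22:00"),
    ("ARP1_0014490_126_2", "completed", "2024-11-06T04:57:14"),
    ("ARP1_0014490_126_3", "error", "2024-11-04T23:41:51"),
    ("ARP1_0014490_126_4", "error", "2024-11-05T00:00:44"),
    ("ARP1_0014490_126_5", "error", "2024-11-05T01:15:45"),
    ("ARP1_0014490_126_6", "error", "2024-11-05T01:32:51"),
]


def classify(results, max_errors=MAX_ERRORS):
    """Walk the results in order of return time and label each one."""
    errors = 0
    labels = {}
    in_return_order = sorted(results, key=lambda r: datetime.fromisoformat(r[2]))
    for name, outcome, _returned in in_return_order:
        if outcome == "error":
            errors += 1
            labels[name] = "error"
        elif errors >= max_errors:
            # The workunit has already been given up, so this result
            # can no longer be validated.
            labels[name] = "too late"
        else:
            labels[name] = "pending validation"
    return labels


labels = classify(RESULTS)
for name in sorted(labels):
    print(name, labels[name])
```

With these return times, the fifth error (_6) comes back before either completed result, so both _0 and _2 arrive after the assumed limit has been reached.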

Mike
[Nov 6, 2024 9:34:43 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1944
Re: Two "Too Late" units

I just noticed that one ARP1 WU that one of my hosts returned during the day is marked as "too late", even though the deadline was supposed to be 11/12. The 5 other copies sent out (_0 to _3 and _5) all resulted in "error", and another copy, _6, just shows "other" with no further information. (By the way, these are all Windows 10 hosts.)

Project name: Africa Rainfall Project
Created: Nov. 4, 2024 - 23:58 UTC
Name: ARP1_0024567_129
Minimum Quorum: 2
Replication: 3

Result name  OS type  OS version  Status  Sent time  Time due / Return time  CPU time / Elapsed time  Claimed credit / Granted credit

ARP1_0024567_129_0 Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00) Error 2024-11-05 00:03:54 UTC 2024-11-07 20:12:33 UTC
ARP1_0024567_129_1 Microsoft Windows 11 Professional x64 Edition, (10.00.22631.00) Error 2024-11-05 00:03:40 UTC 2024-11-06 18:38:03 UTC 490 / 0
ARP1_0024567_129_2 Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00) Error 2024-11-05 00:03:45 UTC 2024-11-06 12:04:58 UTC
ARP1_0024567_129_3 Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00) Error 2024-11-06 12:33:36 UTC 2024-11-08 07:10:56 UTC 490 / 0
ARP1_0024567_129_4 Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00) Too Late 2024-11-06 12:38:27 UTC 2024-11-08 16:13:44 UTC 19.13 / 20.31 734.7 / 0
ARP1_0024567_129_5 Microsoft Windows 10 Professional x64 Edition, (10.00.19045.00) Error 2024-11-06 12:37:17 UTC 2024-11-08 08:50:43 UTC
ARP1_0024567_129_6 Other

----------------------------------------
[Edit 1 times, last edit by TPCBF at Nov 9, 2024 4:20:45 AM]
[Nov 9, 2024 4:14:25 AM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12310
Re: Two "Too Late" units

Copy _6 was probably never sent, as 5 errors had already occurred.
[Nov 10, 2024 2:04:47 AM]