| World Community Grid Forums
Thread Status: Active Total posts in this thread: 315
JSYKES
Senior Cruncher Joined: Apr 28, 2007 Post Count: 206 Status: Offline Project Badges:
|
I have an ARP beta running (slowly!): it is at 13+ hours and only 32% complete. I have noticed that the interval between checkpoints seems long, currently at least 4 hours. I am not sure whether this is a good or a bad thing, but I guess checkpoints ought to be more frequent than that to allow for machine usage, routine starts/stops, etc.?
The SCC betas have all raced through without a hitch.
||
|
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 328 Status: Offline Project Badges:
|
I have had one work unit which has gone invalid: BETA_ARP1_0000488_002.
Other wingmen had a range of statuses: 1 no reply, 2 valid, 1 error, and my 1 invalid.

The result log of the wingman in error is:

Result Log
Result Name: BETA_ ARP1_ 0000488_ 002_ 2--
<core_client_version>5.4.11</core_client_version>
<message>
Couldn't start or resume: 2
</message>

My invalid unit logged many error messages throughout, but it continued running for over 23 hours. The error states that the work unit was 'out of memory', but my PC is an i7-6700 with 16 GB of memory and reached a maximum of 45% memory usage. The output from my result log is:

Result Log
Result Name: BETA_ ARP1_ 0000488_ 002_ 4--
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
Starting WRFMain
[18:30:23] INFO: Checkpoint taken at 2018-04-05_06:00:00
ERROR: Out of Memory Error on compression
. 18 identical messages removed .
ERROR: Out of Memory Error on compression
[22:18:30] INFO: Checkpoint taken at 2018-04-05_12:00:00
ERROR: Out of Memory Error on compression
. 70 identical messages removed .
ERROR: Out of Memory Error on compression
[09:03:14] INFO: Checkpoint taken at 2018-04-05_18:00:00
ERROR: Out of Memory Error on compression
. 71 identical messages removed .
ERROR: Out of Memory Error on compression
[11:38:44] INFO: Checkpoint taken at 2018-04-06_00:00:00
ERROR: Out of Memory Error on compression
. 71 identical messages removed .
ERROR: Out of Memory Error on compression
[13:51:03] INFO: Checkpoint taken at 2018-04-06_06:00:00
ERROR: Out of Memory Error on compression
. 70 identical messages removed .
ERROR: Out of Memory Error on compression
[17:14:52] INFO: Checkpoint taken at 2018-04-06_12:00:00
ERROR: Out of Memory Error on compression
. 70 identical messages removed .
ERROR: Out of Memory Error on compression
[21:08:50] INFO: Checkpoint taken at 2018-04-06_18:00:00
ERROR: Out of Memory Error on compression
. 74 identical messages removed .
ERROR: Out of Memory Error on compression
00:00:58 (28276): called boinc_finish(0)
</stderr_txt>

The number of identical messages removed may be wrong by 1 or 2.
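To see how far apart the checkpoints in a log like this one are, the bracketed time stamps can be subtracted pairwise. The following is only a rough sketch: it assumes the stamps are local wall-clock times and that consecutive checkpoints are less than 24 hours apart, so a gap that crosses midnight is wrapped around. The names `checkpoint_gaps` and `as_hms` are made up for this illustration and are not part of BOINC or WRF.

```python
import re

# Excerpt of the checkpoint lines from the result log quoted in this post.
LOG = """\
[18:30:23] INFO: Checkpoint taken at 2018-04-05_06:00:00
[22:18:30] INFO: Checkpoint taken at 2018-04-05_12:00:00
[09:03:14] INFO: Checkpoint taken at 2018-04-05_18:00:00
[11:38:44] INFO: Checkpoint taken at 2018-04-06_00:00:00
"""

STAMP = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\] INFO: Checkpoint taken")
SECONDS_PER_DAY = 24 * 60 * 60


def checkpoint_gaps(log_text):
    """Return the seconds between consecutive checkpoint stamps.

    Assumption: each gap is shorter than 24 hours, so a negative
    difference means the clock passed midnight once.
    """
    stamps = [
        int(h) * 3600 + int(m) * 60 + int(s)
        for h, m, s in STAMP.findall(log_text)
    ]
    return [
        (later - earlier) % SECONDS_PER_DAY
        for earlier, later in zip(stamps, stamps[1:])
    ]


def as_hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


for gap in checkpoint_gaps(LOG):
    print(as_hms(gap))
```

These are elapsed clock times, not CPU times, so any period in which the task was suspended is included in the gap.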
||
|
|
Dangertk
Cruncher The Netherlands Joined: Oct 16, 2009 Post Count: 46 Status: Offline Project Badges:
|
Got one beta WU on Android and one on Windows. Both ran without problems (both are SCC betas and valid), taking about 2 hours on Windows and about 3.25 hours on Android. I noticed that my wingman on the Windows WU claimed 80 points while I claimed 40 points. I don't know if that's related to the beta, but it seems a bit off.
[Edit 1 times, last edit by Dangertk at Jun 19, 2019 8:17:21 AM]
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
A new kind of Beta

BETA_ ARP1_ 0000441_ 001_ 4-- Microsoft Windows 10 Core x64 Edition, (10.00.18362.00) - In Progress 6/18/19 23:37:50 6/21/19 10:25:50 0.00 0.0 / 0.0
BETA_ ARP1_ 0000441_ 001_ 3-- Microsoft Windows 10 Professional x64 Edition, (10.00.17763.00) - No Reply 6/16/19 12:49:48 6/18/19 23:37:48 0.00 0.0 / 0.0
BETA_ ARP1_ 0000441_ 001_ 2-- Microsoft Windows 10 Education x64 Edition, (10.00.17134.00) - No Reply 6/14/19 01:40:45 6/16/19 12:28:45 0.00 0.0 / 0.0
BETA_ ARP1_ 0000441_ 001_ 0-- Microsoft Windows 8.1 x64 Edition, (06.03.9600.00) 721 Pending Validation 6/7/19 01:40:39 6/8/19 06:14:44 20.64 173.7 / 0.0
BETA_ ARP1_ 0000441_ 001_ 1-- Microsoft Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00) - No Reply 6/7/19 01:40:39 6/14/19 01:40:39 0.00 0.0 / 0.0

If testing "how many do not make it to the Beta short deadline check", you've convinced me. 60% failure rate.
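The 60% figure is the share of the five copies listed above that ended as No Reply. A small sketch of that arithmetic, with the statuses typed in by hand from the list (the variable names are only illustrative):

```python
from collections import Counter

# Status of copies _4, _3, _2, _0 and _1, in the order listed in the post.
statuses = [
    "In Progress",
    "No Reply",
    "No Reply",
    "Pending Validation",
    "No Reply",
]

counts = Counter(statuses)
no_reply_rate = counts["No Reply"] / len(statuses)

print(f"No Reply: {counts['No Reply']} of {len(statuses)} ({no_reply_rate:.0%})")
```

Note that this counts the copy still in progress in the denominator; leaving it out would give 3 of 4 instead.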
||
|
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline Project Badges:
|
A new kind of Beta

BETA_ ARP1_ 0000441_ 001_ 4-- Microsoft Windows 10 Core x64 Edition, (10.00.18362.00) - In Progress 6/18/19 23:37:50 6/21/19 10:25:50 0.00 0.0 / 0.0
BETA_ ARP1_ 0000441_ 001_ 3-- Microsoft Windows 10 Professional x64 Edition, (10.00.17763.00) - No Reply 6/16/19 12:49:48 6/18/19 23:37:48 0.00 0.0 / 0.0
BETA_ ARP1_ 0000441_ 001_ 2-- Microsoft Windows 10 Education x64 Edition, (10.00.17134.00) - No Reply 6/14/19 01:40:45 6/16/19 12:28:45 0.00 0.0 / 0.0
BETA_ ARP1_ 0000441_ 001_ 0-- Microsoft Windows 8.1 x64 Edition, (06.03.9600.00) 721 Pending Validation 6/7/19 01:40:39 6/8/19 06:14:44 20.64 173.7 / 0.0
BETA_ ARP1_ 0000441_ 001_ 1-- Microsoft Windows 7 Professional x64 Edition, Service Pack 1, (06.01.7601.00) - No Reply 6/7/19 01:40:39 6/14/19 01:40:39 0.00 0.0 / 0.0

If testing "how many do not make it to the Beta short deadline check", you've convinced me. 60% failure rate.

Not sure I understand your point. The last No Reply was after a week. The other 2 were after 3 days. It's beta testing. Short deadlines are sometimes the nature of the beast.
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
[Edit 2 times, last edit by nanoprobe at Jun 19, 2019 3:02:08 PM]
||
|
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7844 Status: Offline Project Badges:
|
I finally snagged one of these (BETA_ARP1_0000442_004). It ran for a little over 30 hours on a Linux box. Checkpointed every six hours. CPU is Xeon X5650 hyperthreaded. It ran with 23 other Zika units. No problems.
Cheers
Sgt. Joe
*Minnesota Crunchers*
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
hchc said:
Looks like each work unit simulates 48 hours, with checkpoints taken at: 06:00, 12:00, 18:00, 00:00, 06:00, 12:00, 18:00.
On my i3-8100 @ 3.6 GHz, that's about 1-2.5 CPU hours between checkpoints. It's fine for a 24/7 device, but could these checkpoints maybe be doubled? So every 3 simulated hours instead of every 6 simulated hours.

Is doubling the checkpoints to every 3 simulated hours feasible?
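For a sense of what the change would mean, here is a small sketch that lists the checkpoint times for a 48-hour simulated window at a 6-hour and at a 3-hour interval. It assumes the simulation starts at 00:00 (the start date is taken from the log stamps posted earlier in this thread) and that no checkpoint is written at the very start or at the very end; `checkpoint_times` is a name invented for this example, not a WRF or BOINC function.

```python
from datetime import datetime, timedelta


def checkpoint_times(start, simulated_hours, interval_hours):
    """List checkpoint times strictly between the start and end of the run.

    Assumption: no checkpoint is taken at the start or at the final time.
    """
    step = timedelta(hours=interval_hours)
    end = start + timedelta(hours=simulated_hours)
    times = []
    current = start + step
    while current < end:
        times.append(current)
        current += step
    return times


start = datetime(2018, 4, 5, 0, 0)  # assumed start of the simulated window

every_six = checkpoint_times(start, 48, 6)
every_three = checkpoint_times(start, 48, 3)

print(len(every_six), [t.strftime("%H:%M") for t in every_six])
print(len(every_three), [t.strftime("%H:%M") for t in every_three])
```

Under these assumptions the 6-hour interval gives the seven checkpoints listed above and the 3-hour interval gives fifteen, so the work lost on an interruption would be roughly halved. Whether the application can actually write a restart state at those points is a separate question for the developers.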
|
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Jonathan said:
For this project, the only method available to validate results is to run redundant copies and check for binary equivalence.

This makes me question the way this WU was handled:
Result Name OS type OS version App Version Number Status Sent Time Time Due /
The _0 copy was sent to an x86 machine, while the _1 copy went to an x64 machine. Surely it is unlikely that machines with 32-bit and 64-bit architectures will produce binary-equivalent results, especially if floating-point calculations are involved, or am I missing something?
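As a general illustration of why binary equivalence is a strict test for floating-point work, the sketch below adds the same three numbers in two different orders and compares the raw bit patterns. It says nothing about how the ARP application or its 32-bit and 64-bit builds actually behave; it only shows that a change in evaluation order alone can alter the last bit of a double. The helper `bits` is a hypothetical name used here for illustration.

```python
import struct


def bits(x):
    """Return the IEEE 754 double-precision bit pattern of x as hex."""
    return struct.pack(">d", x).hex()


left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right, bits(left_to_right))
print(right_to_left, bits(right_to_left))
print("bitwise identical:", bits(left_to_right) == bits(right_to_left))
```

The two sums differ by about 1e-16, which is negligible numerically but enough to fail a byte-for-byte comparison of the output.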
||
|
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline Project Badges:
|
Is doubling the checkpoints to every 3 simulated hours feasible?

Very few people seem to understand that the data analyzed is not from a 6-hour period, but is the data of one fixed time, e.g. 06:00 UTC or 12:00 UTC, and not the period from 06 to 12. So checkpointing more often will be very hard, or one would have to run this on a virtual machine where one could take snapshots more often during the analysis.
||
|
|
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline Project Badges:
|
Crystal Pellet said:
Is doubling the checkpoints to every 3 simulated hours feasible? Very few people seem to understand that the data analyzed is not from a 6-hour period, but is the data of one fixed time, e.g. 06:00 UTC or 12:00 UTC, and not the period from 06 to 12.

Thanks for explaining that these are instantaneous simulations. My question to the WRF developers remains: can checkpoints be doubled or tripled? Slow devices may take 10 calendar hours between checkpoints. Even fast devices can take 1-2 hours between checkpoints. I'm wondering if any of the developers are monitoring the WCG forums.