Thread Status: Locked
Total posts in this thread: 12
Posts: 12   Pages: 2   [ 1 2 | Next Page ]
Previous Thread This topic has been viewed 2799 times and has 11 replies Next Thread
GenCom.org
Cruncher
Joined: Nov 27, 2006
Post Count: 17
Status: Offline
[Closed] All HPF2 Workunits with Status "Error" since this morning

Hi,

Since this morning, a lot of my Human Proteome Folding 2 workunits show the status "Error", although the BOINC agent log shows that all of them finished correctly.


Here are some examples:

la652_ 00084-- Error 03/20/2007 07:43:54 03/22/2007 07:29:04 11.80 76.1 / 0.0
la646_ 00052-- Error 03/19/2007 21:17:04 03/22/2007 07:20:12 3.82 75.4 / 0.0
la663_ 00025-- Error 03/20/2007 22:36:12 03/22/2007 06:33:01 6.76 64.0 / 0.0
la649_ 00020-- Error 03/20/2007 00:45:21 03/22/2007 06:15:29 3.37 70.7 / 0.0
la645_ 00030-- Error 03/19/2007 19:59:52 03/22/2007 05:45:27 3.62 71.5 / 0.0
la644_ 00007-- Error 03/19/2007 19:05:53 03/22/2007 05:40:16 4.06 80.0 / 0.0
la636_ 00011-- Error 03/18/2007 19:31:03 03/22/2007 05:39:25 15.07 98.5 / 0.0
la653_ 00088-- Error 03/20/2007 09:31:02 03/22/2007 04:23:45 8.81 93.0 / 0.0
la647_ 00018-- Error 03/19/2007 21:07:29 03/22/2007 02:55:29 3.23 67.9 / 0.0
la647_ 00020-- Error 03/19/2007 21:07:29 03/22/2007 02:50:18 3.22 67.6 / 0.0

And the BOINC log:

22/03/2007 08:10:18|World Community Grid|Computation for task la646_00052_17 finished
22/03/2007 07:09:47|World Community Grid|Computation for task la649_00020_12 finished
22/03/2007 05:05:42|World Community Grid|Computation for task la647_00006_9 finished
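To cross-check the two listings above without doing it by eye, here is a rough Python sketch. It assumes that a log task name such as la646_00052_17 and a status row such as "la646_ 00052--" refer to the same workunit (the trailing number being a copy suffix); that mapping is a guess from the excerpts, not something WCG documents here. The helper names are made up for illustration.

```python
import re

# Pattern for the "finished" lines exactly as they appear in the log excerpt.
FINISHED = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<project>[^|]+)\|Computation for task (?P<task>\S+) finished$"
)


def finished_tasks(log_lines):
    """Return the task names that the client reports as finished."""
    tasks = []
    for line in log_lines:
        match = FINISHED.match(line.strip())
        if match:
            tasks.append(match.group("task"))
    return tasks


def workunit_key(name):
    """Reduce a task or result name to 'batch_number'.

    Assumption: 'la646_00052_17' (log) and 'la646_ 00052--' (status page)
    name the same workunit.
    """
    match = re.match(r"([a-z]+\d+)_\s*(\d+)", name)
    return f"{match.group(1)}_{match.group(2)}" if match else None


log = [
    "22/03/2007 08:10:18|World Community Grid|Computation for task la646_00052_17 finished",
    "22/03/2007 07:09:47|World Community Grid|Computation for task la649_00020_12 finished",
    "22/03/2007 05:05:42|World Community Grid|Computation for task la647_00006_9 finished",
]
status_rows = ["la646_ 00052-- Error", "la649_ 00020-- Error", "la647_ 00018-- Error"]

done = {workunit_key(task) for task in finished_tasks(log)}
errored = {workunit_key(row) for row in status_rows}

# Workunits that finished cleanly on the client but are marked "Error" on the site.
both = sorted(done & errored)
print(both)
```

Any workunit that shows up in both sets finished without complaint on the client yet was marked "Error" on the site, which is the mismatch described in this post.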

Any ideas?

Regards
----------------------------------------

----------------------------------------
[Edit 4 times, last edit by GenCom.org at Mar 22, 2007 3:32:58 PM]
[Mar 22, 2007 8:19:24 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

Is this one machine (e.g. a quad) or several (dual core / single core)?
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
[Mar 22, 2007 9:30:40 AM]
GenCom.org
Cruncher
Joined: Nov 27, 2006
Post Count: 17
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

It is on several computers (QX6700, E6300, E4300, Xeon 3.0, P-IV 3.2, Sempron 2800+, ...). It seems that these errors started with this unit:

la652_ 00052-- Error 20/03/2007 06:34:34 21/03/2007 11:36:01 6,17 70,2 / 0,0
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by GenCom.org at Mar 22, 2007 9:39:08 AM]
[Mar 22, 2007 9:33:37 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

Do they all run the same antivirus program, with a definition update that happened this morning?

The spread of dates received and the time block of completion/return since errors started suggest a local problem. No one else has reported this so far.
----------------------------------------
[Mar 22, 2007 9:43:57 AM]
GenCom.org
Cruncher
Joined: Nov 27, 2006
Post Count: 17
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

Not the same for all: some computers use Avast, some eTrust, and some run without any antivirus.

Have a look at the details for la658_ 00071, la658_ 00025, la657_ 00032 or la652_ 00052; you will see that everyone has an error status.
----------------------------------------

----------------------------------------
[Edit 2 times, last edit by GenCom.org at Mar 22, 2007 10:03:48 AM]
[Mar 22, 2007 9:56:17 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

Quote:
"Not the same for all: some computers use Avast, some eTrust, and some run without any antivirus.

Have a look at the details for la652_ 00052; you will see that everyone has an error status."


Now that last bit is very useful info: checking the result status detail is a basic way to verify whether a problem is local or widespread. That's why I've compiled the 'Issue?' Q&A under the link in my signature.

Will alert staff! In the meantime, could you post a sample of the quorum detail, as I cannot see it... only the techs can, and they are still asleep.
----------------------------------------
[Mar 22, 2007 10:01:59 AM]
GenCom.org
Cruncher
Joined: Nov 27, 2006
Post Count: 17
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

Some units now have the status "Too Late"!

la648_ 00045 Too Late 03/19/2007 22:35:15 03/22/2007 10:03:38 3.63 71.7 / 0.0

Here is the detail for this one (https://secure.worldcommunitygrid.org/ms/devi...tus.do?workunitId=4555271)

la648_ 00045-- Error 21/03/2007 01:43:02 21/03/2007 12:07:57 6,95 87,2 / 0,0
la648_ 00045-- Error 20/03/2007 12:52:10 21/03/2007 07:48:23 3,57 50,5 / 0,0
la648_ 00045-- Error 20/03/2007 03:30:17 20/03/2007 13:53:19 5,41 64,5 / 0,0
la648_ 00045-- Error 20/03/2007 00:52:24 21/03/2007 01:38:30 0,00 0,0 / 0,0
la648_ 00045-- Error 19/03/2007 23:30:15 21/03/2007 00:20:42 8,54 66,5 / 0,0
la648_ 00045-- Error 19/03/2007 23:06:48 20/03/2007 03:29:08 2,68 17,2 / 0,0
la648_ 00045-- In Progress 19/03/2007 23:02:34 28/03/2007 23:02:34 0,00 0,0 / 0,0
la648_ 00045-- Error 19/03/2007 23:02:21 20/03/2007 13:40:36 7,65 71,5 / 0,0
la648_ 00045-- In Progress 19/03/2007 22:57:27 28/03/2007 22:57:27 0,00 0,0 / 0,0
la648_ 00045-- Error 19/03/2007 22:56:46 20/03/2007 20:49:59 11,95 84,1 / 0,0
la648_ 00045-- Error 19/03/2007 22:49:27 21/03/2007 18:07:25 4,79 53,9 / 0,0
la648_ 00045-- Too Late 19/03/2007 22:35:15 22/03/2007 10:03:38 3,63 71,7 / 0,0
la648_ 00045-- Error 19/03/2007 22:34:02 20/03/2007 08:15:57 8,19 76,4 / 0,0
la648_ 00045-- Error 19/03/2007 22:29:26 22/03/2007 01:14:57 11,34 71,4 / 0,0
la648_ 00045-- Error 19/03/2007 22:27:56 20/03/2007 17:48:12 6,74 50,0 / 0,0
la648_ 00045-- In Progress 19/03/2007 22:27:44 28/03/2007 22:27:44 0,00 0,0 / 0,0
la648_ 00045-- Error 19/03/2007 22:26:47 20/03/2007 22:55:51 11,53 59,4 / 0,0
la648_ 00045-- Error 19/03/2007 22:26:37 21/03/2007 05:35:47 14,02 67,6 / 0,0
la648_ 00045-- Error 19/03/2007 22:25:36 20/03/2007 12:47:55 8,33 15,0 / 0,0
la648_ 00045-- Error 19/03/2007 22:25:14 20/03/2007 04:07:30 3,63 64,6 / 0,0
la648_ 00045-- In Progress 19/03/2007 22:18:26 28/03/2007 22:18:26 0,00 0,0 / 0,0
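
For anyone who wants to tally a quorum listing like this one rather than read it row by row, a minimal Python sketch follows. It assumes the column order seen in the rows (name, status, two date/time pairs, CPU hours, claimed / granted credit) and accepts the status label in French or English, with decimal commas. `parse_row` and `summarize` are hypothetical helper names, and only five of the rows are copied in as sample data.

```python
from collections import Counter

# French status labels as they can appear on the site, mapped to English.
STATUS_EN = {"Erreur": "Error", "En cours": "In Progress", "Trop tard": "Too Late"}


def parse_row(row):
    """Split one listing row into (result_name, status, cpu_hours)."""
    name, rest = row.split("--", 1)
    fields = rest.split()
    # Counting from the end: "claimed / granted" is 3 tokens, CPU hours is 1,
    # and the two date/time pairs are 4. Whatever precedes them is the status.
    status = " ".join(fields[:-8])
    cpu_hours = float(fields[-4].replace(",", "."))
    return name.strip(), STATUS_EN.get(status, status), cpu_hours


def summarize(rows):
    """Count statuses and compute the share of returned results that errored."""
    counts = Counter(parse_row(row)[1] for row in rows)
    returned = sum(n for status, n in counts.items() if status != "In Progress")
    error_share = counts["Error"] / returned if returned else 0.0
    return counts, error_share


rows = [
    "la648_ 00045-- Erreur 21/03/2007 01:43:02 21/03/2007 12:07:57 6,95 87,2 / 0,0",
    "la648_ 00045-- Erreur 20/03/2007 00:52:24 21/03/2007 01:38:30 0,00 0,0 / 0,0",
    "la648_ 00045-- En cours 19/03/2007 23:02:34 28/03/2007 23:02:34 0,00 0,0 / 0,0",
    "la648_ 00045-- Trop tard 19/03/2007 22:35:15 22/03/2007 10:03:38 3,63 71,7 / 0,0",
    "la648_ 00045-- Erreur 19/03/2007 22:29:26 22/03/2007 01:14:57 11,34 71,4 / 0,0",
]

counts, error_share = summarize(rows)
print(dict(counts), round(error_share, 2))
```

If nearly every returned copy of the same workunit is in error across many different members' machines, a problem on one member's computers is an unlikely explanation.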
----------------------------------------

[Mar 22, 2007 10:07:56 AM]
olympic
Senior Cruncher
Joined: Jun 12, 2005
Post Count: 156
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

Same here, I'm seeing approximately a 50% error rate with HPF2. The WU takes the full amount of CPU time and finishes normally with no error messages in the BOINC log. I also have one listed as "too late" even though it was returned within 3 days.

I had turned off HPF2 a while back due to occasional errors; boy, did I pick a bad time to turn it back on! Aborting the rest in queue ASAP.
----------------------------------------

[Mar 22, 2007 10:58:23 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

It's very possibly just an error in the server-side validation process, as the 'too late' status simply should not occur for results returned in time. I suggest you Suspend rather than Abort for now.

You may not have seen it yet, but WUs can now be aborted remotely with version 5.8 and up (but only if the client initiated the contact with the servers!). See here for how it works/worked for Genome Comparison: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=12459#90438

The remote abort routine was introduced in an accelerated development push by WCG. The function already existed client side, but not server side... we're on BOINC server v 5.09 now, according to the message logs.
----------------------------------------
----------------------------------------
[Edit 2 times, last edit by Sekerob at Mar 22, 2007 2:11:47 PM]
[Mar 22, 2007 11:01:10 AM]
GenCom.org
Cruncher
Joined: Nov 27, 2006
Post Count: 17
Status: Offline
Re: A Lot of Workunits with Status "Error" since this morning

It seems that everything is working again, great job.
----------------------------------------

[Mar 22, 2007 3:32:26 PM]