| World Community Grid Forums
Thread Status: Active Total posts in this thread: 49
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Well, it is the weekend. I expect the errors will be examined bright and early on Monday.
New projects are always going to have some teething troubles, but we should get past it in a couple of days. We have a great team at WCG supporting the project scientists' efforts. |
||
|
|
debrouxl
Advanced Cruncher France Joined: Dec 31, 2004 Post Count: 61 Status: Offline
|
"observed at one time several hours of 'frozen' percentage."
I did notice that too. I'm running the BOINC client, version 5.4.9, under GNU/Linux on a P4 2.6 GHz with 512 MB of RAM. I'm posting here because I've just seen a freeze, and this was printed in my terminal:
*** glibc detected *** corrupted double-linked list: 0x0984cf90 ***
*** glibc detected *** corrupted double-linked list: 0x09606228 ***
Corrupted double-linked lists are indeed likely to cause a freeze and leave the application in an incorrect state. Closing the BOINC client is the only way to fix the freeze: several days ago, I left the agent running unattended for at least 24 hours, and that particular WU never unfroze (same name, same time spent, same percentage).
Freezes seldom happened with FAAH, but they happen at nearly every WU switch that shuts down an HPF2 WU. What's more, I had never seen such a message in my terminal before today... None of this proves that the issue is in the HPF2 application, though.
For now, I have set my preferences so as NOT to receive HPF2 WUs until this is sorted out, since it happens mostly with HPF2 WUs... I know this temporary measure is not satisfactory, but I'd rather not participate (I passed the 1 year mark several weeks ago) than return erroneous results... |
||
|
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Debroux Lionel, let me qualify that freezing comment, which can be 3 things I know of and maybe a fourth.
1. The "freezing of percent" means that, with non-linear computing and the WUs being chopped up into little internal segments, one segment can take a very long time, so the percent indicator does not progress.... Patience required.
2. The BOINC Manager (the front end) losing contact with the science back end. In my case I've killed it many times. The science continues in the background, and killing and restarting the BOINC Manager, and BOINCmgr.exe only, picks up where the science had progressed. The resolution in my case was adding ports 443, 1043 and 31416 to the firewall exceptions for BOINC.exe, BOINCmgr.exe and the science parts (the latter have no exe extension and may be overkill to add).
3. The true freezing. You should be able to see in the Task Manager if the science has become non-responsive. The CPU time counter in that case is likely frozen.... Then you can kill it and hope it has not damaged the work unit.... Otherwise it gets the completed message sent back and is awarded the 'error' label.
4. Don't know.
One resolution mentioned was to set your WCG BOINC profile to retain the WU in memory while pre-empting. Empirically, I have had no need for it, but others have.
cheers
----------------------------------------
WCG
Please help to make the Forums an enjoyable experience for All! |
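The "frozen CPU time counter" test in point 3 can also be run from a script on GNU/Linux, which is what debrouxl reports using. What follows is only a minimal sketch, assuming a Linux /proc filesystem; the helper names (parse_cpu_ticks, cpu_seconds, looks_frozen) and the 60-second wait are illustrative choices, not part of BOINC or the WCG agent.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST ')' before reading the numeric fields.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(after_comm[11]), int(after_comm[12])
    return utime + stime


def cpu_seconds(pid: int) -> float:
    """CPU time consumed so far by the process, in seconds (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        ticks = parse_cpu_ticks(f.read())
    return ticks / os.sysconf("SC_CLK_TCK")


def looks_frozen(pid: int, wait_seconds: float = 60.0) -> bool:
    """True if the process used no CPU time at all during the wait.

    Hypothetical helper: a work unit that is merely suspended or pre-empted
    also uses no CPU, so treat a True result as a hint, not a diagnosis.
    """
    before = cpu_seconds(pid)
    time.sleep(wait_seconds)
    return cpu_seconds(pid) == before
```

Calling looks_frozen(pid) with the process ID of the science application compares two readings taken a minute apart. It cannot tell a slow segment (case 1) from a true freeze (case 3) if the process is still burning CPU, which is exactly the distinction made above.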
||
|
|
debrouxl
Advanced Cruncher France Joined: Dec 31, 2004 Post Count: 61 Status: Offline
|
Thanks for your reply.
My problem is clearly 3, true freezing ("same name, same time spent, same percentage" in my previous message). I have hardly ever seen non-linear computing (1) on my 1000+ WUs; no firewall exceptions have ever been required for local ports (1043, 31416) on my GNU/Linux (and 443 - HTTPS - is allowed, obviously) (2).
And those messages:
*** glibc detected *** corrupted double-linked list: ... ***
indicate a bug; they never appeared before and should never appear.
BTW, since I posted my previous message, I had a freeze again... and a new
*** glibc detected *** corrupted double-linked list: ... ***
message in the terminal...
----------------------------------------
[Edit 1 times, last edit by DEBROUX Lionel at Jun 26, 2006 1:43:05 PM] |
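To rule out case 2 (the front end losing contact with the science back end), one can check whether anything on the local machine accepts TCP connections on the ports mentioned in this thread. This is a rough sketch only: the function name is made up for illustration, and whether a given BOINC version actually listens on 1043 or 31416 is an assumption taken from the posts above, not something the sketch verifies.

```python
import socket


def port_accepts_connections(port: int, host: str = "127.0.0.1",
                             timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing answers; the 'with' block closes the socket afterwards.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_accepts_connections(31416) returning False while the client is running would suggest that the manager cannot reach it either. A True result only shows that something is listening, not that it is BOINC.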
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello DEBROUX Lionel,
If you are getting a double-linked list, have you checked your file system on your disk using RUN chkdsk /F ?
Lawrence |
||
|
|
teletran
Senior Cruncher Joined: Jul 27, 2005 Post Count: 378 Status: Offline |
Out of 3 work units I have one inconclusive, one error and one valid. Hope we hear something today about the new work units and any problems with them.
---------------------------------------- |
||
|
|
debrouxl
Advanced Cruncher France Joined: Dec 31, 2004 Post Count: 61 Status: Offline
|
"If you are getting a double-linked list, have you checked your file system on your disk using RUN chkdsk /F ?"
Good idea. I just ran fsck.vfat -r -v, since I'm using GNU/Linux and the BOINC agent is on a FAT32 partition: no problems found. I have stumbled across only one other application that used to generate "corrupted double-linked list" errors, and it was a bug in the application. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Out of 3 work units I have one inconclusive, one error and one valid. Hope we hear something today about the new work units and any problems with them."
I have had 3 error out on me in the last 2 days. I never had 1 error while I was running FAAH exclusively. I am seriously considering opting out of HPF2 if this continues. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Got 5 inconclusives but no errors.
The log says....
Result Log
<core_client_version>5.4.9</core_client_version>
<stderr_txt>
Failed to open wcg_checkpoint.dat for reading. rc: 2. File doesn't exist?
Failed to open wcg_hpf2.random for reading. rc: 2. File doesn't exist?
Failed to open wcg_hpf2.random for reading. rc: 2. File doesn't exist?
Rosetta finishing with return code: 0
</stderr_txt>
I have switched off HPF2 for now till the techs see what is going on. |
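For anyone comparing several result logs like the one above, the interesting parts (client version, files that could not be opened, final return code) can be pulled out with a few regular expressions. This is a sketch under the assumption that other logs follow the same layout as this one; parse_result_log is a hypothetical helper, not a WCG or BOINC tool.

```python
import re


def parse_result_log(log: str) -> dict:
    """Extract client version, unopened files, and return code from a result log."""
    version = re.search(r"<core_client_version>(.*?)</core_client_version>", log)
    stderr = re.search(r"<stderr_txt>(.*?)</stderr_txt>", log, re.DOTALL)
    stderr_text = stderr.group(1) if stderr else ""
    # Each "Failed to open X for reading" line names one file; duplicates
    # (wcg_hpf2.random appears twice in the log above) are collapsed.
    missing = re.findall(r"Failed to open (\S+) for reading", stderr_text)
    rc = re.search(r"finishing with return code: (-?\d+)", stderr_text)
    return {
        "client_version": version.group(1) if version else None,
        "missing_files": sorted(set(missing)),
        "return_code": int(rc.group(1)) if rc else None,
    }
```

Applied to the log in this post, it reports version 5.4.9, the two file names, and return code 0. Whether the "Failed to open" lines are harmless or related to the inconclusive results is not established anywhere in this thread.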
||
|
|
teletran
Senior Cruncher Joined: Jul 27, 2005 Post Count: 378 Status: Offline |
Graham,
I've already opted out of HPF2 until we hear something. Luckily we have other work here to keep us going :)
----------------------------------------
[Edit 1 times, last edit by teletran at Jun 26, 2006 5:48:14 PM] |
||
|
|
|