World Community Grid Forums
Category: Completed Research Forum: The Clean Energy Project - Phase 2 Forum Thread: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote |
Thread Status: Active Total posts in this thread: 18
|
Author |
|
hristo_h_m
Cruncher Joined: Dec 10, 2008 Post Count: 22 Status: Offline Project Badges: |
I am careful with CEP2 WUs and manage them manually. Whenever I see that one has downloaded and I don't have 7 or 8 hours available for the crunch, I suspend that task. The next morning I resume that task alone for the day. Oh, I know, this is annoying. So this is my number one post.
Usually this kind of task ends at around 50% completion, or 12 hours of running. It is rare to see a task go to 100%, which takes around 19 hours. With this tactic my CEP2 WUs finish in less than two daytime sessions. Without this care, a WU lasts three or four days, depending on whether a checkpoint has been reached. It is a game of chance, because the WU tends to restart from the ground up each day, and the cumulative running time is around 32 hours. The first checkpoint is reached at around 36%, or after 6-7 hours of continuous running.
.... Until recently. Now CEP2 WUs restart themselves at random and the game of chance begins. I don't care about points or time to completion, but I do care when I am not sure what is going on. Is the WU wasting processor time, i.e. electricity, because of bad programming? I think the CEP2 WU cheats with the total running time. This is very wrong behavior, not of the WU but as programming practice. I have read some posts here and I feel that I am not alone. I understand that I can't verify all of this. In any case this processing time is free of charge and gets spent, for good or bad. But that allows the quality of the code to go unaudited, with bad consequences. Don't count me in on that, though. So I am sure that there is a bug in CEP2 processing, and I am almost sure that the CEP2 WU is just wasting processing time with this restarting, when VirtualBox has the ability to save the state of the task. I want to appeal to the CEP2 staff to rewrite the behavior of CEP2 WUs.
So, the problem. I am using BOINC client version 7.6.9 for windows_x86_64 and VirtualBox version 4.3.26. Plenty of resources.
11/26/2015 2:32:00 PM | | New system time (1448541120) < old system time (1448541186); clearing timeouts
11/26/2015 2:33:07 PM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 exited with zero status but no 'finished' file
11/26/2015 2:33:07 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
11/26/2015 2:33:10 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 2:33:10 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 2:33:13 PM | World Community Grid | Scheduler request completed
11/26/2015 3:30:45 PM | | New system time (1448544646) < old system time (1448544719); clearing timeouts
11/26/2015 3:31:59 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 3:31:59 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 3:32:01 PM | World Community Grid | Scheduler request completed
11/26/2015 5:29:42 PM | | New system time (1448551782) < old system time (1448551845); clearing timeouts
11/26/2015 5:30:46 PM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 exited with zero status but no 'finished' file
11/26/2015 5:30:46 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
11/26/2015 5:30:46 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 5:30:46 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 5:30:49 PM | World Community Grid | Scheduler request completed
11/26/2015 8:27:25 PM | | New system time (1448562446) < old system time (1448562581); clearing timeouts
11/26/2015 8:29:43 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 8:29:43 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 8:29:45 PM | World Community Grid | Scheduler request completed
11/26/2015 10:25:14 PM | | New system time (1448569515) < old system time (1448569645); clearing timeouts
11/26/2015 10:27:25 PM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 exited with zero status but no 'finished' file
11/26/2015 10:27:25 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
11/26/2015 10:27:28 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 10:27:28 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 10:27:31 PM | World Community Grid | Scheduler request completed
11/26/2015 10:28:42 PM | World Community Grid | task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 suspended by user
The last WCG state reached was 13 for job 0. These random restarts happen at the moment of clearing timeouts, NOT because of any error in processing the WU or anything else internal. Here is stderr.txt:
INFO: No state to restore. Start from the beginning.
[23:05:53] Number of jobs = 8
[23:05:53] Starting job 0,CPU time has been restored to 0.000000.
Quit requested: Exiting
[12:29:56] Number of jobs = 8
[12:29:56] Starting job 0,CPU time has been restored to 0.000000.
14:34:16 (3400): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:35:17] Number of jobs = 8
[14:35:17] Starting job 0,CPU time has been restored to 0.000000.
20:30:22 (5992): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[20:30:57] Number of jobs = 8
[20:30:57] Starting job 0,CPU time has been restored to 0.000000.
22:28:34 (5548): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[22:29:52] Number of jobs = 8
[22:29:52] Starting job 0,CPU time has been restored to 0.000000.
Quit requested: Exiting
[10:36:49] Number of jobs = 8
[10:36:49] Starting job 0,CPU time has been restored to 0.000000.
18:30:32 (1452): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[18:31:26] Number of jobs = 8
[18:31:26] Starting job 0,CPU time has been restored to 0.000000.
Quit requested: Exiting
[11:35:25] Number of jobs = 8
[11:35:25] Starting job 0,CPU time has been restored to 0.000000.
14:32:32 (2016): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:33:07] Number of jobs = 8
[14:33:07] Starting job 0,CPU time has been restored to 0.000000.
17:30:15 (4048): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[17:30:46] Number of jobs = 8
[17:30:46] Starting job 0,CPU time has been restored to 0.000000.
22:25:47 (4780): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[22:27:25] Number of jobs = 8
[22:27:25] Starting job 0,CPU time has been restored to 0.000000.
This WU is at its deadline and will probably end because of this. I have not opted in to beta projects and I do not expect to be involved in a mass-scale exercise in amateur programming. The donated processing time is too large not to be taken into account. So is the CEP2 project a beta? If so, there should be a reward for total processing time, with or without a good result returned. Who is with me? Or should I just quietly leave this project, as many of you volunteers have already done? Sorry in advance for mistakes in language or style, but I am pretty sure the text is understandable and the idea will be grasped. |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7545 Status: Offline Project Badges: |
CEP2 is an opt-in project for a reason. CEP2 only checkpoints when each of its subtasks completes. Task "0" is the subtask which takes the longest time to checkpoint. If the task has not reached a checkpoint when you suspend or stop your system, it will revert to the beginning and start over. This is not bad programming. By the messages in your stderr.txt file it looks like your system may be too busy or is not able to keep up with the disk activity, as this is a very I/O-intensive program. The message "No heartbeat from core client for 30 sec - exiting No heartbeat: Exiting" I believe indicates the system is too busy. This may not be the project which is right for you or your machine. If you care to tell us your machine specs and operating system, we may be able to make further recommendations. Hope this helps. Cheers
Sgt. Joe
*Minnesota Crunchers* |
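A minimal sketch of the checkpoint behavior described above, assuming a task made of 8 jobs (the count shown in the stderr.txt earlier in the thread) whose progress is saved only when a job finishes. The job durations and the helper names `run_session` and `sessions_needed` are hypothetical, chosen only for illustration; they are not taken from CEP2 or Q-Chem.

```python
# Hypothetical per-job run times in hours; only the job count (8) comes from the thread.
JOB_HOURS = [6.5, 2.0, 2.0, 1.5, 1.5, 1.5, 1.0, 1.0]


def run_session(job_hours, completed_jobs, session_hours):
    """Return how many jobs are complete after one uninterrupted session.

    Progress is saved only at job boundaries, so a job that is cut off
    part-way contributes nothing and must be redone from its start.
    """
    remaining = session_hours
    for hours in job_hours[completed_jobs:]:
        if hours > remaining:
            break  # interrupted mid-job: the partial work is discarded
        remaining -= hours
        completed_jobs += 1
    return completed_jobs


def sessions_needed(job_hours, session_hours, max_sessions=100):
    """Count sessions until every job is done, or return None if never."""
    done = 0
    for session in range(1, max_sessions + 1):
        done = run_session(job_hours, done, session_hours)
        if done == len(job_hours):
            return session
    return None


print(sessions_needed(JOB_HOURS, 4.0))  # None: job 0 never reaches its checkpoint
print(sessions_needed(JOB_HOURS, 8.0))  # 3
```

Under these assumed durations, any session shorter than the first job makes no lasting progress at all, which matches the "restarts from the beginning" pattern reported in the thread.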
||
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Yes, the heartbeat issue is very much a system overload question. If the science app cannot communicate with the core client for 30 seconds, the operation is either reset by the core client or considered lost [to prevent rogue / zombie processes].
In this case it looks like timekeeping is an issue too: New system time (1448541120) < old system time (1448541186); clearing timeouts. This means the clock was adjusted backwards by 66 seconds, i.e. a differential of more than 30 s. |
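The same subtraction can be applied to every "clearing timeouts" entry in the message log posted at the top of the thread. This is a small sketch, assuming only that the message keeps the wording shown in that log; the `clock_steps` helper and the regular expression are illustrative and not part of BOINC.

```python
import re

# Matches the BOINC message text as it appears in the log quoted in this thread.
STEP_PATTERN = re.compile(
    r"New system time \((\d+)\) < old system time \((\d+)\)"
)


def clock_steps(log_text):
    """Return the backward clock step, in seconds, for each matching entry."""
    return [int(old) - int(new) for new, old in STEP_PATTERN.findall(log_text)]


# The five entries from the log dated 11/26/2015 in the first post.
LOG = """
New system time (1448541120) < old system time (1448541186); clearing timeouts
New system time (1448544646) < old system time (1448544719); clearing timeouts
New system time (1448551782) < old system time (1448551845); clearing timeouts
New system time (1448562446) < old system time (1448562581); clearing timeouts
New system time (1448569515) < old system time (1448569645); clearing timeouts
"""

HEARTBEAT_LIMIT_S = 30  # the limit quoted in the stderr.txt "No heartbeat" message

steps = clock_steps(LOG)
print(steps)  # [66, 73, 63, 135, 130]
print(all(step > HEARTBEAT_LIMIT_S for step in steps))  # True
```

Every backward step in that log is larger than the 30-second heartbeat limit, which is consistent with the clock explanation, though it does not by itself prove that the clock step is what ended each run.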
||
|
hristo_h_m
Cruncher Joined: Dec 10, 2008 Post Count: 22 Status: Offline Project Badges: |
Reply to Sgt. Joe [Nov 27, 2015 2:34:26 AM]
link https://secure.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508535
Okay, I omitted the specs of the computer for brevity. In short, this computer has already completed a few tasks without a problem, with my help to keep it efficient. Here it is:
CPU: Q9450 at 2 GHz, but 2.66 GHz for this task
OS: Microsoft Windows XP: Professional x64 Edition, Service Pack 2
Memory: 8 GB
Disk free: 130 GB
11/27/2015 11:58:51 AM | World Community Grid | General prefs: using your defaults
11/27/2015 11:58:51 AM | | Reading preferences override file
11/27/2015 11:58:51 AM | | Preferences:
11/27/2015 11:58:51 AM | | max memory usage when active: 4062.66MB
11/27/2015 11:58:51 AM | | max memory usage when idle: 7312.79MB
11/27/2015 11:59:01 AM | | max disk usage: 10.00GB
11/27/2015 11:59:01 AM | | (to change preferences, visit a project web site or select Preferences in the Manager)
11/27/2015 11:59:01 AM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 is 0.15 days overdue; you may not get credit for it. Consider aborting it.
11/27/2015 11:59:01 AM | | Not using a proxy
11/27/2015 11:59:43 AM | World Community Grid | task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 aborted by user
11/27/2015 11:59:44 AM | World Community Grid | Computation for task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 finished
Let me remind you of the minimum requirements for this project:
CEP2: memory 1024 MB, disk 2048 MB, Windows / Mac / Linux
I will add this info. Yes, the checkbox for leaving applications in memory is ticked, but it is irrelevant to this issue. Yes, there is a lot of disk I/O activity. That is irrelevant too. BOINC is running just this one WU. Now to the main point that I am trying to communicate to you. The total memory used by the task is, let's say, 150 MB physical and 250 MB virtual. The total disk space occupied is 300 MB. In sum, that is under the 1 GB minimum requirement. So why is the disk I/O so intensive when everything could fit in RAM?
Then, the disk I/O traffic does not justify the omission of retrievable checkpoints. For example, a checkpoint would be a good use of disk I/O anyway, because the CEP2 application writes at least 2 GB to disk in the time that elapses before the checkpoint. That is twice the minimum requirement for the whole subproject. (Again I emphasize that VirtualBox can take a snapshot of the whole state of the operating system with the applications running in it.) But the CEP2 project declines to do this, even though in the final count it writes that much to disk anyway. Instead it restarts and deletes useful data on the disk. This is plain stupid and kills a disk, especially if it is an SSD. Internally the application restricts itself to 512 MB of RAM and 32 GB of tmp files. In my view no crunched task comes close to these internal limits, as most of you can confirm. So the application does not throw an exception saying that it is busy with, let's say, your favorite disk I/O activity. It does not throw anything indicating an error of any kind. Its behavior is simply to waste watts. I repeat: there is something wrong. I have read not just new posts but old ones too, and I can affirm this on behalf of many disoriented volunteers. |
||
|
hristo_h_m
Cruncher Joined: Dec 10, 2008 Post Count: 22 Status: Offline Project Badges: |
Reply to SekeRob* [Nov 27, 2015 9:11:06 AM]
link https://secure.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508540
Quote: "Which means the clock was adjusted backwards by 66 seconds i.e > 30s differential."
Do you really think that the numbers in New system time (1448541120)... are seconds? That is more seconds than there are in 30 years. There is one way or the other: to write a piece of code, or to write a piece of something else, as CEP2 did. There are two words, speed and efficiency, and nothing in between matters. I repeat myself: the chosen WU behavior is inefficient and far from optimal. Half of my specs would suffice for it for a long time. |
||
|
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
Please be aware that BOINC does timekeeping in offset to the Unix start date 1-1-1970. Deduct new time from old time and you get the differential correction. Timekeeping is a *very* important piece in ensuring nothing goes bad or wild with the results, part of the GIGO prevention. Your system clearly has trouble with this key-part.
As was noted, the project is *opt-in* for reasons such as higher hardware requirements and longer up-time. Remedy the clock issue and you will mostly be fine running CEP2. |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2069 Status: Recently Active Project Badges: |
Yes, 1448541120 is the number of seconds. The beginning of time (for computers) is 1 Jan 1970. This is called the "Unix epoch". See: https://en.wikipedia.org/wiki/Unix_time
You can find the current Unix timestamp at http://www.unixtimestamp.com/. [Edit 4 times, last edit by adriverhoef at Nov 27, 2015 12:39:22 PM] |
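As a quick check of the epoch arithmetic, here is a short sketch that converts the timestamp from the log into a calendar date and into years since 1 Jan 1970. The helper names are illustrative, and the 365.25-day year is an approximation.

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # approximate average year length


def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def years_since_epoch(seconds):
    """Approximate number of years that the timestamp represents."""
    return seconds / SECONDS_PER_YEAR


print(epoch_to_utc(1448541120).isoformat())     # 2015-11-26T12:32:00+00:00
print(round(years_since_epoch(1448541120), 1))  # 45.9
```

The result, 12:32:00 UTC on 26 Nov 2015, lines up with the 2:32:00 PM local timestamp on that log entry if the host clock was set to UTC+2, and 45.9 years is indeed more than 30 years' worth of seconds.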
||
|
hristo_h_m
Cruncher Joined: Dec 10, 2008 Post Count: 22 Status: Offline Project Badges: |
Reply to SekeRob* [Nov 27, 2015 11:47:47 AM]
link http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508547
and adriverhoef [Nov 27, 2015 11:50:03 AM]
link http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508548
Yes, you are right! I was blinded by rage and forgot this knowledge, for one reason: because it is IRRELEVANT. Let us keep to the topic. So in essence, the core client keeps old-style ticking while BOINC uses more precise timekeeping (or vice versa, it doesn't matter), and after four hours this discrepancy has grown to a whopping 30 seconds. Or else this may be due to HPET being enabled in the BIOS: the high precision event timer keeps time with microsecond-range precision while the kernel ticks in the millisecond range, and the two are independent. Either way there is an accumulated difference of 30 seconds in four hours. That is strange for a modern computer and software, and thus totally implausible. Is this irrelevant? YES, again. My comparative analysis tells me that other BOINC tasks survive clearing timeouts (the only visible effect is an adjustment of the percentage of work done); only the CEP2 task cannot do that, which says something. I repeat myself: CEP2 has a problem with checkpointing (plus restarting). This leads to inefficiency. For me this is obvious.
This is my last post. I have already disabled this project. Like I said, I am not bothered about badges and points. But this project, even with a bug in it, is not very different from MCM or OET, because they all use brute-force methods made possible by distributed computing, and they are algorithmically insufficient. My argument about MCM and OET is that these projects use a stochastic variational method, which is something like throwing darts and hoping to hit the global minimum when there is a vast number of comparable minima, as is always the case with living structures. There is no such thing as a global minimum in such a huge complex system! What is the difference? Those projects are fair to volunteers. And, importantly, they keep the processing time elapsed, or at least come close to it.
Personally I don't believe that these projects are more meaningful than, say, SETI, which is a byword for meaninglessness. (Note: I participate in it too.) But this link must say something about all this. http://www.gwern.net/Charity%20is%20not%20about%20helping My comment there is:
My 2cent in 5PFlops • 5 months ago
I agree with your opinion that Fold@home must take some actions and I would like to make these comments. First, I contribute exclusively in winter. I don't care about points and teams and badges and papers (I don't read them, but I want to think that they are good). I contribute about 70 W x 8 h x 3 months. Secondly, I use only my "new" computer for this. My video card is a 40 nm 15 W Evergreen, and the processor is a 45 nm 65 W Q8200 @ 2 GHz. I don't use any heater farms such as Pentium 4 or Athlon XP running 24/7, unless it is an Ice Age. Third, I contribute when I am actually using my computer. At idle it consumes, let's say, 120 W, and at max load 200 W. But I actually run one WU at a time. If everyone is like me, then your math does not work out very well. It is good reading for every folder though.
End of quote.
I repeat that a project needs to be sufficiently efficient to be unleashed in the wild. CEP2 is not, and debugging information is missing. It has beta behavior but not beta status. I wait for relevant posts. Cheers crunchers |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7545 Status: Offline Project Badges: |
The CEP2 project uses software called Q-Chem. It is quantum computational software for molecular modeling. It is a complex piece of software. If you are familiar with programming in C, C++ and Fortran and have knowledge of quantum chemistry, I am sure the CEP2 team would love to have your input on how to improve the algorithms for this proprietary program for greater efficiency.
The cleanenergy scientist explains a little of the program here. In regards to your hardware: the Q9450 is a slightly faster chip than a Q6600. I have a Q6600 and would not think of running CEP2 on it because it is just too slow to be effective, especially if my machine was only on for 12 hours at a time. Some of the smaller molecules would work, but the bigger ones would time out or not be able to checkpoint the first task within the time period the machine is turned on. Also I note you are using XP service pack 2. I believe the latest service pack for XP was 4 at the time Microsoft pulled support for this software. I am sure your contributions for CEP2 are appreciated, but given the constraints of your hardware, software and the amount of time your machine is turned on, other projects within WCG are probably more appropriate given your circumstances and will result in greater efficiencies for you. Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Jim1348
Veteran Cruncher USA Joined: Jul 13, 2009 Post Count: 1066 Status: Offline Project Badges: |
VirtualBox may be a good idea for a variety of reasons, but I am sure they are not going to adopt it now in mid-stream; maybe the next time around. I used to use a Q6600, but it was on 24/7, and I don't know how it would work now with the current work units. If you don't understand why the project is the way it is, I recommend you read the earliest posts. There is a reason this is the only project with a maximum "Number of work units per host" setting.
EDIT: Note the very high write-rate to disk for this project. A Q6600 will do hundreds of gigabytes per day (close to 1TB/day) if all 4 cores are used. That is hard on a mechanical drive, and early death to an SSD (though you might get a few years out of it). It may account for the heartbeat issue also. I put the BOINC Data folder on a ramdisk for that reason, or a large write-cache will work. The error rate is practically zero on my machines. [Edit 2 times, last edit by Jim1348 at Nov 27, 2015 5:28:25 PM] |
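To put the write-rate figure in perspective, here is a rough back-of-the-envelope sketch. The 1 TB/day rate is the figure quoted above for four cores; the 150 TBW endurance rating is purely an assumed example value, not a number from this thread or from any specific drive, and the `days_until_rated_endurance` helper is hypothetical.

```python
ASSUMED_RATED_TBW = 150.0  # hypothetical vendor endurance rating, in terabytes written


def days_until_rated_endurance(rated_tbw, tb_written_per_day):
    """Days of writing at a constant rate before the rated endurance is used up."""
    if tb_written_per_day <= 0:
        raise ValueError("write rate must be positive")
    return rated_tbw / tb_written_per_day


# About 1 TB/day with all 4 cores (figure from the post above),
# and a quarter of that if only one core runs CEP2 (an assumed linear scaling).
print(days_until_rated_endurance(ASSUMED_RATED_TBW, 1.0))   # 150.0
print(days_until_rated_endurance(ASSUMED_RATED_TBW, 0.25))  # 600.0
```

Endurance ratings vary a great deal between drives, so the result is only as good as the rating plugged in; the point is that the write rate, not the elapsed time, sets the budget.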
||
|
|