Thread Status: Active
Total posts in this thread: 18
Posts: 18   Pages: 2   [ 1 2 | Next Page ]
This topic has been viewed 2292 times and has 17 replies
hristo_h_m
Cruncher
Joined: Dec 10, 2008
Post Count: 22
Status: Offline
Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

I am careful with CEP2 work units (WUs) and control them manually. Whenever I see that one has downloaded and I don't have 7 or 8 hours available to crunch it, I suspend the task. The next morning I resume that task alone for the day. Oh, I know, this is annoying. So this is my number one post.

Usually this kind of task ends at around 50% completion, after about 12 hours of running.
It is rare to see a task go all the way to 100%, which takes around 19 hours.

With this tactic, a CEP2 WU finishes in less than two days of daytime running.
Without this care, a WU lasts three or four days, depending on whether a checkpoint is reached. It is a game of chance, because these WUs tend to restart from scratch each day, and the cumulative running time is around 32 hours.

The first checkpoint is reached at around 36%, after 6-7 hours of continuous running.

.... Until recently.
Now CEP2 WUs restart themselves at random, and the game of chance begins.

I don't care about points or time to completion, but I do care when I am not sure what is going on.
Is the WU wasting processor time, i.e. electricity, because of bad programming?

I think the CEP2 WU cheats with the total running time. This is very wrong behavior, not on the part of the WU but as a programming practice. I have read some posts here and I feel that I am not alone.

I understand that I can't check all of this. This processing time is free of charge anyway and gets spent, for good or bad. But that makes it possible to skip auditing the quality of the code, with bad consequences. Count me out of that, though.

So I am sure that there is a bug in CEP2 processing, and I am almost sure that the CEP2 WU is just wasting processing time with these restarts, when VirtualBox has the ability to save the state of the task.

I want to appeal to the CEP2 staff to rewrite the behavior of CEP2 WUs.

So, the problem.
I am using BOINC client version 7.6.9 for windows_x86_64 and VirtualBox version 4.3.26. Plenty of resources.

11/26/2015 2:32:00 PM | | New system time (1448541120) < old system time (1448541186); clearing timeouts
11/26/2015 2:33:07 PM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 exited with zero status but no 'finished' file
11/26/2015 2:33:07 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
11/26/2015 2:33:10 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 2:33:10 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 2:33:13 PM | World Community Grid | Scheduler request completed
11/26/2015 3:30:45 PM | | New system time (1448544646) < old system time (1448544719); clearing timeouts
11/26/2015 3:31:59 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 3:31:59 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 3:32:01 PM | World Community Grid | Scheduler request completed
11/26/2015 5:29:42 PM | | New system time (1448551782) < old system time (1448551845); clearing timeouts
11/26/2015 5:30:46 PM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 exited with zero status but no 'finished' file
11/26/2015 5:30:46 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
11/26/2015 5:30:46 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 5:30:46 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 5:30:49 PM | World Community Grid | Scheduler request completed
11/26/2015 8:27:25 PM | | New system time (1448562446) < old system time (1448562581); clearing timeouts
11/26/2015 8:29:43 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 8:29:43 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 8:29:45 PM | World Community Grid | Scheduler request completed
11/26/2015 10:25:14 PM | | New system time (1448569515) < old system time (1448569645); clearing timeouts
11/26/2015 10:27:25 PM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 exited with zero status but no 'finished' file
11/26/2015 10:27:25 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
11/26/2015 10:27:28 PM | World Community Grid | Sending scheduler request: Requested by project.
11/26/2015 10:27:28 PM | World Community Grid | Not requesting tasks: "no new tasks" requested via Manager
11/26/2015 10:27:31 PM | World Community Grid | Scheduler request completed
11/26/2015 10:28:42 PM | World Community Grid | task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 suspended by user

The last WCG state reached was 13 for job 0. These random restarts happen at the moment of clearing timeouts, NOT because of any error in processing the WU or anything internal.

Here is stderr.txt:
INFO: No state to restore. Start from the beginning.
[23:05:53] Number of jobs = 8
[23:05:53] Starting job 0,CPU time has been restored to 0.000000.
Quit requested: Exiting
[12:29:56] Number of jobs = 8
[12:29:56] Starting job 0,CPU time has been restored to 0.000000.
14:34:16 (3400): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:35:17] Number of jobs = 8
[14:35:17] Starting job 0,CPU time has been restored to 0.000000.
20:30:22 (5992): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[20:30:57] Number of jobs = 8
[20:30:57] Starting job 0,CPU time has been restored to 0.000000.
22:28:34 (5548): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[22:29:52] Number of jobs = 8
[22:29:52] Starting job 0,CPU time has been restored to 0.000000.
Quit requested: Exiting
[10:36:49] Number of jobs = 8
[10:36:49] Starting job 0,CPU time has been restored to 0.000000.
18:30:32 (1452): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[18:31:26] Number of jobs = 8
[18:31:26] Starting job 0,CPU time has been restored to 0.000000.
Quit requested: Exiting
[11:35:25] Number of jobs = 8
[11:35:25] Starting job 0,CPU time has been restored to 0.000000.
14:32:32 (2016): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:33:07] Number of jobs = 8
[14:33:07] Starting job 0,CPU time has been restored to 0.000000.
17:30:15 (4048): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[17:30:46] Number of jobs = 8
[17:30:46] Starting job 0,CPU time has been restored to 0.000000.
22:25:47 (4780): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[22:27:25] Number of jobs = 8
[22:27:25] Starting job 0,CPU time has been restored to 0.000000.

This WU is at its deadline and will probably be ended because of this.
I did not opt in to the beta project, and I do not expect to be involved in a mass-scale exercise in amateur programming.

The donated processing time is too large not to be taken into account.
So is the CEP2 project a beta? Then there should be a reward for the total processing time, with or without a good result returned.

Who is with me? Or should I just quietly leave this project, as many of you volunteers have, choosing exodus as the way out of the dilemma?

Sorry in advance for mistakes in language or style, but I am pretty sure the text is comprehensible and the idea will be grasped.
[Nov 26, 2015 10:29:23 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7545
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

CEP2 is an opt-in project for a reason. CEP2 only checkpoints when each of its subtasks completes. Task "0" is the subtask which takes the longest time to checkpoint. If the task has not reached a checkpoint when you suspend or stop your system, it will revert to the beginning and start over. This is not bad programming. By the messages in your stderr.txt file it looks like your system may be too busy or is not able to keep up with the disk activity, as this is a very I/O intensive program. The messages "No heartbeat from core client for 30 sec - exiting" and "No heartbeat: Exiting", I believe, indicate the system is too busy.
This may not be the project which is right for you or your machine. If you care to tell us your machine specs and operating system, we may be able to make further recommendations. Hope this helps.
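To make the checkpoint behavior concrete, here is a minimal simulation sketch. It is not the CEP2 code: the function name `simulate` and the one-hour durations for jobs 1 to 7 are made-up assumptions, while the 8 jobs and the roughly 6-7 hours for job 0 come from the first post. It only illustrates that any session shorter than job 0 loses all of its work.

```python
def simulate(job_hours, sessions):
    """Simulate a task that checkpoints only when a whole job completes.

    job_hours: hours each job needs (hypothetical figures).
    sessions:  uninterrupted hours available in each crunching session.
    """
    completed = 0   # number of jobs checkpointed so far
    wasted = 0.0    # hours spent on jobs that were interrupted
    for n, available in enumerate(sessions, start=1):
        remaining = available
        # Run whole jobs for as long as the next one still fits in the session.
        while completed < len(job_hours) and job_hours[completed] <= remaining:
            remaining -= job_hours[completed]
            completed += 1
        if completed == len(job_hours):
            return {"sessions": n, "wasted_hours": wasted, "finished": True}
        # Partial progress on the current job is discarded at shutdown.
        wasted += remaining
    return {"sessions": len(sessions), "wasted_hours": wasted, "finished": False}


jobs = [6.5, 1, 1, 1, 1, 1, 1, 1]      # job 0 is the long one
print(simulate(jobs, [3, 3, 3, 3, 3]))  # never gets past job 0
print(simulate(jobs, [8, 8]))           # finishes in the second session
```

With five 3-hour sessions the sketch reports 15 wasted hours and no finished task; with two 8-hour sessions only half an hour is lost.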
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 27, 2015 2:34:26 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

Yes, the heartbeat issue is very much a system overload question. If the science app cannot communicate with the core client for 30 seconds, the operation is either reset by the core client or considered lost (to prevent rogue/zombie processes).
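The general idea of such a heartbeat timeout can be sketched as follows. This is only an illustration of the pattern, not BOINC's actual implementation; the class name `HeartbeatWatchdog` and its methods are hypothetical, and only the 30-second limit is taken from the stderr messages.

```python
import time


class HeartbeatWatchdog:
    """Tracks when the last heartbeat arrived and reports a timeout."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout      # seconds without a heartbeat before giving up
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat was just received."""
        self.last_beat = self.clock()

    def expired(self):
        """True if more than `timeout` seconds passed since the last heartbeat."""
        return self.clock() - self.last_beat > self.timeout


# Usage with a fake clock so the example runs instantly.
now = [0.0]
watchdog = HeartbeatWatchdog(timeout=30.0, clock=lambda: now[0])
now[0] = 29.0
print(watchdog.expired())   # still within the limit
watchdog.beat()
now[0] = 60.0
print(watchdog.expired())   # 31 s of silence: the app would exit here
```

A worker following this pattern would call `expired()` periodically and shut down once it returns true, which is why a client that is starved of CPU or disk for half a minute can trigger an exit.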

In this case it looks like timekeeping is an issue too:

New system time (1448541120) < old system time (1448541186); clearing timeouts

Which means the clock was adjusted backwards by 66 seconds, i.e. a differential of more than 30 s.
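The same subtraction can be applied to every "New system time" line from the first post. A small sketch, assuming the message format stays exactly as shown in that log (the helper name `backward_jumps` is made up for this example):

```python
import re

# Matches the BOINC client message quoted in the first post.
PATTERN = re.compile(r"New system time \((\d+)\) < old system time \((\d+)\)")


def backward_jumps(log_text):
    """Return the size, in seconds, of each backward clock step in the log."""
    return [int(old) - int(new) for new, old in PATTERN.findall(log_text)]


LOG = """\
11/26/2015 2:32:00 PM | | New system time (1448541120) < old system time (1448541186); clearing timeouts
11/26/2015 3:30:45 PM | | New system time (1448544646) < old system time (1448544719); clearing timeouts
11/26/2015 5:29:42 PM | | New system time (1448551782) < old system time (1448551845); clearing timeouts
11/26/2015 8:27:25 PM | | New system time (1448562446) < old system time (1448562581); clearing timeouts
11/26/2015 10:25:14 PM | | New system time (1448569515) < old system time (1448569645); clearing timeouts
"""

print(backward_jumps(LOG))  # [66, 73, 63, 135, 130]
```

Every one of the five corrections in that log is larger than the 30-second heartbeat window.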
[Nov 27, 2015 9:11:06 AM]
hristo_h_m
Cruncher
Joined: Dec 10, 2008
Post Count: 22
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

Reply to Sgt. Joe [Nov 27, 2015 2:34:26 AM]
link https://secure.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508535

OK, I omitted the specs of the computer for brevity. In short, this computer has already done a few tasks without a problem, with my help to reach efficiency.

Here it is.

CPU: Q9450 at 2 GHz, but 2.66 GHz for this task
OS: Microsoft Windows XP: Professional x64 Edition, Service Pack 2
Memory: 8 GB
Disk free: 130 GB

11/27/2015 11:58:51 AM | World Community Grid | General prefs: using your defaults
11/27/2015 11:58:51 AM | | Reading preferences override file
11/27/2015 11:58:51 AM | | Preferences:
11/27/2015 11:58:51 AM | | max memory usage when active: 4062.66MB
11/27/2015 11:58:51 AM | | max memory usage when idle: 7312.79MB
11/27/2015 11:59:01 AM | | max disk usage: 10.00GB
11/27/2015 11:59:01 AM | | (to change preferences, visit a project web site or select Preferences in the Manager)
11/27/2015 11:59:01 AM | World Community Grid | Task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 is 0.15 days overdue; you may not get credit for it. Consider aborting it.
11/27/2015 11:59:01 AM | | Not using a proxy
11/27/2015 11:59:43 AM | World Community Grid | task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 aborted by user
11/27/2015 11:59:44 AM | World Community Grid | Computation for task E234874_784_S.312.C38H26N6O2.PVMFHTQPSZFMLJ-UHFFFAOYSA-N.12_s1_14_3 finished


Let me remind you of the minimum requirements for this project:

CEP2: memory 1024 MB, disk 2048 MB; Windows, Mac, Linux

I will add this info.

Yes, the "leave in memory" checkbox is ticked. But it is irrelevant to this issue.

Yes, there is a lot of disk I/O activity. That too is irrelevant.

BOINC is running just this WU.

Now to the main point that I am trying to communicate. The total memory used by the task is, let's say, 150 MB physical and 250 MB virtual. The total disk space occupied is 300 MB. In sum, that is under the 1 GB minimum requirement.

So why are the disk I/O operations so intensive when everything can fit in RAM?
And the disk I/O traffic does not justify the omission of retrievable checkpoints.

For example, this would be a good use of disk I/O anyway, because the CEP2 application writes at least 2 GB to disk in the time elapsed before a checkpoint. That is two times the subproject's whole minimum requirement. (Again, I emphasize that VirtualBox can take a snapshot of the whole state of the operating system with the applications running in it.) But the CEP2 project refuses this behavior, yet in the end it reaches that amount of disk writes anyway. Instead it restarts and deletes useful data on the disk. This is plain stupid; it kills a disk, especially if it is an SSD.

Internally the application restricts itself to using 512 MB of RAM and 32 GB of temporary files. In my view, none of the crunched tasks reach these internal limits, as most of you can confirm.

So the application does not throw an exception saying that it is busy with, let's say, your favorite disk I/O activity. It does not throw anything indicating an error of any kind.
Its behavior is simply to burn watts.

I repeat: there is something wrong. I have read not just the new posts but the old ones too, and I can affirm this on behalf of many disoriented volunteers.
[Nov 27, 2015 11:15:59 AM]
hristo_h_m
Cruncher
Joined: Dec 10, 2008
Post Count: 22
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

Reply to SekeRob* [Nov 27, 2015 9:11:06 AM]
link https://secure.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508540

/Quote
Which means the clock was adjusted backwards by 66 seconds i.e > 30s differential.
/Quote

Do you really think that the numbers in "New system time (1448541120)..." are seconds? That is more seconds than there are in 30 years.

It is one way or the other: either you write a proper piece of code, or you write a piece of something else, like CEP2 did.
There are two words, speed and efficiency, and nothing in between matters.

I repeat myself: the chosen WU behavior is inefficient and far from optimal.

Half my specs would suffice for a long time.
[Nov 27, 2015 11:38:53 AM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

Please be aware that BOINC does timekeeping in offset to the Unix start date 1-1-1970. Deduct new time from old time and you get the differential correction. Timekeeping is a *very* important piece in ensuring nothing goes bad or wild with the results, part of the GIGO prevention. Your system clearly has trouble with this key-part.

As was noted, the project is *opt-in* for reasons such as higher hardware requirements and longer up-time. Remedy the clock issue and you will mostly be fine running CEP2.
[Nov 27, 2015 11:47:47 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2069
Status: Recently Active
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

Yes, 1448541120 is the number of seconds. The beginning of time (for computers) is 1 Jan 1970. This is called the "Unix epoch". See:
https://en.wikipedia.org/wiki/Unix_time

You can find the current Unix timestamp at http://www.unixtimestamp.com/.
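For anyone who wants to check the number locally, a short sketch using only the Python standard library (the helper name `epoch_to_utc` is made up for this example):

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert seconds since 1 Jan 1970 00:00 UTC to a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(epoch_to_utc(1448541120).isoformat())  # 2015-11-26T12:32:00+00:00
```

That is 12:32:00 UTC on 26 Nov 2015, which would match the 2:32:00 PM log entry if the machine's clock is set two hours ahead of UTC (an assumption; the thread does not state the time zone).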
----------------------------------------
[Edit 4 times, last edit by adriverhoef at Nov 27, 2015 12:39:22 PM]
[Nov 27, 2015 11:50:03 AM]
hristo_h_m
Cruncher
Joined: Dec 10, 2008
Post Count: 22
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

Reply to SekeRob* [Nov 27, 2015 11:47:47 AM]
link http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508547

and adriverhoef [Nov 27, 2015 11:50:03 AM]
link http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=508548

Yes, you are right! I was blinded by rage and forgot this knowledge.
For one reason: because it is IRRELEVANT.

Let's keep to the topic.

So in essence, the core client keeps old-style ticking while BOINC uses more precise timekeeping (or vice versa, it doesn't matter), and after four hours this discrepancy grows to a whopping 30 seconds.

Or else this may be due to HPET being enabled in the BIOS. High-precision event timers keep time to microsecond precision, while the kernel ticks in the millisecond range. Both are independent. So a difference of 30 seconds accumulates in four hours. That is strange for a modern computer and software, and thus totally implausible.

Is this irrelevant? YES, again.

My comparative analysis tells me that other BOINC tasks survive clearing timeouts (the only visible effect is an adjustment of the percentage of work done); only the CEP2 task cannot, which says something.

I repeat myself: CEP2 has a problem with checkpointing (plus restarting). This leads to inefficiency.
For me this is obvious.
This is my last post. I have already disabled this project. Like I said, I don't care about badges and points.

But even with this bug, this project is not very different from MCM or OET, because they all use brute-force methods made possible by distributed computing, and are algorithmically deficient.

My argument about MCM and OET is that these projects use a stochastic variational method, which is something like throwing darts and hoping to hit the global minimum when there is a vast number of minima of equal scale, as is always the case with living structures. There is no such thing as a global minimum in such a huge complex system!

What is the difference? Those projects are fair to volunteers. And, importantly, they keep the elapsed processing time, or at least something close to it.

Personally, I don't believe that these projects are more meaningful than, say, SETI, which is a byword for meaningless. (Stop: I participate in it too.) But this link must say something about all this.

http://www.gwern.net/Charity%20is%20not%20about%20helping

My comment there is:

My 2cent in 5PFlops • 5 months ago
I agree with your opinion that Fold@home must take some action, and I would like to make these comments.
First, I contribute exclusively in winter. I don't care about points and teams and badges and papers (I don't read them, but I want to think that they are good). I contribute about 70 W x 8 h x 3 months.
Secondly, I use only my "new" computer for this. My video card is a 40 nm, 15 W Evergreen, and the processor is a 45 nm, 65 W Q8200 @ 2 GHz. I don't use any heater farms such as Pentium 4 or Athlon XP machines running 24/7, unless it is an Ice Age.
Third, I contribute when I am actually using my computer. At idle it consumes, let's say, 120 W, and at max load 200 W. But I actually run one WU at a time.

If everyone is like me, then your math does not work out very well. It is good reading for every folder, though.

end of quote

I repeat that a project needs to be sufficiently efficient before it is unleashed in the wild.

CEP2 is not, and debugging information is missing (or missed its time, whatever), because it has beta behavior but not beta status.

I am waiting for relevant posts.

Cheers crunchers
[Nov 27, 2015 1:05:46 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7545
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

The CEP2 project uses software called Q-Chem. It is quantum computational software for molecular modeling. It is a complex piece of software. If you are familiar with programming in C, C++ and Fortran and have knowledge of quantum chemistry, I am sure the CEP2 team would love to have your input on how to improve the algorithms of this proprietary program for greater efficiency.
The cleanenergy scientist explains a little of the program here.
In regards to your hardware: the Q9450 is a slightly faster chip than a Q6600. I have a Q6600 and would not think of running CEP2 on it because it is just too slow to be effective, especially if my machine was only on for 12 hours at a time. Some of the smaller molecules would work, but the bigger ones would time out or not be able to checkpoint the first task within the time period the machine is turned on. Also I note you are using XP service pack 2. I believe the latest service pack for XP was 4 at the time Microsoft pulled support for this software.
I am sure your contributions for CEP2 are appreciated, but given the constraints of your hardware, software and the amount of time your machine is turned on, other projects within WCG are probably more appropriate given your circumstances and will result in greater efficiencies for you.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Nov 27, 2015 4:27:00 PM]
Jim1348
Veteran Cruncher
USA
Joined: Jul 13, 2009
Post Count: 1066
Status: Offline
Re: Appeal to CEP2 staff to react to wrong wu behavior and volunteers vote

VirtualBox may be a good idea for a variety of reasons, but I am sure they are not going to adopt it now in mid-stream; maybe the next time around. I used to use a Q6600, but it was on 24/7, and I don't know how it would work now with the current work units. If you don't understand why the project is the way it is, I recommend you read the earliest posts. There is a reason this is the only project with a maximum "Number of work units per host" setting.

EDIT: Note the very high write-rate to disk for this project. A Q6600 will do hundreds of gigabytes per day (close to 1TB/day) if all 4 cores are used. That is hard on a mechanical drive, and early death to an SSD (though you might get a few years out of it). It may account for the heartbeat issue also. I put the BOINC Data folder on a ramdisk for that reason, or a large write-cache will work. The error rate is practically zero on my machines.
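To put that write rate in more familiar units, here is a back-of-the-envelope sketch. It takes the upper figure of about 1 TB/day on 4 cores from the paragraph above and assumes decimal units (1 TB = 1e12 bytes); the function name is made up for this example.

```python
SECONDS_PER_DAY = 86_400


def sustained_rate_mb_s(bytes_per_day):
    """Average write rate in MB/s for a given number of bytes written per day."""
    return bytes_per_day / SECONDS_PER_DAY / 1e6


total = sustained_rate_mb_s(1e12)   # roughly 1 TB/day across all 4 cores
per_core = total / 4                # assuming the load is spread evenly
print(f"{total:.1f} MB/s total, {per_core:.1f} MB/s per core")
# prints "11.6 MB/s total, 2.9 MB/s per core"
```

The average looks modest, but it is sustained around the clock, which is why a ramdisk or a large write cache helps.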
----------------------------------------
[Edit 2 times, last edit by Jim1348 at Nov 27, 2015 5:28:25 PM]
[Nov 27, 2015 5:08:41 PM]