Thread Status: Active
Total posts in this thread: 38
Posts: 38   Pages: 4   [ Previous Page | 1 2 3 4 ]
Previous Thread This topic has been viewed 311394 times and has 37 replies Next Thread
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

@ [B.S] sTrey: My similar problem was caused by the model of the HDD on which I had the BOINC data. Replacing the drive with an older, slower one that has a smaller onboard cache has solved the problem. I can now run 4 simultaneous CEP2 WUs reasonably happily with 2GB RAM under XP-32. With 4 CEP2s there is much activity on the BOINC data drive and almost none detectable on the drive with the system & pagefile, and there are many brief instances where CPU usage drops during the HDD activity. No WUs have been "Bogart"ed since I changed the drive. I prefer to run 1 CEP2 WU and 3 WUs from other projects, because much less CPU time is lost to the dropouts.
More details are in my posts in the "work units not finishing" thread.
[Mar 15, 2011 2:33:01 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

Dear Rickjb,
That is very odd and counter-intuitive behavior. Just out of curiosity: how was the previous HD connected? Is it a regular SATA drive? Are there any general compatibility issues with the motherboard or card? It was not an external drive, right?
Best wishes
Your Harvard CEP team
[Mar 16, 2011 4:14:37 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

No compatibility issues with motherboard: Asus P5Q Deluxe - LGA775, with Intel P45 chipset including ICH10 Southbridge, using the ICH10 SATA controller in either IDE or AHCI Mode. The WD "Caviar Green" drive is a regular SATA II unit. It is meant to be an "internal" drive, but it's actually "external" because it & the rest of the machine are sitting out on a table. :)
The WD drive seemed to freeze the system for longer and to a much greater extent than the IDE units. For example, with the WD, the menu bar of the BOINC Manager window would white out when I tried to access it, but this window remains quite responsive during periods of intense HDD activity with the IDE drives. I think the WD drive was also preventing the WCG science apps from running enough to send their heartbeat signals back to BOINC, so they timed out & were killed.
I'd need more expertise and diagnostic tools to analyse what is happening at the detailed hardware level. I just know that 2 different IDE drives with smaller caches have solved the problem. (The "other" IDE drive was the current 8.4GB Seagate system drive, which fixed the problem when it held the BOINC data to run DDDT2, but which is too small for CEP2.) Please see my posts in Legrandpiou's "work units not finishing" thread.
[Mar 16, 2011 7:10:05 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

Just an update: I set up one Intel 980X machine with a large RAMdisk and placed the BOINC data folder on it, and it has been happily crunching 12 threads of CEP2 24/7 for the past couple of months. CPU efficiency is greater than 99%. The OS (Server 2003 x64), including the pagefile, is on a traditional 7200RPM SATA HD.

The catch is that I needed 24GB of DDR3 to do it. The minimum size needed for the RAMdisk is about 13GB for 12 threads to operate properly. I have mine set at 16GB. Also, most decent RAMdisk software is not free, and running one causes a lot of delay when starting up and shutting down the machine. There's also the risk of losing the entire contents of the RAMdisk should the machine crash or be shut down unexpectedly.
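
For anyone sizing a RAMdisk for a different thread count, here is a rough back-of-envelope sketch. It assumes the requirement scales linearly with the number of CEP2 threads, which is only a guess extrapolated from the single 13GB / 12-thread data point above; `estimate_ramdisk_gb` and `headroom` are illustrative names, not anything from BOINC or CEP2.

```python
# Figures reported in the post above: about 13 GB minimum for 12 CEP2 threads.
REPORTED_MIN_GB = 13.0
REPORTED_THREADS = 12


def estimate_ramdisk_gb(threads: int, headroom: float = 0.0) -> float:
    """Estimate a RAMdisk size in GB for a given number of CEP2 threads.

    ASSUMPTION: space needed grows linearly with thread count. The source
    only gives one measurement, so treat the result as a starting point.
    `headroom` is an extra fraction added on top (0.25 means +25%).
    """
    if threads < 0:
        raise ValueError("threads must be non-negative")
    per_thread_gb = REPORTED_MIN_GB / REPORTED_THREADS
    return threads * per_thread_gb * (1.0 + headroom)


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, "threads:", round(estimate_ramdisk_gb(n), 1), "GB minimum (estimated)")
```

The 16GB setting mentioned above corresponds to roughly 23% headroom over the 13GB minimum (`headroom=3/13`).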
[Mar 16, 2011 7:23:24 AM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

The start-up of a CEP2 workunit involves unzipping a bunch of files. This is a VERY IO intensive process. On some machines under some conditions, this will likely cause the research application to be non-responsive to heartbeat messages. I am assuming that BOINC decides that CEP2 isn't responding and thus kills it. If the IO is still heavy and the other applications that are loaded into memory (LAIM) have been swapped out, then we are going to see the same effect there because the apps are unable to get loaded back in memory before they are killed for being unresponsive.

To test this, can you do the following:

Create the cc_config.xml file outlined here: http://boinc.berkeley.edu/wiki/Client_configuration

Enable the following flags:

<cpu_sched>
<cpu_sched_debug>

It will look like:

<cc_config>
  <log_flags>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
  </log_flags>
  <options>
  </options>
</cc_config>
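
If you would rather generate the file than type it by hand, the following is a minimal sketch that builds the same XML with Python's standard library. The function names and the idea of passing in the BOINC data directory are assumptions for illustration; where the data directory lives depends on your installation, so check the wiki page above before writing anything into it.

```python
import os
import xml.etree.ElementTree as ET


def build_cc_config(flags=("cpu_sched", "cpu_sched_debug")) -> str:
    """Return cc_config.xml text with the given log flags set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    ET.SubElement(root, "options")  # left empty, as in the post above
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


def write_cc_config(data_dir: str) -> str:
    """Write cc_config.xml into data_dir (hypothetical helper).

    ASSUMPTION: data_dir is your BOINC data directory. This overwrites
    any existing cc_config.xml there, so back it up first.
    """
    path = os.path.join(data_dir, "cc_config.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_cc_config())
    return path


if __name__ == "__main__":
    print(build_cc_config())
```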
[Mar 16, 2011 3:08:24 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

Hello XS_fallwind,
It sounds as though you are gathering a lot of practical experience running BOINC on a RamDisk. I hope that someday you will add a post about your experience on the RAMDisk thread ( https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,30244 ).

applause
Lawrence
[Mar 16, 2011 5:28:43 PM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

@knreed: Thanks for your input. In my case, I don't think the WUs that died were swapped out, so LAIM is irrelevant. It was often not just the 1 CEP2 WU that died, but 2 of the 3 FAAH ones, and often the CEP2 survived. In Task Mangler Processes, what usually happened is that when the HDD LED came on & the system froze, the CEP2 went to low/zero %CPU, but the FAAHs stayed at 25% for a little while and then they too dropped to 0% one at a time. The FAAHs may have been continuing until they needed to do disk I/O, encountered the hangup, timed out & were killed. I didn't set the cpu-logging flags to confirm.
It's been several days since I changed the BOINC data HDD and there have been no WU crashes since. I am reluctant to put the previous HDD back on line.

Did you see my enquiry/suggestion in the "work units not finishing" thread about introducing short delays in the "VERY IO intensive" part(s) of CEP2, to allow slow HDDs to catch up and so alleviate the timeout problem?
I've also noticed on a machine that has never had CEP2 cause a timeout that it is very slow at opening extra programs while CEP2 is doing this I/O, so diluting the intensity of the I/O would probably also help reduce the impact of running CEP2 on machines that are being used for tasks other than WCG, over a much wider range than just those with slow HDDs.
The overall gain in system performance and perhaps in cruncher numbers should more than compensate for the few tens of seconds increase in wall-clock time for CEP2 WUs.

I don't know how you'd implement the delays to suit all machines. Are there any spare "User" parameters in cc_config.xml that you could use to make the delays tunable?
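
To make the idea concrete, here is a minimal sketch of what "diluting" the I/O could look like: extract an archive one member at a time and sleep briefly between members. This is not CEP2's code, and I don't know how CEP2 actually unpacks its files; `extract_with_pauses` and the `pause_s` tunable are hypothetical names used only for illustration.

```python
import time
import zipfile


def extract_with_pauses(archive: str, dest: str, pause_s: float = 0.05) -> int:
    """Extract a zip archive member by member, pausing between members.

    The pause gives a slow drive a chance to work through its backlog
    before the next burst of writes arrives. `pause_s` is a hypothetical
    tunable: 0 means no throttling, larger values trade wall-clock time
    for lower I/O intensity. Returns the number of members extracted.
    """
    count = 0
    with zipfile.ZipFile(archive) as zf:
        for member in zf.infolist():
            zf.extract(member, dest)
            count += 1
            if pause_s > 0:
                time.sleep(pause_s)
    return count
```

With a pause of 50 ms and a few hundred files, the added wall-clock time would be on the order of tens of seconds, which is the trade-off described above.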

[I have no self-interest in promoting WD, but the WD Caviar Green 500GB WD5000AADS drive is still a current product and is very cheap, if you want to experiment in-house. Its stated idle power consumption of only 2.18W is incredible for a mechanical HDD, but the downside is poor performance for demanding applications.]
HTH - Rick
[Mar 18, 2011 3:51:50 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Will CEP2 units eventually NOT Bogart the host when they start up?

Rickjb: Replacing the drive by an older, slower one that has a smaller onboard cache has solved the problem.

Cleanenergy: That is a very odd and counter-intuitive behavior.
Just to add a bit to the theoretical understanding here:
According to Microsoft engineering there is actually a known issue with disks that cache too much (under heading "Random Writes & Flushes: Your mileage will vary greatly"):
On occasion, we’ll see HDDs struggle with bursts of random writes and flushes. Drives that cache too much for too long and then get caught with too much of a backlog of work to complete when a flush comes along, have proven to be problematic.
[...]
We’ve seen some devices [...] take 10’s of seconds to return to a more consistently responsive state. For the user, this can be awful to endure as responsiveness drops to painful levels.
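
If anyone wants to see whether their own drive shows this pattern, the following is a rough sketch of a micro-benchmark: it issues a burst of small random writes and then times a single flush. It is only an illustration, with made-up default sizes and a hypothetical function name. Note that `os.fsync` asks the operating system to push its buffers to the device; whether the drive's onboard cache is also flushed depends on the OS and drive settings, so treat the numbers as indicative only.

```python
import os
import random
import time


def time_random_writes_then_flush(path, file_size=1 << 20, block=4096,
                                  writes=64, seed=0):
    """Do `writes` random block-sized writes, then time one fsync call.

    Returns the seconds spent inside os.fsync. All sizes are arbitrary
    illustrative defaults, not values taken from the quoted article.
    """
    rng = random.Random(seed)
    payload = b"\0" * block
    with open(path, "wb") as f:
        f.truncate(file_size)  # pre-size the file so offsets stay in range
        for _ in range(writes):
            offset = rng.randrange(file_size // block) * block
            f.seek(offset)
            f.write(payload)
        f.flush()  # move Python's own buffer to the OS first
        start = time.perf_counter()
        os.fsync(f.fileno())  # ask the OS to commit the data to the device
        return time.perf_counter() - start
```

Run it against a file on the drive that holds the BOINC data, and compare a drive that misbehaves with one that does not.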

----------------------------------------
[Edit 1 times, last edit by Former Member at Apr 7, 2011 12:36:56 PM]
[Apr 7, 2011 12:25:22 PM]