| World Community Grid Forums
|
Thread Status: Active Total posts in this thread: 38
|
|
| Author |
|
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
|
@ [B.S] sTrey: My similar problem was caused by the model of the HDD on which I had the BOINC data. Replacing the drive by an older, slower one that has a smaller onboard cache has solved the problem. I can now run 4 simultaneous CEP2 WUs reasonably happily with 2GB RAM under XP-32. With 4 CEP2s there is much activity on the BOINC data drive, and almost none detectable on the drive with the system & pagefile, and there are many brief instances where CPU usage drops during the HDD activity. No WUs have been "Bogart"ed since I changed the drive. I prefer to run 1 CEP2 WU and 3 WUs from other projects, because much less CPU time is lost by the dropouts.
More details in my posts in the "work units not finishing" thread. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Dear Rickjb,
That is a very odd and counter-intuitive behavior. Just out of curiosity, how was the previous HD connected? Is it a regular SATA drive? Any general compatibility issues with the motherboard or card? It was not an external drive, right?
Best wishes,
Your Harvard CEP team |
||
|
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
|
No compatibility issues with motherboard: Asus P5Q Deluxe - LGA775, with Intel P45 chipset including ICH10 Southbridge, using the ICH10 SATA controller in either IDE or AHCI Mode. The WD "Caviar Green" drive is a regular SATA II unit. It is meant to be an "internal" drive, but it's actually "external" because it & the rest of the machine are sitting out on a table.
The WD drive seemed to freeze the system for longer and to a much greater extent than the IDE units. For example, with the WD, the menu bar of the BOINC Manager window would white out when I tried to access it, but this window remains quite responsive during periods of intense HDD activity with the IDE drives. I think the WD drive was also preventing the WCG science apps from running enough to send their heartbeat signals back to BOINC, so they timed out & were killed. I'd need more expertise and diagnostic tools to analyse what is happening at the detailed hardware level. I just know that 2 different IDE drives with smaller caches have solved the problem. (The "other" IDE drive was the current 8.4GB Seagate system drive, which fixed the problem when it held the BOINC data to run DDDT2, but which is too small for CEP2.) Please see my posts in Legrandpiou's "work units not finishing" thread. |
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Just an update: I set up one Intel 980x machine with a large RAMdisk and placed the BOINC data folder on it, and it has been happily crunching 12 threads of CEP2 24/7 for the past couple of months. CPU efficiency is greater than 99%. The OS (Server 2003 x64), including the pagefile, is on a traditional 7200RPM SATA HD.
The catch is I needed 24GB of DDR3 to do it. The minimum size needed for the RAMdisk is about 13GB for 12 threads to operate properly. I have mine set at 16GB. Also, most decent RAMdisk software is not free, and running one causes a lot of delay when starting up and shutting down the machine. There's also the risk of losing the entire contents of the RAMdisk should the machine crash or be shut down unexpectedly. |
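As a rough back-of-envelope check on those figures, here is a small Python sketch. It only uses the numbers quoted in the post above (about 13GB minimum for 12 threads, 16GB configured). The function names are made up for illustration, and the idea that the space needed scales linearly with thread count is an assumption, not something measured in this thread.

```python
# Figures quoted in the post above.
MIN_GB = 13.0         # approximate minimum RAMdisk size for 12 CEP2 threads
THREADS = 12
CONFIGURED_GB = 16.0  # size the poster actually configured


def per_thread_gb(min_gb=MIN_GB, threads=THREADS):
    """Average RAMdisk space per CEP2 thread implied by the quoted minimum."""
    return min_gb / threads


def headroom_fraction(configured_gb=CONFIGURED_GB, min_gb=MIN_GB):
    """Extra space configured on top of the quoted minimum, as a fraction."""
    return configured_gb / min_gb - 1.0


def estimate_ramdisk_gb(threads, headroom=None):
    """Estimate a RAMdisk size for a different thread count.

    ASSUMPTION: space scales linearly with the number of CEP2 threads.
    Nobody in this thread measured that; treat the result as a guess.
    """
    if headroom is None:
        headroom = headroom_fraction()
    return threads * per_thread_gb() * (1.0 + headroom)


if __name__ == "__main__":
    print(f"per thread: {per_thread_gb():.2f} GB")
    print(f"headroom:   {headroom_fraction():.0%}")
    print(f"4 threads:  {estimate_ramdisk_gb(4):.1f} GB (linear guess)")
```

That works out to roughly 1.08GB per thread with about 23% headroom on this one machine. Other hardware or other batches of work units may behave differently.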
||
|
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline
|
The start-up of a CEP2 workunit involves unzipping a bunch of files. This is a VERY IO intensive process. On some machines under some conditions, this will likely cause the research application to be non-responsive to heartbeat messages. I am assuming that BOINC decides that CEP2 isn't responding and thus kills it. If the IO is still heavy and the other applications that are being kept in memory (the "leave applications in memory" setting, LAIM) have been swapped out, then we are going to see the same effect there, because the apps are unable to get loaded back into memory before they are killed for being unresponsive.
To test this, can you do the following:
Create the cc_config.xml file outlined here: http://boinc.berkeley.edu/wiki/Client_configuration
Enable the following flags: <cpu_sched> and <cpu_sched_debug>
It will look like:
<cc_config>
  <log_flags>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
  </log_flags>
  <options>
  </options>
</cc_config> |
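For anyone who would rather generate the file than type it by hand, here is a minimal Python sketch that builds the same structure with the standard library. The function name `build_cc_config` is just an illustrative name, not part of BOINC, and `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags=("cpu_sched", "cpu_sched_debug")):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    # Empty <options> element, matching the layout shown in the post.
    ET.SubElement(root, "options")
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config())
```

Save the output as cc_config.xml in the BOINC data directory, then restart the client or tell it to re-read the config file so the new log flags take effect.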
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hello XS_fallwind,
It sounds as though you are gathering a lot of practical experience running BOINC on a RamDisk. I hope that someday you will add a post about your experience on the RAMDisk thread ( https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,30244 ).
Lawrence |
||
|
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
|
@knreed: Thanks for your input. In my case, I don't think the WUs that died were swapped out, so LAIM is irrelevant. It was often not just the 1 CEP2 WU that died, but 2 of the 3 FAAH ones, and often the CEP2 survived. In Task Mangler Processes, what usually happened is that when the HDD LED came on & the system froze, the CEP2 went to low/zero %CPU, but the FAAHs stayed at 25% for a little while and then they too dropped to 0% one at a time. The FAAHs may have been continuing until they needed to do disk I/O, encountered the hangup, timed out & were killed. I didn't set the cpu-logging flags to confirm.
It's been several days since I changed the BOINC data HDD and there have been no WU crashes since. I am reluctant to put the previous HDD back on line.
Did you see my enquiry/suggestion in the "work units not finishing" thread re introducing short delays in the "VERY IO intensive" part(s) of CEP2, to allow slow HDDs to catch up, in an attempt to alleviate the timeouts problem? I've also noticed on a machine that has never had CEP2 cause a timeout that it is very slow at opening extra programs while CEP2 is doing this I/O, so diluting the intensity of the I/O would probably also help reduce the impact of running CEP2 on machines that are being used for tasks other than WCG, over a much wider range than just those with slow HDDs. The overall gain in system performance and perhaps in cruncher numbers should more than compensate for the few tens of seconds increase in wall-clock time for CEP2 WUs.
I don't know how you'd implement the delays to suit all machines. Are there any spare "User" parameters in cc_config.xml that you could use to make the delays tunable?
[I have no self-interest in promoting WD, but the WD Caviar Green 500GB WD5000AADS drive is still a current product and is very cheap, if you want to experiment in-house. Its stated Idle power consumption of only 2.18W is incredible for a mechanical HDD, but the downside is poor performance for demanding applications.]
HTH - Rick |
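To make the "short delays" suggestion concrete, here is a minimal Python sketch of the general idea: unpack an archive one member at a time and pause briefly after each file so a slow disk gets a chance to catch up. This is only an illustration of the technique. It is not how CEP2 actually unpacks its files, and the function name `extract_throttled` and the `pause_s` parameter are hypothetical.

```python
import time
import zipfile


def extract_throttled(zip_source, dest_dir, pause_s=0.05, sleep=time.sleep):
    """Extract every member of a zip archive, pausing after each one.

    zip_source: path or file-like object accepted by zipfile.ZipFile.
    pause_s:    HYPOTHETICAL tunable delay in seconds between files.
    sleep:      injectable so the pause can be observed or skipped in tests.
    Returns the member names in the order they were extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            zf.extract(info, dest_dir)
            extracted.append(info.filename)
            sleep(pause_s)  # give other processes a window for their own disk I/O
    return extracted
```

The trade-off is the one described above: total wall-clock time grows by roughly `pause_s` times the number of files, in exchange for less bunched-up I/O. Whether a pause this simple would actually avoid the timeouts on a drive like the WD is untested.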
||
|
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Rickjb: "Replacing the drive by an older, slower one that has a smaller onboard cache has solved the problem."
Just to add a bit to the theoretical understanding here:
Cleanenergy: "That is a very odd and counter-intuitive behavior."
According to Microsoft engineering, there is actually a known issue with disks that cache too much (under the heading "Random Writes & Flushes: Your mileage will vary greatly"):
"On occasion, we’ll see HDDs struggle with bursts of random writes and flushes. Drives that cache too much for too long and then get caught with too much of a backlog of work to complete when a flush comes along, have proven to be problematic. [...] We’ve seen some devices [...] take 10’s of seconds to return to a more consistently responsive state. For the user, this can be awful to endure as responsiveness drops to painful levels."
[Edit 1 times, last edit by Former Member at Apr 7, 2011 12:36:56 PM] |
||
|
|
|