World Community Grid Forums
Thread Status: Active | Total posts in this thread: 24
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Hello MarkHNC,

Here is my unsolicited opinion on unexplained science-application errors. Right now, DDDT2 is stalled while the scientists try to overcome problems running the program on West Nile encephalitis targets. These problems can occur when an algorithm suddenly finds itself in an impossible situation while processing a particular molecule; for example, the algorithm may not converge to a single value. That is not necessarily a programming bug. It can happen simply because the algorithm cannot handle a particular molecular configuration. For whatever reason, almost all quantum chemistry programs will fail over some ranges of input data. I just shrug and let the project scientists worry about it.

On some projects, such as HCMD2, we end up rerunning some molecules with different parameters. (Which is what we have been doing for several months. Almost through!!)

Lawrence
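To make the non-convergence point concrete, here is a toy sketch of my own (nothing like the actual DDDT2/CEP2 code) of the kind of iterative loop these applications run internally. For one input it settles; for another, the very same correct code bounces around forever and can only bail out:

    import math

    class ConvergenceError(RuntimeError):
        pass

    def fixed_point(update, x0, tol=1e-10, max_iter=200):
        # Iterate x -> update(x) until successive values agree to tol.
        x = x0
        for i in range(max_iter):
            x_new = update(x)
            if abs(x_new - x) < tol:
                return x_new, i + 1              # converged
            x = x_new
        raise ConvergenceError("no convergence after %d iterations" % max_iter)

    # A well-behaved "molecule": a contraction mapping, converges quickly.
    print(fixed_point(math.cos, 1.0))

    # A pathological "molecule": the same solver oscillates chaotically.
    try:
        fixed_point(lambda x: 4.0 * x * (1.0 - x), 0.3)
    except ConvergenceError as e:
        print("error:", e)

Neither outcome is a bug in the loop; the second input simply has no stable fixed point for this update rule. That is roughly the situation the project scientists have to work around.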
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7849 | Status: Offline
I presume you have a quad-core system, because you stated you are now running only 3 CEP2 to one FAAH. Has this allowed CEP2 to behave over time? If so, it seems to point to a potential bottleneck someplace in your system. You have stated that you use only high-end components and have otherwise maximally stressed your system with other applications without experiencing errors, so I am at a loss as to where that bottleneck might be. It also does not explain why you were able to run CEP2 for an extensive period of time without a problem. The only other item which comes to mind is a transient problem with your power supply, but given your care with components, that is probably unlikely. Other than that I am at a dead end for any further suggestions, but I would be curious to know if you ever do figure out the cause. Good luck.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers*
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
When I don my programmer's hat, I often require much more of this machine than CEP2 does. One of my big data projects was converting a 10 GB database into a new one, record by record, converting some field formats as I went, depending on record contents. That process asked far more of the drive than CEP2 does, with the drive running at 100% load (according to Resource Monitor) for a few hours at a time. Testing indicates that the drive is easily capable of writing the entire memory footprint of CEP2 in a matter of seconds, and it is even faster at reading. For example, when doing video editing, this PC can transcode a 4 GB NTSC video file from one format to another in four minutes or less.

I wonder whether the difference between sequential and random writes is the cause. Sequential writes are easier on the drive, whether it is mechanical or SSD, but random writes slow things down, especially if they require a lot of head movement on a mechanical drive. I expect that the real-time operation of CEP2 requires more random writes than a database conversion does, but I have not tested that myself.

If you want to have some fun, try out the free trial (180 days) of FancyCache. Even if it does not solve this problem, it will speed things up if you use write caching and set a long write delay of at least a few minutes (I usually used 24 hours).
http://www.romexsoftware.com/en-us/fancy-cache/index.html
Note that the "Volume" version caches a single partition, which is normally what I use; the "Disk" version can cache one or more entire drives. The caveats about backup power supplies still apply, since you lose anything in the write cache during a power outage or crash. However, if you set the write delay to only 10 minutes or so, you still get a worthwhile reduction in writes while not losing very much in a crash. (Note that in addition to reducing the number of writes to the drive for CEP2, the cache also serializes them, which makes things easier on the drive.)

[Edit 1 times, last edit by Jim1348 at Nov 10, 2012 5:07:04 PM]
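If you want to see the sequential-versus-random gap on your own drive, here is a quick sketch of my own (not anything from WCG; the scratch-file name is made up) that times 4 KB writes to the same file in order, then shuffled:

    import os, random, time

    PATH = "writetest.bin"      # hypothetical scratch file
    BLOCK = 4096                # 4 KB blocks
    COUNT = 2048                # 8 MB total

    def timed(label, offsets):
        with open(PATH, "r+b") as f:
            start = time.perf_counter()
            for off in offsets:
                f.seek(off * BLOCK)
                f.write(os.urandom(BLOCK))
                f.flush()
                os.fsync(f.fileno())    # force each block to the platters
            print("%s: %.2f s" % (label, time.perf_counter() - start))

    # Pre-allocate the file so both passes overwrite the same extent.
    with open(PATH, "wb") as f:
        f.truncate(BLOCK * COUNT)

    sequential = list(range(COUNT))
    shuffled = sequential[:]
    random.shuffle(shuffled)

    timed("sequential", sequential)
    timed("random", shuffled)
    os.remove(PATH)

On a mechanical drive the shuffled pass is typically many times slower because of head movement; on an SSD the gap is smaller but usually still visible.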
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
Part of why I chose enterprise-class Western Digital 10K RPM VelociRaptors as my primary workstation hard drives is their superior random access speed. In testing, these drives outperform all of my 7200 RPM drives; you have to move from SATA to SAS to get better performance. The more recent models I am using also feature TLER, so the possible bottleneck of error recovery is mitigated. (These drives also feature an "IcePack," a 3.5" heat sink surrounding the 2.5" drive, which means they run at the same or lower temperatures as my 7200 RPM drives at the same ambient temperature, and mount in a 3.5" bay.)

I already have write caching enabled on these drives, because each of my PCs has a battery backup.
Jim1348
Veteran Cruncher | USA | Joined: Jul 13, 2009 | Post Count: 1066 | Status: Offline
"I already have write caching enabled on these drives, because each of my PCs has a battery backup."

That enables the internal cache on the drive (e.g., 32 MB or whatever), and it will help the data get to the drive if the drive is busy doing other things when the OS wants to write to it. It will bridge the gap for a few seconds, depending on how fast you are sending data, and it might be enough to eliminate the above-noted problem in your case, since you have very fast drives to begin with. But CEP2 writes a LOT of data, depending on how many cores it is running on at a time.

A caching program like FancyCache uses your main memory, which allows you to set the cache size to a GB or more; I think I used 4 GB when setting the write delay to 24 hours, but that is on the large side and not necessary with shorter write delays. The write cache in FancyCache is also a read cache, so any data still in the cache can be read from main memory rather than fetched from the drive. That cuts down a lot on head movement and reduces delays in getting the data out, too. In fact, when you set the write delay to more than about 12 hours, you are in effect creating a RAM disk: all the data in a work unit is written to, and read from, main memory rather than the disk. It may not be needed in your case, but it would definitely help on slower drives.
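For anyone curious how the write-delay idea works, here is a toy write-back cache of my own (FancyCache itself is closed source, so this is only the concept): writes sit in RAM until they are older than the delay, reads of hot blocks never touch the drive, and a flush pushes the dirty blocks out in sorted (serialized) order. The crash caveat above is visible right in the code: anything still in `dirty` at the moment of a power loss is gone.

    import time

    class WriteBackCache:
        def __init__(self, path, write_delay=600.0):   # delay in seconds
            self.path = path
            self.write_delay = write_delay
            self.dirty = {}    # offset -> (data, time it became dirty)

        def write(self, offset, data):
            # Repeated writes to the same offset cost one physical write.
            self.dirty[offset] = (data, time.monotonic())

        def read(self, offset, length):
            if offset in self.dirty:            # serve hot data from RAM
                return self.dirty[offset][0][:length]
            with open(self.path, "rb") as f:    # cache miss: hit the disk
                f.seek(offset)
                return f.read(length)

        def flush(self, force=False):
            # Push blocks older than write_delay (or all, if forced) to disk.
            now = time.monotonic()
            due = {o: d for o, (d, t) in self.dirty.items()
                   if force or now - t >= self.write_delay}
            if not due:
                return
            with open(self.path, "r+b") as f:
                for offset in sorted(due):      # sorted = serialized writes
                    f.seek(offset)
                    f.write(due[offset])
            for offset in due:
                del self.dirty[offset]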
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
The failures of CEP2 are not occurring during periods of high drive activity, which is why I don't think drive throughput is the problem. There is high drive activity as each CEP2 workunit starts (the first 3 or 4 seconds), but the activity quickly drops to nominal. It's between 10 and 15 seconds into the run that CEP2 fails.

In any event, if Win7's Resource Monitor is to be trusted, the rate at which data is being written comes nowhere near the known top write speed of this drive, even on random writes, and that high drive activity has moderated well before CEP2 fails. If all four CEP2 processes were starting or doing heavy writes at the same time, it would seem more plausible that it was reaching the limits of the drive.

I have been running this PC since I joined in May of this year with "Run always" selected on the Activity menu and the preferences set to use 100% of processors and 100% of CPU time while processor usage is less than 50%. If I know I'm going to push the 50% threshold often (e.g., video editing), I simply suspend BOINC. In all other respects, I am very proud of the efficiency/low overhead on this PC. For example, the last validated CEP2 workunit from this afternoon (while the PC did not have an active user) shows 5.16 hours of CPU time out of 5.19 hours elapsed. At the time, it was running alongside another CEP2 and two FAAH workunits.

I have changed "Run always" to "Run based on preferences" to see if that might stabilize the situation. I have also kept FAAH in the rotation and set CEP2 to a maximum of 3 workunits at a time, which makes it impossible for all four cores to run CEP2 at the same time.

I must confess some bewilderment at the lack of a useful error code/log. As a Windows desktop app developer, I always write error events to the log (and optionally other events), so that there is at least some information to support troubleshooting. This often gives me valuable information to convince a user/client that they have a bad drive or a flaky network connection.

I'm going to let it run this way for a while and see how it behaves.
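In the same spirit, here is the kind of minimal safety net I mean, sketched in Python rather than my actual Windows code (the log-file name is made up): hook unhandled exceptions so that even a crash leaves a traceback in the log instead of vanishing silently.

    import logging, sys

    logging.basicConfig(
        filename="app_errors.log",   # hypothetical log path
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )

    def log_unhandled(exc_type, exc_value, exc_tb):
        # Record the full traceback, then defer to the default handler.
        logging.critical("unhandled exception",
                         exc_info=(exc_type, exc_value, exc_tb))
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = log_unhandled

    logging.info("application started")
    buffer = bytearray(4096)
    buffer[8192] = 0    # deliberate out-of-range write: logged, not silent

Even this much would tell you whether a task died at second 12 from an I/O error, a bad allocation, or a computation fault.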
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
"I just shrug and let the project scientists worry about it."

Thanks, Lawrence. I somehow missed your encouraging post. Much appreciated.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
"It also does not explain why you were able to run CEP2 for an extensive period of time without a problem."

Agreed, Sgt. Joe. It doesn't make sense to me either. It would run 24 to 36 hours without a problem, and then suddenly start throwing errors. We get very high-quality, stable power from the electricity provider in our area, and I have a good battery backup on that PC, which conditions the power. All my PCs also have high-efficiency power supplies rated for at least twice the wattage they require, so total power requirements shouldn't be an issue. With heavy primary drive activity, all four cores at 100%, and the LED/LCD monitor awake, this rig maxes out around 150 W draw on a 900 W battery backup.

As to running 3 CEP2 to 1 FAAH, I can only manage that if I micromanage the queue. I have BOINC on that PC set for half a day's cache. Since CEP2 is currently limited to 3 workunits at a time, it keeps a much larger cache of FAAH than CEP2, so there are times when only FAAH workunits are running. Wish I could tell it to do 3 to 1 . . .
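For what it's worth, sufficiently recent BOINC clients can enforce that kind of per-application cap themselves via an app_config.xml file placed in the project's data directory. A minimal sketch follows; the short application name "cep2" is my assumption, so check the name your client actually reports in client_state.xml before using it:

    <!-- app_config.xml in the World Community Grid project directory.
         Requires a BOINC client new enough to support app_config.xml.
         The short name "cep2" is an assumption; verify it first. -->
    <app_config>
        <app>
            <name>cep2</name>
            <!-- never run more than 3 CEP2 tasks at once -->
            <max_concurrent>3</max_concurrent>
        </app>
    </app_config>

After saving the file, re-read the config files from the BOINC Manager (or restart the client) for it to take effect.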
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I do think I'd at least consider loading the combination of eFMer's BoincTasks and TThrottle, to give yourself the ability to monitor and log processor core temperatures and to throttle the processor down when it overheats. You could then state firmly that your core-temperature data rules out overheating as an issue, rather than relying on whatever assumptions you are making about ambient temperatures.

You'd be surprised at how quickly environmental conditions can change where cooling is involved... as when your cleaning service, significant other, or mother does you the great good favor of dosing the carpets with a substance like Carpet Fresh, or construction upwind of your A/C fresh-air intakes blesses you with some grade A rock or cement dust.
CandymanWCG
Senior Cruncher | Romania | Joined: Dec 20, 2010 | Post Count: 421 | Status: Offline
"I do think I'd at least consider loading the combination of eFMer's BoincTasks and TThrottle..."

...or you could use the recommended setting for this project and choose to run it on just 50% of your CPU cores (which, by coincidence or not, is the same as your old setting of running only 2 units at a time).

My two cents on the matter are these: even if the error you see is because of the huge amount of I/O on whatever type of storage you have, and assuming you somehow get that fixed, I would bet it won't be long before you start seeing another nice error: "no heartbeat from client for more than x seconds...exiting". And that, my friend, is the CPU getting clogged up. So my advice: go back to running it on 2 cores. If you have other devices that run other sciences, maybe you can add a unit or two there to compensate for the other 2 cores that you won't be running CEP2 on.

Oh, and by the way, the fact that BOINC refused to give you any more units is just a fail-safe put in place for exactly this type of situation, where a machine spits out only bad results. You would put a leash on it too, right? But no worries: as soon as you get it fixed, your daily quota will increase little by little, and since the maximum is not that far off, you shouldn't even notice when you are back to normal.

Hope this helps. Cheers!

Knowledge is limited. Imagination encircles the world! - Albert Einstein

[Edit 1 times, last edit by CandymanWCG at Nov 14, 2012 11:02:45 PM]