World Community Grid Forums
Category: Completed Research
Forum: The Clean Energy Project - Phase 2 Forum
Thread: CEP2 AMD 4P Optimization
Thread Status: Active Total posts in this thread: 58
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi,
I'm all about doing as much work on this project as possible with the hardware I have (this winter; a seasonal cruncher am I). The only other crunching I have done is for F@H, so I'm trying to learn the ways of WCG.
I have 4 dedicated crunching boxes:
2x 4P 6174s (48 integer cores and 48 FPUs each)
2x 4P 6276s (64 integer cores and 32 FPUs each)
Three of them have 32 GB of RAM and one has 64 GB. All of them are running Ubuntu 14.04.3 LTS.
All of them have a GPU that is in use by GPUGrid, so they are down-cored by one to service the GPU. On the 48-core boxes this means losing one integer core and one FPU. On the 64-core boxes it means losing one integer core, but I don't know if it also loses an FPU, as 2 integer cores share one FPU. I have no idea if this makes a difference or not.
What I don't know:
whether this software runs as many threads as you have FPUs or as you have integer cores,
how much RAM it needs so that it won't use the slower swap file,
how many threads I can run simultaneously and still make the deadline,
how much HD space I will need for that,
and how significant HD speed is.
Those are all the questions I can think of at the moment. Can anyone give me some insight into how CEP2 uses the [H]ardware, and maybe some tips on how to maximize the [H]ardware?
Thanks
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Let me change one thing. I said above "with the hardware I have".
I might make some small changes in hardware if it makes a big difference in productivity. For example, if the work unit threads each run on an individual FPU: the 6174s have 48 FPUs while the 6276s have 32, so I can run 16 more threads (minus 1 for the GPU) on the 6174s. (Yeah, I know, 6386s, but they are outside the budget.) Or if the program needs 1 GB of RAM per core to stay out of the swap file, I might move some RAM around or upgrade some. Or if HD speed makes a big difference, I might upgrade to some small SSDs. Or maybe a different OS is better. Stuff like that: small tweak, big payoff. I'll listen to any suggestion, except switching teams.
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
The overarching problem, and the key reason why by default only one CEP2 work unit is issued to a host, is that concurrent (re)starting brings almost any system to its knees. Many micromanage; some have even developed scripts to achieve staggered starting [which BOINC itself has no facility for].
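A minimal sketch of what such a staggered-start script could look like, assuming the CEP2 tasks were suspended beforehand and are resumed one at a time through the stock `boinccmd` tool. The project URL, the task names and the 120-second gap are assumptions for illustration only; check `boinccmd --get_tasks` and `boinccmd --help` on your own client before relying on any of it.

```python
import subprocess
import time

# Assumption: the project URL exactly as your BOINC client lists it.
PROJECT_URL = "http://www.worldcommunitygrid.org/"


def staggered_resume_plan(task_names, gap_seconds, project_url=PROJECT_URL):
    """Build (wait_seconds, argv) pairs that resume tasks one at a time."""
    plan = []
    for index, name in enumerate(task_names):
        # The first task starts right away; every later one waits for the gap.
        wait = 0 if index == 0 else gap_seconds
        argv = ["boinccmd", "--task", project_url, name, "resume"]
        plan.append((wait, argv))
    return plan


def run_plan(plan, dry_run=True):
    """Walk the plan; dry_run=True only prints, without sleeping or running."""
    for wait, argv in plan:
        if dry_run:
            print(f"wait {wait:>4}s, then: {' '.join(argv)}")
            continue
        time.sleep(wait)
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical task names; real ones come from `boinccmd --get_tasks`.
    example_tasks = ["cep2_task_a", "cep2_task_b", "cep2_task_c"]
    run_plan(staggered_resume_plan(example_tasks, gap_seconds=120))
```

The plan is built separately from its execution so the commands can be inspected in a dry run first. A suitable gap depends on how long the start-up I/O burst lasts on a given disk, which this sketch does not measure.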
OldChap
Veteran Cruncher UK Joined: Jun 5, 2009 Post Count: 978 Status: Offline |
CEP2 is rather write-intensive, so if I were building a rig specifically to crunch this, I would be thinking in terms of running a significantly sized RAM disk. I would also be considering an SSD that can handle big write totals over its lifetime.
Ubuntu is fine. I have no real knowledge of AMD to add anything useful.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
OK OldChap: RAM disk + SSD, got it.
SekeRob*, "overarching": haven't heard of it. School me, plz. Also, what I have found is that the client doesn't see more than 32 cores, and there is some *.xml file that has to be edited. Losing 96 cores across my little server garden just because of that... it's got my attention.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Well, SekeRob*, I have a BUNCH of work units that have "computational errors" in under 3 minutes.
But I also have 19 running simultaneously on 32 cores, so this server is not making enough heat to make me happy.
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7545 Status: Offline |
I would be pretty sure the reason you got the errors is that those WUs (work units) choked on inadequate I/O to your disk or SSD. What SekeRob* is saying is that at the beginning of a WU there is A LOT of I/O which, given enough units at the same starting stage, will overwhelm even the fastest systems because the channels just cannot carry that much data at one time. If your CPUs are starved for data too long, they will throw an error. His suggestion of staggered starting times has worked for some, but keeping the starting times staggered on a 64-CPU system is going to be tough due to the varying lengths of the WUs.
Good luck
Cheers
Sgt. Joe
*Minnesota Crunchers* |
Byteball_730a2960
Senior Cruncher Joined: Oct 29, 2010 Post Count: 318 Status: Offline |
CEP2 workunits have long intervals between checkpoints.
So, if you are running the boxes 24/7, you'll have an initial need to micromanage the startup to reduce I/O, but after some time the reads/writes will settle down to a manageable level (I think) as they'll eventually start staggering. If you shut down the computers often, having this many workunits will lead to a lot of lost computing time and constantly managing I/O. I highly recommend a small SSD. My two big systems are not dedicated, but I bought a 32 GB SSD for each to keep WCG separate from my OS drive.
As for the cores being limited, I had to add a cc_config.xml like so:
<?xml version="1.0"?>
<cc_config>
  <options>
    <ncpus>48</ncpus>
  </options>
</cc_config>
Have a search online and you'll find a better explanation for your needs.
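For several boxes with different core counts, the same file can be generated rather than typed by hand. This is only a sketch using Python's standard library; the function name is made up for illustration. Where the file goes and how the client picks it up are assumptions about a typical Linux install: the BOINC data directory, followed by `boinccmd --read_cc_config` or a client restart.

```python
import xml.etree.ElementTree as ET


def build_cc_config(ncpus):
    """Return a cc_config.xml body that sets only the <ncpus> option."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "ncpus").text = str(ncpus)
    # encoding="unicode" yields a str without an XML declaration line.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 48 matches the example above; use the value that fits each box.
    print(build_cc_config(48))
```

The output is a single unindented line, which is still valid XML.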
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
OK, I think I am getting the picture: I can't have systems dedicated to CEP2 only, even though they are dedicated crunching boxes running 24/7 with no reboots, and SSDs will help a lot.
I haven't done the cc_config.xml thing yet, but the line <ncpus>48</ncpus> would be for a 48-core system. I read somewhere that if you put in <ncpus>-1</ncpus> it will "adjust" to how many physical cores you have. I'm going to try that.
I'm also wondering about how this stuff runs on integer cores vs. FPUs. I am going to guess it's about FPUs, and I am wondering, on a 64-core system that has 2 integer cores per double-bandwidth FPU, whether there is some way to assign one FPU per task and get tasks done (at least) twice as fast as having that FPU tasked by 2 different integer cores. Anyone got some insight on this?
And thanks, all, for the insights.
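One way to experiment with the "one FPU per task" idea on Linux is CPU affinity: pin each task's process to a single integer core and use only one core out of every pair. The sketch below is not a tested recipe. It assumes the two integer cores that share an FPU are numbered consecutively (0 and 1, 2 and 3, ...), which needs checking against the CPU topology the system actually reports, and it makes no claim about how much faster, if at all, a task would run. The function names are hypothetical.

```python
import os


def one_core_per_module(total_cores, cores_per_module=2):
    """Pick the first core ID of every module (assumes consecutive numbering)."""
    return list(range(0, total_cores, cores_per_module))


def pin_process(pid, core_id):
    """Restrict one process to one core (Linux only)."""
    os.sched_setaffinity(pid, {core_id})


if __name__ == "__main__":
    # 64 integer cores with 2 per shared FPU leaves 32 usable core IDs.
    cores = one_core_per_module(64)
    print(len(cores), cores[:4])
```

Running 32 pinned tasks this way would go together with an `<ncpus>` value of 32 or less, so the client does not start more tasks than there are selected cores.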
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Since you've got plenty of RAM and your installed OS is Linux-based, I suggest you mount the BOINC data folder in RAM to bypass the I/O problem. Doing so, I've managed to run 7 CEP2 WUs plus 1 UGM WU at a time, smoothly and without errors, on a notebook equipped with an 8-thread processor, 8 GB of RAM and a 1 GB swap partition. Otherwise I couldn't run more than one CEP2 WU at a time, since my HDD isn't responsive enough and the system hung completely for a couple of minutes when WUs were starting simultaneously.
Each CEP2 WU needs 170 MB of RAM to run, plus 700 MB more for mounting the folder containing its thousands of files. The downside is that you'll lose all the work in progress whenever you reboot the system or a blackout occurs, since the BOINC data folder is stored in RAM instead of on the HDD. But if you're crunching 24/7, that isn't your case. Surely there are ways to set up a periodic backup and to suspend or hibernate the system without losing anything, but I'm not interested in them, so I can't advise you about that.
In Linux Mint I've used the following line in the /etc/fstab file to mount the BOINC data folder in RAM:
tmpfs /var/lib/boinc-client/slots tmpfs defaults,mode=1777,noatime,size=7G 0 0
If some WUs error out and the following error message appears in their log files on the Results Status page of your account at the WCG site, you have to increase the size of the tmpfs, or the size of the swap partition or swap file:
forrtl: No space left on device
forrtl: severe (38): error during write, unit 48, file [...]
[Edit 2 times, last edit by Former Member at Dec 15, 2015 7:55:58 AM]
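The per-WU figures above (170 MB to run, 700 MB of tmpfs) make for a quick back-of-the-envelope budget. The sketch below just does that arithmetic. The 2048 MB held back for the OS and other processes is an assumption picked for illustration, not a measured figure, and the function names are made up.

```python
RAM_PER_WU_MB = 170     # working memory per CEP2 WU, as quoted above
TMPFS_PER_WU_MB = 700   # tmpfs space per CEP2 WU, as quoted above


def tmpfs_size_mb(n_wus):
    """tmpfs space needed to hold the slot folders of n_wus tasks."""
    return n_wus * TMPFS_PER_WU_MB


def total_ram_mb(n_wus):
    """Working memory plus tmpfs for n_wus tasks."""
    return n_wus * (RAM_PER_WU_MB + TMPFS_PER_WU_MB)


def max_wus(ram_mb, reserve_mb=2048):
    """Largest WU count that fits in RAM after an assumed OS reserve."""
    per_wu = RAM_PER_WU_MB + TMPFS_PER_WU_MB
    return max(0, (ram_mb - reserve_mb) // per_wu)


if __name__ == "__main__":
    print(tmpfs_size_mb(7), total_ram_mb(7))       # the 7-WU notebook case
    print(max_wus(32 * 1024), max_wus(64 * 1024))  # the 32 GB and 64 GB boxes
```

For the 7-WU notebook this gives 4900 MB of tmpfs and 6090 MB in total. Under the assumed reserve, the estimate comes to 35 concurrent WUs on a 32 GB box and 72 on a 64 GB box, so on these figures RAM rather than core count would be the limit on the 32 GB machines.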