World Community Grid - View Thread - CEP2 AMD 4P Optimization

Thread: CEP2 AMD 4P Optimization

Thread Status: Active
Total posts in this thread: 58

This topic has been viewed 9885 times and has 57 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


CEP2 AMD 4P Optimization

Hi:
I'm all about doing as much work on this project as possible with the hardware I have
(this winter ... seasonal cruncher am I)

the only other crunching I have done is for F@H, so ... I'm trying to learn the ways of WCG

i have 4 dedicated crunching boxes

2x 4P 6174s (48 integer cores and 48 FPUs each) and
2x 4P 6276s (64 integer cores and 32 FPUs each)

3 of them have 32 GB of RAM and 1 has 64 GB

All of them are running Ubuntu 14.04.3 LTS

all of them have a GPU that is in use by GPUGrid ...

so they are down-cored by one to service the GPU ...
on the 48-core boxes this means each loses one integer core and its FPU
on the 64-core boxes each loses one integer core, but I don't know if it loses an FPU,
as 2 integer cores share one FPU ...
I have no idea if this makes a difference or not.

I don't know ...
if this software runs as many threads as you have FPUs or integer cores
or how much RAM it needs so that it won't use the slower swap file
or how many threads I can run simultaneously and still make the deadline
or how much HD space I will need for that
or how significant HD speed is

thats all the questions i can think of at the moment

can anyone give me some insight as to how CEP2 uses the [H]ardware and maybe some tips on how to maximize the [H]ardware

Thanks

[Dec 3, 2015 6:42:30 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2 AMD 4P Optimization

Let me change one thing ... I said above "with the hardware I have"

I might make some small changes in hardware if it makes a big difference
in productivity

for example ... if the "work unit threads" run on individual FPUs ... well
6174s have 48 while 6276s have 32 ... so I can run 16 more (-1 for GPU) on the 6174s ...

(yeah, I know, 6386s ... but they are outside of the budget)

or if the program needs 1 GB of RAM per core to not use the swap file
I might move some RAM chips around or upgrade some

or if HD speed makes a big difference I might upgrade to some small SSDs

or maybe a different OS is better

some stuff like that ... small tweak, big payoff

I'll listen to any suggestion ... except switching teams

smile

[Dec 3, 2015 7:37:43 PM]

SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline


Re: CEP2 AMD 4P Optimization

The overarching problem, and the key reason why by default only one CEP2 task is issued to a host, is that concurrent (re)starting brings almost any system to its knees. Many micromanage; some have even developed scripts to achieve staggered starting [which BOINC itself has no facility for].
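
A staggered start of the kind described above could be scripted roughly as follows. This is only a sketch, not anyone's actual script: it assumes the tasks were suspended beforehand, that your client's `boinccmd` accepts the `--task <project_url> <task_name> resume` form, and that a fixed delay between resumes is enough to spread out the start-up I/O. The function names, the default 300-second delay, and the injectable `run`/`sleep` parameters are illustrative choices.

```python
import subprocess
import time


def resume_commands(project_url, task_names):
    """Build one `boinccmd --task <url> <name> resume` command per task."""
    return [["boinccmd", "--task", project_url, name, "resume"]
            for name in task_names]


def staggered_resume(project_url, task_names, delay_s=300,
                     run=subprocess.run, sleep=time.sleep):
    """Resume suspended tasks one at a time, waiting delay_s between them.

    `run` and `sleep` are injectable so the schedule can be dry-run
    without a BOINC client. The 300 s default is a guess, not a
    measured value.
    """
    commands = resume_commands(project_url, task_names)
    for i, cmd in enumerate(commands):
        run(cmd, check=True)
        # No need to wait after the last task has been resumed.
        if i < len(commands) - 1:
            sleep(delay_s)
    return commands
```

The project URL and task names would come from the client itself (for example from the output of `boinccmd --get_tasks`); they are deliberately left as parameters here.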

[Dec 3, 2015 8:08:52 PM]

OldChap
Veteran Cruncher
UK
Joined: Jun 5, 2009
Post Count: 978
Status: Offline

Re: CEP2 AMD 4P Optimization

CEP2 is rather write-intensive, so if I were building a rig specifically to crunch this I would be thinking in terms of running a sizeable RAM disk. I would also be considering an SSD that is rated for a large volume of writes over its lifetime.

Ubuntu is fine

No real knowledge of AMD to add anything useful

[Dec 3, 2015 8:19:27 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2 AMD 4P Optimization

OK OldChap ... Ram disk + SSD ... got it

SekeRob* ... "overarching" ... haven't heard of it ... school me ... plz biggrin

also, what I have found is that the client doesn't see more than 32 cores ... and there is some *.xml file that has to be edited ... losing 96 cores across my little server garden just because of that ...

its got my attention
raised eyebrow

[Dec 3, 2015 10:52:24 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2 AMD 4P Optimization

well SekeRob* .... i have a BUNCH of work Units that have "computational errors" in under 3 minutes
but also i have 19 running simultaneously ... on 32 cores ...
so ...
this server is not making enough Heat to make me happy
wink

[Dec 3, 2015 11:11:31 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7846
Status: Offline

Re: CEP2 AMD 4P Optimization

I would be pretty sure the reason you got the errors is that those WUs (work units) choked on inadequate I/O to your disk or SSD. What SekeRob* is saying is that at the beginning of a WU there is a lot of I/O which, given enough units at the same starting stage, will overwhelm even the fastest systems because the channels just cannot carry that much data at once. If your CPUs are starved for data too long they will throw an error. His suggestion of staggered starting times has worked for some, but keeping the starting times staggered on a 64-CPU system is going to be tough due to the varying lengths of the WUs.
Good luck
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Dec 4, 2015 1:24:16 AM]

Byteball_730a2960
Senior Cruncher
Joined: Oct 29, 2010
Post Count: 318
Status: Offline

Re: CEP2 AMD 4P Optimization

CEP2 workunits have long intervals between checkpoints.
So, if you are running the boxes 24/7, you'll have an initial need to micromanage the startup to reduce I/O, but after some time the reads/writes will settle down to a manageable level (I think) as the workunits eventually start staggering.
If you shut down the computers often, having this many workunits will lead to a lot of lost computing time and to constantly managing I/O.

I highly recommend a small SSD. My two big systems are not dedicated, but I bought a 32 GB SSD for each to keep WCG separate from my OS drive.

As for the cores being limited, I had to add a cc_config like this:

<?xml version="1.0"?>
<cc_config>
  <options>
    <ncpus>48</ncpus>
  </options>
</cc_config>

have a search online and you'll find a better explanation for your needs
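
If you would rather generate that file per box than hand-edit it, a minimal sketch could look like this. It only builds the same `<cc_config><options><ncpus>` structure shown above; the function name `build_cc_config` is made up for illustration, and where the file has to live and how the client picks it up (restart, or asking it to re-read its config) depends on your install, so check that for your setup.

```python
import xml.etree.ElementTree as ET


def build_cc_config(ncpus):
    """Return cc_config XML that overrides the CPU count seen by the client."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "ncpus").text = str(ncpus)
    # encoding="unicode" yields a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 48 matches the example above; a 64-core box would pass 64.
    print(build_cc_config(48))
```

The result is a single unindented line, which is still well-formed XML. The output would then be saved as cc_config.xml in the BOINC data directory.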

[Dec 4, 2015 2:25:48 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2 AMD 4P Optimization

OK, I think I am getting the picture ... I can't have systems dedicated to CEP2 only, even though they are dedicated crunching boxes running 24/7 with no reboots, and SSDs will help a lot.

I haven't done the cc_config.xml thing yet ... but the line <ncpus>48</ncpus> would be for a 48-core system. I read somewhere that if you put in <ncpus>-1</ncpus> it will "adjust" to how many physical cores you have. I'm going to try that.

I'm also wondering how this stuff runs on integer cores vs. FPUs. I am going to guess it's about FPUs, and on a 64-core system that has 2 integer cores per double-bandwidth FPU, I wonder if there is some way to assign one FPU per task and get tasks done (at least) twice as fast as having that FPU tasked by 2 different integer cores. Anyone got some insight on this?
thinking

and thanks all for the insights
biggrin

[Dec 4, 2015 9:34:43 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: CEP2 AMD 4P Optimization

Since you've got plenty of RAM and your installed OS is Linux based, I suggest mounting the BOINC data folder in RAM to bypass the I/O problem. Doing so, I've managed to run 7 CEP2 WUs plus 1 UGM WU at a time, smoothly and without errors, on a notebook equipped with an 8-thread processor, 8 GB RAM and a 1 GB swap partition. Otherwise I couldn't run more than one CEP2 WU at a time, since my HDD isn't responsive enough and the system hung completely for a couple of minutes when WUs were starting simultaneously.

Each CEP2 WU needs 170 MB of RAM to run, plus 700 MB more for mounting the folder containing its thousands of files. The downside is that you'll lose all the work in progress whenever you reboot the system or a blackout occurs, since the BOINC data folder is stored in RAM instead of on the HDD. But if you're crunching 24/7 with no reboots, that shouldn't affect you much.
Surely there are ways to set up a periodic backup and to suspend or hibernate the system without losing anything, but I'm not interested in them, so I can't advise you about that.

In Linux Mint I've used the following line in the /etc/fstab file to mount the BOINC data folder (here its slots directory) in RAM:

tmpfs /var/lib/boinc-client/slots tmpfs defaults,mode=1777,noatime,size=7G 0 0
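
As a rough sizing aid, the per-WU figures quoted above (170 MB of RAM plus 700 MB of slot files per CEP2 WU) can be turned into a quick estimate. This is a sketch under those two numbers only; they are one poster's observations rather than official requirements, and the function name and dictionary keys are made up for illustration.

```python
def cep2_memory_mb(n_tasks, ram_per_task_mb=170, slot_per_task_mb=700):
    """Estimate memory needed to run n_tasks CEP2 WUs with slots in tmpfs.

    The defaults are the per-WU figures quoted in this thread, not
    official requirements.
    """
    tmpfs_mb = n_tasks * slot_per_task_mb
    process_ram_mb = n_tasks * ram_per_task_mb
    return {
        "tmpfs_mb": tmpfs_mb,
        "process_ram_mb": process_ram_mb,
        "total_mb": tmpfs_mb + process_ram_mb,
    }


if __name__ == "__main__":
    print(cep2_memory_mb(7))   # the 7-WU notebook case
    print(cep2_memory_mb(47))  # a 48-core box down-cored by one
```

For 7 WUs this gives 4900 MB of tmpfs, which fits inside the size=7G used in the fstab line. For 47 WUs the same figures give 32900 MB of tmpfs and 40890 MB in total, more than a 32 GB box holds, so on those boxes either fewer CEP2 WUs or more RAM would be needed if the estimate is anywhere near right.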

If some WUs error out and the following error message appears in their log files on the Results Status page of your account at the WCG site, you have to increase the size of the tmpfs, or the size of the swap partition or swap file.

forrtl: No space left on device
forrtl: severe (38): error during write, unit 48, file [...]

----------------------------------------
[Edit 2 times, last edit by Former Member at Dec 15, 2015 7:55:58 AM]

[Dec 4, 2015 9:44:33 AM]
