World Community Grid Forums
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
Is this error caused by lack of memory after >9 hours of run time?
I see 'malloc' in the result log, which could mean memory allocation.

Result Name: ARP1_0034761_000_0--

    <core_client_version>7.14.2</core_client_version>
    <![CDATA[
    <message>
    The storage control block address is invalid.
     (0x9) - exit code 9 (0x9)
    </message>
    <stderr_txt>
    INFO: Initializing
    INFO: No state to restore. Start from the beginning.
    Starting WRFMain
    [10:14:56] INFO: Checkpoint taken at 2018-07-01_06:00:00
    rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 1 : Not enough space
    rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 2 : Not enough space
    rsl_malloc failed allocating 24911668 bytes, called ..\external\RSL_LITE\rsl_bcast.c, line 270, try 3 : Not enough space
    </stderr_txt>
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2346 Status: Offline
If we believe the error messages, Crystal Pellet, and there is no indication that we shouldn't believe them, then we see that the function rsl_malloc tried three times to allocate almost 25 MB of memory and failed in all three cases. The error message string can be found in the executable (on my Linux device it is an ELF 64-bit LSB executable):

    # strings wcgrid_arp1_wrf_7.27_x86_64-pc-linux-gnu | grep rsl_malloc

The obvious reason for the error messages comes from the operating system: "Not enough space". Since the storage area in the computer's memory can't be allocated, there is no valid address associated with it, which explains the other message: "The storage control block address is invalid".
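For illustration only, here is a minimal C sketch of the kind of retry-on-failure allocator the stderr output suggests rsl_malloc is. The function name, the three retries and the message format are taken from the log above; the real WRF/RSL_LITE source will differ in detail:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical sketch: try an allocation a few times and report each
     * failure the way the ARP1 stderr output does, then give up. */
    static void *try_alloc(size_t nbytes, const char *file, int line)
    {
        for (int attempt = 1; attempt <= 3; attempt++) {
            void *p = malloc(nbytes);
            if (p != NULL)
                return p;
            fprintf(stderr,
                    "rsl_malloc failed allocating %zu bytes, called %s, line %d, try %d : %s\n",
                    nbytes, file, line, attempt, strerror(errno));
        }
        return NULL; /* the caller decides whether this is fatal */
    }

    int main(void)
    {
        /* 24911668 bytes is the request that failed in the reported task. */
        void *buf = try_alloc(24911668, "..\\external\\RSL_LITE\\rsl_bcast.c", 270);
        if (buf == NULL)
            return 9; /* mimic the task's exit code 9 */
        free(buf);
        return 0;
    }

On a machine with free RAM this simply succeeds; the interesting path is the one the VM hit, where the operating system refuses even a 25 MB request.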
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline
Looks like not enough swap space.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline
Try allocating more disc space in device profiles.
Mike
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 865 Status: Offline
@Crystal Pellet, how much RAM does that machine have, and how many CPU cores/threads?
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline
    Try allocating more disc space in device profiles.
    Mike

BOINC is designed to pause a task ("waiting for memory") when it does not have enough memory available, i.e. this has nothing to do with the BOINC profiles.
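For reference, the limit behind that "waiting for memory" state comes from the BOINC computing preferences (memory usage), not from the disk settings. Locally it can be overridden in global_prefs_override.xml in the BOINC data directory; a rough sketch with example values only, using the standard preference names expressed as fractions of total RAM:

    <global_preferences>
       <!-- use at most 90% of RAM while the computer is in use ... -->
       <ram_max_used_busy_frac>0.90</ram_max_used_busy_frac>
       <!-- ... and at most 95% while it is idle -->
       <ram_max_used_idle_frac>0.95</ram_max_used_idle_frac>
    </global_preferences>

When the combined working sets of the running tasks exceed that limit, the client suspends one or more of them and shows "waiting for memory" until enough RAM is free again.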
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
    @Crystal Pellet, how much RAM does that machine have, and how many CPU cores/threads?

It's a Windows VM with 30 cores, only used for BOINC, on an Opteron Linux server.

    VM1: 10 Dec 09:40:47 max memory usage when idle: 23039.55 MB

The max for ARP1 is set to 10 tasks, and the rest is MCM1. I noticed that the ARP1's normally use about 700-750 MB of RAM, but during very short periods the usage can grow to 1012 MB. On that VM I also had ECM's running (elliptic-curve factorization method), and those tasks are a bit tricky with their memory usage: during ~70% of the run they use almost no memory, and during the last part the RAM goes sky high, up to 1800 MB depending on the type. Therefore I run them staggered, so that only a few tasks need the higher amount of memory at the same time. That must have been the reason for the lack of memory that somehow caused the ARP failure.
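A rough back-of-the-envelope using only the figures above (and assuming, purely for illustration, that five ECM tasks hit their peak at the same moment, with the ~23 GB figure shown as what the VM has to work with):

    10 ARP1 tasks x 1012 MB peak  ~ 10.1 GB
     5 ECM tasks  x 1800 MB peak  ~  9.0 GB
                                  ---------
                                  ~ 19.1 GB

Add the remaining MCM1 tasks and the OS itself, and the ~23 GB is quickly within reach, at which point even a 25 MB request from an ARP1 task can fail.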
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline
A VM will utilise some capacity, but not too much. Are you limiting arp1 in device profiles or app_config? Try limiting arp1 to 10 in Device Profiles and to 5 in app_config, and if that cures the problem, slowly increase. That way you will hold a cache of 10 units and only run 5 at a time. Not very scientific, but we have a long way to go at present rates, so we can take our time finding the best combinations.

The project will not suffer from your temporary lower throughput as there are lots of machines under-utilised.

Mike
[Edit 1 times, last edit by Mike.Gibson at Dec 10, 2019 1:31:41 PM]
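For anyone wanting to try this, the per-application concurrency cap goes in an app_config.xml file inside the project's directory under the BOINC data directory. A sketch along the lines Mike describes, assuming the application name is arp1 as used in this thread:

    <app_config>
       <app>
          <name>arp1</name>
          <!-- run at most 5 ARP1 tasks at the same time -->
          <max_concurrent>5</max_concurrent>
       </app>
    </app_config>

After editing the file, use the BOINC Manager's "read config files" option (or restart the client) for the limit to take effect; the device-profile limit of 10 then only caps how many ARP1 tasks are held on the machine.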
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1403 Status: Offline
I've limited the ARP1 to 10 in a device profile just because of their RAM-hungriness. I don't want to create a buffer of ARP1's, because I want them returned as soon as possible to stay 'reliable' for ARP1 (return within 2.5 days nowadays). The ARP1 runtimes on that VM are 47 to 52 hours, so when I see a new one has arrived, I push it to the running state by suspending an MCM1.
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12594 Status: Offline
10 in Device profiles and 5 in app_config would give you a cache of 2 per thread. If you are taking less than 24 hours per unit then they will all be returned within the 2.5 days. You can then slowly increase the 5 until you start getting the problem again. Should only take a few days to find the optimum.
Mike