World Community Grid - View Thread - Some HPF2 workunits aborting

Thread: Some HPF2 workunits aborting

Thread Status: Active
Total posts in this thread: 5


Viktors
Former World Community Grid Tech
Joined: Sep 20, 2004
Post Count: 653
Status: Offline

Some HPF2 workunits aborting

We have noticed that on BOINC and UD agents, a few of the HPF2 work units are aborting early. We are investigating the cause. No action is required on the part of members. Thanks for your patience.

[Jun 26, 2006 6:51:17 PM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline

Re: Some HPF2 workunits aborting

We have released version 5.07 of the code for the Human Proteome Folding - Phase 2 project on BOINC. This version should resolve a number of the problems members have been experiencing. In particular, it should significantly reduce the occurrence of the exit code 1282 and exit code -1073741819 errors.
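For readers trying to make sense of the large negative exit code mentioned above: it is easier to recognize when the signed 32-bit number is viewed as an unsigned hexadecimal value. The sketch below only does that conversion. The helper names are hypothetical and are not part of BOINC or the HPF2 application. On Windows, 0xC0000005 is the status value for an access violation; whether that is the root cause of these particular aborts is an assumption, not something stated in this thread.

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def as_hex(exit_code: int) -> str:
    """Format an exit code as an 8-digit hexadecimal string."""
    return f"0x{to_unsigned32(exit_code):08X}"


# The two exit codes named in the post above.
for code in (1282, -1073741819):
    print(code, "->", as_hex(code))
# 1282 -> 0x00000502
# -1073741819 -> 0xC0000005
```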

BOINC users will automatically receive the new version when the client connects to the server to download new workunits. If you wish to get the new version of the application immediately, you can reset the project (open the BOINC Manager, go to the Projects tab, select 'World Community Grid', and then click on the 'Reset Project' button).

We apologize for these problems and appreciate your patience and support while we resolve them.

[Jun 28, 2006 1:46:41 PM]

knreed

Re: Some HPF2 workunits aborting

This release is working as expected. We have received 4330 results back that were run using the 5.07 release of Human Proteome Folding - Phase 2. We have experienced no 1282 errors and only a few of the '-1073741819' errors. The -1073741819 errors that are occurring are on machines that are 'cyclers'. These machines typically have some problem that causes them to run the project incorrectly.

Users should be aware of one item. The update from 5.06 to 5.07 has caused floating-point values to be computed slightly differently. This has no effect on the scientific value of the data computed. However, it does mean that a result returned by the 5.06 application will not match a result returned by the 5.07 application. As a consequence, we will experience a period with a higher than normal number of inconclusive and invalid statuses assigned to results returned by users.
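As a generic illustration of why two builds of the same program can disagree in the last digits: floating-point addition is not associative, so a compiler that reorders or regroups operations can change the final bits of a result. The sketch below is not the World Community Grid validator, and the tolerance value is an assumption chosen only for the demonstration.

```python
import math

a, b, c = 0.1, 0.2, 0.3

# The same three numbers, summed in two different groupings.
left = (a + b) + c
right = a + (b + c)

# A bit-for-bit comparison treats the two sums as different results.
exact_match = left == right

# A comparison with a small relative tolerance (assumed value) accepts both.
close_match = math.isclose(left, right, rel_tol=1e-12)

print(repr(left), repr(right))
print("exact:", exact_match, "tolerant:", close_match)
```

This shows only the general mechanism; the thread does not say how results from the two versions are actually compared.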

Users who receive an invalid result will still receive credit for their work. They will get full credit for the result and the time they spent working on it. They will be awarded either the canonical credit or the claimed credit, whichever is less.
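The credit rule described above amounts to taking the smaller of two numbers. A minimal sketch, with a hypothetical function name and made-up example values:

```python
def granted_credit(canonical: float, claimed: float) -> float:
    """Return the lesser of the canonical credit and the claimed credit."""
    return min(canonical, claimed)


# Example values are invented for illustration only.
print(granted_credit(canonical=42.0, claimed=50.0))  # 42.0
print(granted_credit(canonical=42.0, claimed=30.5))  # 30.5
```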

We apologize for the inconvenience and we appreciate your continued support and contribution to our efforts.

[Jun 29, 2006 2:39:30 PM]

Viktors

Re: Some HPF2 workunits aborting

A new version of Rosetta is being used starting today for the UD agents. It should behave better with regard to the throttle settings, but some more work is forthcoming on this. It was also built with a newer version of the compiler and a larger stack size, which seems to have reduced the incidence of aborted work units in our tests. Your agent will automatically download the updated code. The first time your agent communicates with the server, the download will take somewhat longer because it includes the updated Rosetta code. After that, work unit downloads will return to their normal size. Sorry for any inconvenience, and thank you for your patience.

[Jul 4, 2006 4:42:43 PM]

Viktors

Re: Some HPF2 workunits aborting

There have been various posts in the forums about long-running, seemingly stuck HPF2 work units, work units that quit early, and ones for which different agents get divergent answers. Most of the work units seem to be processing normally and are completing properly. But we know that there are a few work units that behave in unusual ways. There are different causes for this. For ones that seem stuck for a long time, the Rosetta program is probably trying to figure out whether they are non-converging. Ones that quit early are probably subject to a subtle bug in Rosetta.

To figure out how best to handle and fix these work units, we need to identify them so that we can do further testing and debugging on them. Instead of terminating problem work units, it would be useful to the tech team if members identified the particular agent running the work unit (for example, using the UD device ID number shown in the agent's preferences window (checkmark icon)) and the UTC time and date at which it was running. We have asked the community advisors to help us collect information about these work units so we can use them in our investigations.

We are unable to find all such unusual work units in our testing prior to launch because they are relatively rare. On the production grid, we process a tremendous amount of work each day, and thus very subtle problems reveal themselves. Members who call attention to specific unusual work units will be doing us a great favor. Our behind-the-scenes testing of problem work units is very time-consuming. So if members simply let these unusual work units finish, we will be able to tell more about what was going on instead of losing that information.
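For members who want to note the two details requested above (an agent identifier and the UTC date and time), the sketch below converts a timezone-aware timestamp to UTC and formats a one-line note. The function name, the report layout, and the device ID value are all hypothetical; the thread does not prescribe any particular format.

```python
from datetime import datetime, timezone


def format_report(device_id: str, when: datetime) -> str:
    """Build a one-line note with the device ID and the time in UTC.

    `when` must be timezone-aware so the conversion to UTC is unambiguous.
    """
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = when.astimezone(timezone.utc)
    return f"device={device_id} utc={utc.strftime('%Y-%m-%d %H:%M:%S')}"


# Example with the current time; "12345" is a placeholder device ID.
print(format_report("12345", datetime.now(timezone.utc)))
```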

We will probably be making some changes in Rosetta to speed up the detection of non-convergent work units, and to make the progress bar show finer progress increments or use some other means to show whether the work unit is "stuck". Finally, there seems to be a subtle bug that aborts a few work units. Some of these work units have to run a long time to get to the point where the problem occurs, and shortcuts seem to hide the bug in some cases. So the testing and debugging of these requires a lot of time. Please be patient with us as we take care of these problems. In addition, our team is extra busy, with its time divided across project work as we get an additional research project ready for launch very soon. So, thank you for your patience and assistance.

[Jul 7, 2006 2:17:52 AM]
