World Community Grid Forums
Category: Completed Research | Forum: Help Defeat Cancer | Thread: The truth about memory / disk usage.
Thread Status: Active. Total posts in this thread: 27
Eric-Montreal
Cruncher | Canada | Joined: Nov 16, 2004 | Post Count: 34 | Status: Offline
Since the "Help Defeat Cancer" project started, my main machine (Athlon 2.5 / 1024 MB RAM) has had prolonged periods where it's completely crippled, sometimes for more than 2 minutes in a row. When this happens, even typing on the keyboard is choppy at best.
I've read the threads in here about it being either non-existent or the result of a badly misconfigured machine swapping like crazy. That seemed quite strange, despite the (IMHO unreasonably high) memory usage of the HDC software, and something seemed fishy about it, so I decided to dig a bit. Sure enough, it's not a swapping problem.

First, in the Windows Task Manager, I enabled (in Display) the I/O Read, I/O Write and I/O Misc columns and discovered that each work unit was responsible for reading and writing a total of around 15 GB of data! These numbers do *not* include Virtual Memory Manager swap activity, only files read/written by the application. Here is a screen capture (read + write = 14 GB): http://montreal.pages-web.com/images/taskmanager.png

Using Sysinternals' "File Monitor" ( http://www.microsoft.com/technet/sysinternals/FileAndDisk/Filemon.mspx ) I found out it had huge bursts of activity on C:\Program Files\WorldCommunityGrid\tkop.ud and C:\Program Files\WorldCommunityGrid\tkop.ud~, and that was the cause of the complete lack of responsiveness during that time. Here is a view of it: http://montreal.pages-web.com/images/filemon.png

Clearly, it's copying from one file to the other, but using a ridiculously small 4 KB buffer! This causes a huge number of seek/read/seek/write cycles and is about the worst possible, most inefficient way to copy a file. For people lucky enough to have a SATA II (or SCSI!) drive with NCQ enabled and/or a dual-processor machine, the effect is somewhat smaller, but still far from optimal. Better hardware is for better performance, not for hiding incredibly bad programming methods. The files themselves are not big (the max I've seen is around 600 MB), and their size changes a lot during work unit execution, with the activity peaking at around 1/3 of the total execution time.
It seems to write 2 copies of the data to those files, then read both of them (maybe to compare their content), and it does so many times in a continuous burst before it deletes one of them. OK, so maybe it's justified by the whole thing being "rocket science" far beyond what my tiny mind will ever grasp, but from what I'm seeing, it looks more like sloppy programming and a complete disregard for the user giving you the *privilege* of using his/her machine. A complete audit of the software you intend to run on our machines, including its impact on the user, should be a crucial step in the evaluation of any project submitted to you before you accept it. The whole principle was that you would be using the *idle* time on our machines, with a clear understanding that the whole process should occur in the background with no or very little degradation of the user experience.

The Human Proteome Folding project was near perfect. It used little memory and few disk accesses, and it could be installed on any (reasonably equipped) machine with near-zero impact on the user. At the time, I recommended that everyone give it a try, and even started the Montreal team (about 70 people to this day).

Then came "FightAIDS@Home". The project seemed very worthwhile despite its well-above-reasonable memory usage, which led to clearly noticeable performance degradation on machines with up to 512 MB of RAM. At that time, I started getting a few complaints ("my machine is slow, can you check if I have a virus?") and removed it, or limited it to screensaver mode, on machines that were a bit low on memory.

"Help Defeat Cancer" turned it from being a small annoyance into a major, repetitive disruption for the user. This abusive behavior by the "Help Defeat Cancer" project is simply unethical and not acceptable. I have completely stopped promoting it, and I'm thinking about simply disabling this project on my own machine.
Given the situation, and the current trend toward ever-increasing resource usage (abuse), I find it impossible to recommend that anyone run it on their machine (or maybe only as a screensaver), and I think it explains why the WCG project does not have far more users despite its very valuable goals. I simply don't want to encourage anyone who trusts me with their computer to use something that will behave badly and backfire.

Here are the steps I think should be implemented ASAP to correct the issue:
- Rewrite the copying routine to use a much larger buffer (you're using 500+ MB of RAM; a 512 KB or 1 MB buffer won't make much of a difference!)
- Have a look at other similar "Programming 101" mistakes in I/O and memory allocation.
- At the very least, throttle the I/O operations so that the machine is not crippled by them.
- If the file is used only for backup, then skip the backup entirely. Execution time for HDC is usually between 1.5 and 3 hours, so it should complete flawlessly most of the time. If not, restart it at the next power-up.
- Do not duplicate the file (tkop.ud and tkop.ud~); if your concern is data integrity, use an integrity check (CRC) instead.
- Have the software audited and optimized to reduce its memory footprint.
- If possible, repartition the data into smaller chunks.

In the long run, and for future projects, you should use proactive measures to ensure that the user experience is not negatively impacted by WCG; this is the key to wider acceptance. Namely, you should monitor overall machine activity and behave accordingly. Here are a few mechanisms that could be used:
- Before, and while, performing disk or network I/O, carefully monitor system activity indicators to detect use of these resources by other applications. Whenever, and for as long as, significant activity by other applications is detected, slow down or simply stop using the resource.
- Monitor total memory load, and when the WCG software would cause a significant amount of paging, simply stop it and wait for either a lower memory load or screensaver mode (slowing it down would have nearly no impact on memory swapping to disk).
- Monitor user activity (mouse/keyboard) and adapt to the situation. Spyware does it to avoid tipping off the user about its presence; I guess you should be able to do it too...
- Force the people who develop the payload software to optimize it for a small footprint and the lowest possible user disturbance, and reject projects that don't meet those quality goals.
- If some application really can't be optimized, then make participation in that project either off by default (display a dialog warning the user about the new project, disclosing how it will use the machine, and ask the user), and/or put it in 'screensaver only' mode.
- Monitor long-term machine activity and feed each machine work units from projects that are a good fit for its capabilities.
- Establish strict guidelines for the next projects.

People installed WCG on their machines for the ethical/useful/socially responsible aspect of the project; otherwise we'd be running SETI or a fancy screensaver instead. Please show some respect by using our machines responsibly.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
It looks like you missed the explanation of this behaviour.
The problem is not WCG's or the project scientists' fault. The problem lies with the UD agent. The UD agent encrypts all files, whether it needs to or not. As you can see, it does this very inefficiently, and HDC writes abnormally large checkpoint files. This issue does not exist if you use the BOINC agent. Unfortunately, United Devices are not actively developing the grid agent used by WCG, and have been disinclined to address any issues that have arisen. Use BOINC.

We have asked that this be added to the list of "known problems", so people know what is going on and can choose whether to run HDC or not based on all the facts.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I should add:
WCG have already throttled the I/O, due to concerns about the poor encryption performance. As a result, it isn't an issue for most people. Perhaps your computer has DMA disabled, or some similar problem is making this worse.

WCG have also reduced the memory requirements of HDC from a whopping 1.2 GB to the current peak of around 500 MB (while checkpointing). WCG don't place unnecessary restrictions on projects with regard to memory and disk use. Some science does need a large amount of physical memory, for example certain matrix operations. What WCG can and do do is reflect the memory needs honestly in the minimum requirements for the project. As a result, projects with high requirements should only run on computers equipped to deal with the load.

The good news is that at least one of the upcoming projects will have extremely low memory requirements. This will keep lower-end computers busy. Finally, the upcoming version of BOINC has new features for better memory handling, with a view to avoiding paging where possible. This is still in development.
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
Hi Eric-Montreal,
I ditched the UD agent 6 months ago for BOINC as my main agent. It does a much better job, and as you can read in other posts from a few days ago, one user who could not run FAAH on his lower-spec machine under UD was asked to try BOINC and managed to get it running without much ado: http://www.worldcommunitygrid.org/forums/wcg/viewthread?thread=9648#75753

Sure, if you read my posts on the 'real memory needs' of HDC, it is a heavy-duty project, and in fact the 750 MB minimum RAM is not for fun; even 1 GB is a stretch. But a few overlooked items, plus conversion to BOINC, can tremendously improve the experience. That's not only a well-defragged disk and page file, but also a setting in BOINC that by default writes to disk every 60 seconds. I've set it to its max of 999 seconds, which made a big impact. Further, throttle support is under development for BOINC. Myself, for both the UD agent and BOINC, I use the ThreadMaster utility and set the activity to 80% CPU time. That does not change the disk writing; that unfortunately happens at a higher priority and cannot be controlled by the agents.

As Didactylos points out, with BOINC you don't have this 4 KB buffer cram. With FAAH, HDC and the about-to-restart HPF2 on BOINC all having a graphics screen similar to the UD agent's, there is little to nothing left to stick with UD for, be it on single-core or multi-core PCs. With the work-buffer function, it in fact never stops computing, even while sending and receiving results. Anyway, it does not cause any choppiness on my almost 5-year-old machine during any typing (oh, and the mentioned DMA is on, which it not always is by default).

cheers
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
Alther
Former World Community Grid Tech | United States of America | Joined: Sep 30, 2004 | Post Count: 414 | Status: Offline
Good job on the analysis, but you're pointing the finger at the wrong software.
The problem with the disk activity you are seeing is directly related to the UD agent. HDC writes very large checkpoint files. To counter this, I've implemented an I/O throttling mechanism, which works on both UD and BOINC. Checkpointing is a three-step process. For UD it goes like this:
Steps 1 and 2 happen at the science app level. This is what we have control over. Code running at these levels also runs at the lowest priority. Step 3, on the other hand, happens in the UD agent. This is where the disk I/O problem is occurring. When the science app tells UD to checkpoint, UD makes an encrypted backup of all the checkpoint files and verifies them. This is the piece that is impacting some users. Since it is in the UD agent code itself, we have no control over it. Not checkpointing is not an option: many people have slower machines or are running with a reduced throttle, and they don't want to lose much work if they have to restart. This problem doesn't occur with BOINC because the files aren't stream-encrypted like they are in UD.

As for some of your other points:

Memory usage - Remember, we don't write the science apps. We just port them to the grid platform. We do security audits on the software and make suggestions if we see anything that could be easily improved, but we are not going to rewrite anyone's software. These apps have been under development for years. It's their software, not ours. We have made improvements and fixed bugs in all the apps so far and given those changes back, but for the most part, if we have issues with the software, we ask the original implementors to fix them, as they are in the best position to do that. One example is HDC. When we first looked at the app, it was consuming, on average, 1.2 GB of RAM. This was way too much, and we worked with them to lower the footprint to its current svelte fighting weight of a mere 500 MB. This is a huge savings. You also have to remember that these science apps are, for the most part, massive data-crunching programs. They really do use that much memory. HDC is actually an image processing application, and image processing takes a lot of memory. Many of these apps have been under development for years and were ported from other languages as they evolved.
Many are written at universities and run on high-powered servers and clusters. Results and accuracy are the top priority for them, not memory efficiency and processing speed. Now that they're on the grid, all have made improvements to the code to help with these issues. Believe me, we are well aware of the memory requirements and work with the app owners to make the apps as small and efficient as possible without a complete rewrite. To reflect this, and to have as small an impact on users as possible, we set minimum requirements for all apps. You can always opt out of a specific project if you wish, or set up a schedule so it runs during "off" hours when you're not likely to be using the machine.

Most of your other points deal with the agent directly. As I said, we have no control over the UD agent, but the BOINC agent is gaining many of these features. Look for them in future versions; some are not far off.
Rick Alther
Former World Community Grid Developer
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
I converted from the UD agent to the BOINC 5.7.2 beta code, and am very happy I made the switch! HDC runs without causing performance degradation on an Athlon 3200 with 1 GB of memory... while it caused performance problems on another machine with 1.5 GB of memory under the UD agent.

Thanks for your advice and insights into the differences between the UD and BOINC agents.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
also a setting in BOINC that is default set to disk-write every 60 seconds. I've set it to its max of 999 seconds, which made a big impact.
Where in BOINC do I find this setting? I've looked and I can't see it anywhere.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
It's in the device profile on the website. Start by going to My Grid.
Sekerob
Ace Cruncher | Joined: Jul 24, 2005 | Post Count: 20043 | Status: Offline
"also a setting in BOINC that is default set to disk-write every 60 seconds. I've set it to its max of 999 seconds, which made a big impact."
"Where in BOINC do I find this setting? I've looked and I can't see it anywhere."

It's on the BOINC profile here: http://www.worldcommunitygrid.org/ms/device/v...iguration.do?name=Default
Find the line: "Write to disk at most every: 999 seconds"

This morning it was 24168 BOINC users.... now +2 :D
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
Eric-Montreal
Cruncher | Canada | Joined: Nov 16, 2004 | Post Count: 34 | Status: Offline
I did read your replies, and I'm sorry to say, but you simply did not get the point.
I certainly could use BOINC, but whether my little AMD 2.5 or my T2050 laptop participates in one project or another was not my point. However, I checked everything (including IDE DMA mode) and I still had the problem, so I finally had to disable the Help Defeat Cancer project. Most other people with similar problems will simply uninstall the whole application and are very unlikely to participate in such a project again. This gives distributed computing a bad name.

The bigger picture (and the sad truth) is that for the average person who tries to participate with a Windows machine, the project as a whole will, more often than not, severely slow down their machine, and this makes participation impractical for most people and businesses. As a direct consequence, despite the laudable goals of this project, 2 years after its launch the top teams are still Easynews, Slashdot Users, IBM and Clubic, all IT/geek teams. Something must be wrong. With research projects on high-profile topics such as AIDS and cancer, you would expect the main teams to be the Cancer Society, HIV/AIDS organizations, online health websites, leading breast cancer support groups, etc.

Endorsement of the project could easily be used by computer vendors and mainstream media / consumer brands to show 'good corporate citizenship', leading to millions of new participants, but as long as the adverse effects for the user (slowdowns, huge resource use) are comparable to a spyware infection, it's simply impossible to recommend it (and I don't care whether it's the payload or UD that's responsible). I have very good contacts within the 2 major HIV/AIDS organizations here in Montreal (Sero-Zero & Farha Foundation), but, sorry, there's no way I will even try to convince them to endorse WCG, because I know it's going to create performance problems for the people who install it on their machines, and it'll backfire against me and might even hurt the reputation of these organizations.
About BOINC
-----------
People are aware of internet malware and are naturally suspicious. Add to this the countless medical hoaxes, and you have to be perfectly clean to convince them to install your software on their machine. The website itself rightfully conveys that trustworthy image. But if you tell them to participate, yet they should not use the software you provide by default (because it's a mess that won't be fixed), and instead they should go to a very geeky site and download a somewhat experimental version of yet another piece of software called BOINC (BTW, http://boinc.berkeley.edu/download.php?min_version=5.2 is *not* a page anyone except geeks will feel comfortable getting such software from), then add ThreadMaster, configure the whole thing for... Forget it, game over! You've lost 95% of them, the natural reaction being: "You'll come back to me once it's not such a mess." If BOINC is *that* good, then make it the default choice for Windows users. If it's not ready yet, use UD only for projects where its performance is acceptable.

As an early user (11/16/2004) who was so enthusiastic about this project that I started a team (Montreal, 74 users), I really hope the situation will finally improve and WCG projects can be made a bit more respectful of the people donating their computational power, running without 'collateral damage', so that it's again possible to recommend it to anyone without worrying about them being mad at you and calling you to remove it a week later.

---------------------
Alther wrote: "Remember, we don't write the science apps. We just port them to the grid platform. We do security audits on the software, make suggestions if we see anything that could be easily improved. We are not going to rewrite anyone's software."

You don't write the apps, but you can simply refuse them until they behave properly. If they really think WCG's involvement is important, they'll comply.
Writing software that runs in the lab on a supercomputer with unlimited RAM and resources, and writing software that runs in the background on millions of machines, have different requirements, and if they want the benefits of using our machines, they'd better learn the rules.