World Community Grid Forums
Thread Status: Active | Total posts in this thread: 17
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
I've been playing with the web API recently, and I realized it returns duplicated results.

In my case, even if I set the query limit to 1000, I get at most 250 results back. Say there are 730 results; I do:

  Fetch from offset=0: returned 250 results
  Fetch from offset=250: returned 250 results
    Keys already exist:
      OET1_0003699_x3MWPp_rig_61457_0
      ugm1_ugm1_25075_0758_1
      ugm1_ugm1_25075_0679_0
  Fetch from offset=500: returned 230 results
  Results returned from web: 727

When I combine them, the total is not 730 but 727 results. The three duplicated keys have exactly the same data in both queries. It's not hard for me to work around this, since Python's dict update handles it naturally anyway. I'm just wondering whether it's something the WCG team would care about, since it might be a bug somewhere. If this is not the right place to report API issues (or general bugs), please let me know and I can move it to the appropriate channel. Thanks.
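For readers following along, here is a minimal sketch of the paginated fetch with dict-based de-duplication described above. The real WCG endpoint, its query parameters, and the result field names are not shown; `fetch_page()` is a stand-in stub over fake data, so only the looping and merging logic is illustrated.

```python
PAGE_SIZE = 250  # observed server-side cap per request

# Fake backing data: 730 results keyed by name. A real fetch_page()
# would issue an HTTP request with limit/offset query parameters.
_FAKE_DB = [{"Name": f"wu_{i:04d}", "ValidateState": 1} for i in range(730)]

def fetch_page(offset, limit=PAGE_SIZE):
    """Stand-in for one API call; returns up to `limit` results."""
    return _FAKE_DB[offset:offset + limit]

def fetch_all():
    results = {}  # keyed by result name, so repeated keys merge silently
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:                # empty page => nothing left
            break
        for wu in page:
            results[wu["Name"]] = wu
        if len(page) < PAGE_SIZE:   # short page => last page
            break
        offset += len(page)
    return results

all_results = fetch_all()
print(len(all_results))  # 730 unique work units
```

Because the dict is keyed by result name, a work unit returned twice (as in the log above) simply overwrites its earlier copy instead of inflating the total.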
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
Since the Result Status data is 'live' [dynamic] and the fetch per call is restricted to 250, there is a good chance that a subsequent call will fetch a result again: in the meantime, any of the first 250 results, or a later result in the fetch order, may have changed status and moved to the top, shifting everything one or more places down. The faster your internet connection and the more tightly your consecutive queries run, the less chance this will occur.
----------------------------------------
[Edit 1 times, last edit by SekeRob* at Apr 24, 2016 8:19:24 PM]
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
> Since the Result Status data is 'live' [dynamic] and the fetch per call is restricted to 250, there is a good chance that a subsequent call will fetch a result again, as in the meantime any of the first 250, or a later result in the fetch order, may have moved to the top, shifting everything one or more places down. The faster your internet connection and the more tightly your consecutive queries run, the less chance this will occur.

I thought about this, because I do see the order changing sometimes. However, the overall fetched results should match the number of available results across multiple back-to-back runs, and each run only takes seconds. In my tests, the same WUs have consistently shown up twice. That's why I think the data is duplicated, rather than it being a timing issue on my side.

Honestly it's not really a concern for me, given that I can tolerate losing some stats. If it's just a timing issue, I would probably end up fetching all results anyway, since I run this periodically.

I wonder why the limit is capped at 250? It doesn't really help, since most people have to loop and query all results anyway. If the goal is to save server resources, it might be much more efficient to let people pass in a ModTime and only return WUs modified since that time.

[Edit 1 times, last edit by wujj123456 at Apr 24, 2016 8:51:25 PM]
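A minimal client-side sketch of the ModTime watermark idea suggested above. The web API does not expose such a filter, so this is purely illustrative; the field names and the epoch-seconds representation are assumptions.

```python
def newer_than(results, since):
    """Keep only results modified strictly after `since` (epoch seconds)."""
    return [r for r in results if r["ModTime"] > since]

# Hypothetical batch as returned by a full fetch.
batch = [
    {"Name": "wu_a", "ModTime": 100},
    {"Name": "wu_b", "ModTime": 250},
    {"Name": "wu_c", "ModTime": 300},
]

last_seen = 200                      # watermark from the previous run
fresh = newer_than(batch, last_seen)  # only wu_b and wu_c survive

# Advance the watermark for the next periodic run.
last_seen = max((r["ModTime"] for r in fresh), default=last_seen)
print([r["Name"] for r in fresh], last_seen)
```

If the server supported the same filter, the loop in the previous post would transfer only the handful of recently modified work units instead of the whole result set each time.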
Tullus
Cruncher | Joined: Nov 14, 2008 | Post Count: 29 | Status: Offline
Hi wujj123456. If you are playing with the web API from Python, you might be interested in parts of:
https://code.google.com/archive/p/py-boinc-plotter/
You will most likely want to look at parser.HTMLParser_worldcommunitygrid and task.Task_web_worldcommunitygrid. I haven't bothered to move the project away from google.code, so some of it might be a bit dated, but it should not be too bad.
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
> Hi wujj123456. If you are playing with the web API from Python, you might be interested in parts of: https://code.google.com/archive/p/py-boinc-plotter/ You will most likely want to look at parser.HTMLParser_worldcommunitygrid and task.Task_web_worldcommunitygrid. I haven't bothered to move the project away from google.code, so some of it might be a bit dated, but it should not be too bad.

Thanks. Well, I guess you missed Google's announcement a year ago: http://google-opensource.blogspot.com/2015/03/farewell-to-google-code.html

Google Code is gone, along with all the projects hosted on it. :-( From the method names, are you parsing HTML instead of using the API?
SekeRob
Master Cruncher | Joined: Jan 7, 2013 | Post Count: 2741 | Status: Offline
>> Since the Result Status data is 'live' [dynamic] and the fetch per call is restricted to 250, there is a good chance that a subsequent call will fetch a result again, as in the meantime any of the first 250, or a later result in the fetch order, may have moved to the top, shifting everything one or more places down.

> I thought about this, because I do see the order changing sometimes. However, the overall fetched results should match the number of available results across multiple back-to-back runs, and each run only takes seconds. In my tests, the same WUs have consistently shown up twice. That's why I think the data is duplicated, rather than it being a timing issue on my side. [...] I wonder why the limit is capped at 250? It doesn't really help, since most people have to loop and query all results anyway. It might be much more efficient to let people pass in a ModTime, if the goal is to save server resources.

On the emphasized point: no, because between the first and second fetch request one or another canonical result could already have been migrated off; those are the misses. To give an approximation, 12 results come on and another 12 come off per second [a million-plus per day], which on accounts that have thousands of entries on their Result Status pages leads to continuous reordering due to status changes and the [ModTime] of the momentary transaction.

Duplicates in the database: on a willy-nilly system that would be possible, yes. (The Result Status pages are a direct window into what's going on in the core BOINC task/result scheduling system.)

[Edit 1 times, last edit by SekeRob* at Apr 25, 2016 11:39:23 AM]
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
I tried a few more times, and it looks like the duplicates aren't always the same, even though the number of duplicates I get for the same number of total results is mostly the same. So it does look like a random timing issue, at least across longer periods. I guess the repeated offenders I got yesterday might just be a coincidence.

Looking more closely at the fields, the results don't seem to be ordered in any way. (I am only querying results with ValidateState=1.) However, on the website there is a way to order results. Do you happen to know if I can specify an order with the web API? It's an SQL query in the end, but not necessarily exposed through the web API, I suppose. All I need is some ordering that stays stable for a minute, so that I can deterministically get all results back; anything other than ModTime will probably work. (I assume getting all results in one shot is off-limits, given that a cap was implemented in the first place.)
Tullus
Cruncher | Joined: Nov 14, 2008 | Post Count: 29 | Status: Offline
> Thanks. Well, I guess you missed Google's announcement a year ago: http://google-opensource.blogspot.com/2015/03/farewell-to-google-code.html Google Code is gone, along with all the projects hosted on it. :-( From the method names, are you parsing HTML instead of using the API?

Yes, I know about the Google Code farewell; I was just too lazy to move. I was about to tell you that the 'downloads' still work, but found they are mostly empty .tar files, so I must have messed up. I've moved it to GitHub: https://github.com/obtitus/py-boinc-plotter/ I still need to move the wiki somehow.

I was originally parsing the HTML, hence the name :) I am still parsing XML for the badges and such.
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7846 | Status: Offline
> (I am only querying results with ValidateState=1.)

Even doing this, you will get duplicate units where everything is the same except for the mod time. I have stopped wondering why the mod time would change on a unit which has already been validated; it must make sense to the techs. Once I stick my query results into a spreadsheet, I can deal with the duplicates.

Cheers
Sgt. Joe
*Minnesota Crunchers*
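The spreadsheet de-duplication described above can also be done in a few lines of Python: treat rows as duplicates when every field except ModTime matches, and keep the most recently modified copy. The column names here are assumptions for illustration.

```python
# Hypothetical rows as they might land in a spreadsheet export:
# the first two are the same validated work unit, differing only
# in ModTime.
rows = [
    {"Name": "wu_a_0", "ValidateState": 1, "Credit": 120.5, "ModTime": 1000},
    {"Name": "wu_a_0", "ValidateState": 1, "Credit": 120.5, "ModTime": 1042},
    {"Name": "wu_b_1", "ValidateState": 1, "Credit": 98.7,  "ModTime": 1010},
]

key_cols = ("Name", "ValidateState", "Credit")  # everything except ModTime

# Keep only the latest copy per key.
latest = {}
for r in rows:
    key = tuple(r[c] for c in key_cols)
    if key not in latest or r["ModTime"] > latest[key]["ModTime"]:
        latest[key] = r

deduped = sorted(latest.values(), key=lambda r: r["Name"])
print(len(deduped))  # 2 rows survive
```

Keying on everything except ModTime is exactly what makes the otherwise-identical validated duplicates collapse into one row.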
wujj123456
Cruncher | Joined: Jun 9, 2010 | Post Count: 38 | Status: Offline
> I have stopped wondering why the mod time would change on a unit which has already been validated. It must make sense to the techs.

This one I happen to have seen happening: ModTime can change after validation because the result files are deleted. That will eventually happen for all results, but yeah, from a user's point of view a result is no longer interesting once it has been validated.