Thread Status: Active
Total posts in this thread: 89
Posts: 89   Pages: 9   [ Previous Page | 1 2 3 4 5 6 7 8 9 | Next Page ]
Author
This topic has been viewed 464870 times and has 88 replies
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Screen Scrapers - Please Discuss

My Statistics and My Team to capture daily Project Stats for myself and the UK Team ( Project order is not a concern).

Capture of All Time Stats and Last Result Returned for individual Team Members via Multiple Member Comparison .

For these you can use the XML - right?

I also occasionally use a screen scrape to capture data for members identified as having a Great Britain location in Statistics by Geography . (Data by country does not appear to be available in XML format.)

Unfortunately - it is not available.

Is it anticipated that the data currently available in XML format will be affected by the ongoing website redesign?

Not in the near future. Down the road (mid-2014 at the earliest) we will be developing some better visualizations of the data and that is going to require a better API to be developed than what we have now. We might deprecate it sometime after that.
----------------------------------------
[Edit 1 times, last edit by knreed at Feb 5, 2014 3:34:52 PM]
[Nov 12, 2013 12:06:02 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3716
Status: Offline
Re: Screen Scrapers - Please Discuss

However, if you have fr_fr as your primary language in your browser, then when you arrive on our site, the French formatting rules should be in effect. Let me know if you see otherwise.
It works as you say, Kevin, and now I can rebuild the scenario of my mysterious changes:
Sometimes I switch languages for whatever reason, and when I come back to French the "Canadian" setup leaves me with the wrong format. And since I log off and on only once a month to enroll my team in new challenges, I could stay with the wrong format for several days.

Now I know how to quickly fix it by correcting the parameter in the address line if necessary, and anyway I have forced the correct language setting in all my stats bookmarks now.
So everything is fine for me. :)

Thanks Kevin.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Nov 12, 2013 8:14:28 AM]
jonnieb-uk
Ace Cruncher
England
Joined: Nov 30, 2011
Post Count: 6105
Status: Offline
Re: Screen Scrapers - Please Discuss

My Statistics and My Team to capture daily Project Stats for myself and the UK Team ( Project order is not a concern).

Capture of All Time Stats and Last Result Returned for individual Team Members via Multiple Member Comparison .

For these you can use the XML - right?


I've switched All Time Stats and Last Result Returned to XML :) but I can't make XML work for the UK Team Project Stats :(

Any suggestions?

that is going to require a better API to be developed than what we have now.

I'll look forward to that. :D
----------------------------------------

To Join follow this link: Join the UK Team All Welcome! UK Team thread
[Nov 12, 2013 8:46:26 AM]
Tullus
Cruncher
Joined: Nov 14, 2008
Post Count: 29
Status: Offline
Re: Screen Scrapers - Please Discuss

I do scraping of the public xml on: /verifyMember.do?name={name}&code={code}

In addition I scrape the html task list, in a similar manner to WCGDAWS, although my program didn't break ;)

My program (which is available here: https://code.google.com/p/py-boinc-plotter/), works for multiple boinc projects, but has to treat worldcommunitygrid as an exception in many parts of the code.

If you could integrate better with the standard boinc environment, that would be fantastic, either by contributing so that other boinc projects can utilize your work, or by utilizing/modifying the existing boinc webpage structures. In this way your webpage would benefit from the open source community, and the open source community would benefit from you.

In addition, extending the xml support would be great, since parsing html is a pain.
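
To illustrate why XML is easier to consume than HTML, here is a minimal sketch in Python. Only the path `/verifyMember.do?name={name}&code={code}` comes from the post above; the `SAMPLE` document and its tag names are hypothetical, since the real response layout is not shown in this thread, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def verify_member_path(name, code):
    """Build the path quoted above, URL-encoding both values."""
    return "/verifyMember.do?" + urlencode({"name": name, "code": code})


# Hypothetical sample only: the real response layout is not shown in this thread.
SAMPLE = """<MemberStats>
  <Name>example</Name>
  <Points>12345</Points>
  <Results>67</Results>
</MemberStats>"""


def leaf_values(xml_text):
    """Map each leaf element's tag to its stripped text.

    If a tag occurs more than once, the last occurrence wins.
    """
    root = ET.fromstring(xml_text)
    values = {}
    for elem in root.iter():
        if len(elem) == 0:  # no child elements, so this is a leaf
            values[elem.tag] = (elem.text or "").strip()
    return values
```

Because the values are looked up by tag name rather than by position in the page markup, a cosmetic redesign of the website would not break a parser like this, provided the XML itself stays the same.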
[Nov 12, 2013 3:43:42 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Screen Scrapers - Please Discuss

I have three principal instances of screen scraping:

My Statistics and My Team to capture daily Project Stats for myself and the UK Team ( Project order is not a concern).

Capture of All Time Stats and Last Result Returned for individual Team Members via Multiple Member Comparison .

I also occasionally use a screen scrape to capture data for members identified as having a Great Britain location in Statistics by Geography . (Data by country does not appear to be available in XML format.)


@jonnieb-uk

If you had:

http://www.worldcommunitygrid.org/stat/viewCo...untryCode=GB&xml=true

http://www.worldcommunitygrid.org/stat/viewCo...untryCode=GB&xml=true
(sort could be any of cpu, points or results)

http://www.worldcommunitygrid.org/stat/viewCo...untryCode=GB&xml=true
(sort could be any of cpu, points or results)

Would that meet your needs?
[Nov 13, 2013 6:52:38 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Screen Scrapers - Please Discuss

I'm parsing the HTML to pull from the "MY CONTRIBUTION", "Global Statistics" and "Results Status" pages. The data I'm pulling from these pages is everything but the headings, graphics and links. I'm also pulling data from the Member and Team Statistics pages using &xml=true.


For the "My Contribution" stats - how come you don't use: http://www.worldcommunitygrid.org/help/viewTopic.do?shortName=profile#335 ?

If I were to make available something similar to the verification URL, but returning data from the result status page, would you use that instead of scraping the result status page? I'm thinking of something like:

http://www.worldcommunitygrid.org/verifyMembe...amp;code=VERIFICATIONCODE

with optional parameters for
project
status of result (valid, invalid, pending verification, pending validation, etc)
[Nov 13, 2013 6:57:20 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Screen Scrapers - Please Discuss

I've switched All Time Stats and Last Result Returned to XML :) but I can't make XML work for the UK Team Project Stats :(

Any suggestions?


Which page is the 'UK Team Project Stats'? I can generate this one: http://www.worldcommunitygrid.org/team/viewTe...=L721SPD4BN1&xml=true Were you referring to a different page?
[Nov 13, 2013 7:03:20 PM]
jonnieb-uk
Ace Cruncher
England
Joined: Nov 30, 2011
Post Count: 6105
Status: Offline
Re: Screen Scrapers - Please Discuss

I've switched All Time Stats and Last Result Returned to XML :) but I can't make XML work for the UK Team Project Stats :(

Any suggestions?


Which page is the 'UK Team Project Stats'? I can generate this one: http://www.worldcommunitygrid.org/team/viewTe...=L721SPD4BN1&xml=true Were you referring to a different page?


That looks fine, Kevin, thank you :) If there are any problems I'll let you know.




That's what I have been using, and the addition of &xml=true does not result in XML output.
----------------------------------------

To Join follow this link: Join the UK Team All Welcome! UK Team thread
[Nov 13, 2013 7:16:32 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Screen Scrapers - Please Discuss



That's what I have been using, and the addition of &xml=true does not result in XML output.



Emphasis on the 'if' ;) Before I build, I wanted to make sure it was what you need.
[Nov 13, 2013 7:52:11 PM]
pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Status: Offline
Re: Screen Scrapers - Please Discuss

We rolled out the first of the changes to our website that we will be making frequently over the next 3-6 months. The HTML structure is going to change a fair amount as we do this rework, and screen scraping will not be a reliable way to access data on an ongoing basis during this work stream. I'd like to hear from those of you who are doing screen scraping: let us know what you are doing and what data you are going after, and we can see what we can do to help your tools remain stable during these changes.

Everybody probably has a slightly different definition of screen scraping and how they implement it. This seems a particularly apt definition:

Parsing the HTML in generated web pages with programs designed to mine out particular patterns of content. In either guise screen-scraping is an ugly, ad-hoc, last-resort technique that is very likely to break on even minor changes to the format of the data

A lot of the WCG data is available in XML format which is easier to handle and (hopefully) resilient to changes in website design. So for example if you are interested in the AllTime Runtime stats of XtremeSystems team members shown at http://www.worldcommunitygrid.org/team/viewTe...&numRecordsPerPage=10

adding "&xml=true" will provide the same data in XML format which is easily imported into a spreadsheet (in Excel using "from Web" on the Data tab).
http://www.worldcommunitygrid.org/team/viewTe...dsPerPage=10&xml=true
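
As a rough sketch of the step described above, the helper below adds `xml=true` to the query string of any page URL using only the Python standard library. The function name `with_xml_flag` is made up for illustration, and whether a given page actually honours the flag is something to check page by page, as the replies in this thread show.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_xml_flag(url):
    """Return url with xml=true in its query string.

    Any existing xml parameter is replaced; all other parameters are kept
    in their original order.
    """
    parts = urlsplit(url)
    # keep_blank_values so that parameters such as "a=" survive the round trip
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "xml"
    ]
    query.append(("xml", "true"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The resulting URL can then be fetched and parsed as XML instead of scraping the HTML version of the same page.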

Unfortunately, in your example Results Status is not available in XML format. I would have suggested that you use pirogue's utility programme WCGDAWS (World Community Grid Device and Workunit Stats; see thread), but it's broken until updated for Friday's changes.
I'm parsing the HTML from the results pages. This isn't what was broken. Stupidly on my part, a missing "*" brought everything to a grinding halt. I was using what I thought was a good indicator of a successful login to tell whether someone was logged in. (blushing) I changed it to look for another, hopefully more reliable, indicator.

Having a login mechanism to get at personal result data in XML would be a good thing. Doing so would probably eliminate the need for wcgdaws (also possibly a good thing, at least for me ;)), but I can live with that.

I'd be willing to help test it.
----------------------------------------

[Nov 13, 2013 8:14:32 PM]