UBot Underground

Simple (I Think) Scraping Challenge



Hi,
 
I'm fairly new to this and am struggling with what is likely a simple issue to anyone with experience.
 
On this page, http://bazoogle3.com/testscrape2/, I am trying to scrape the data in the following positions. The same positions appear on multiple pages laid out like the example page:
 
1) "Visits" "This period", which is 136 in this example
2) "Page Views" "This period", which is 295 in this example
3) "Mobile Visits" "This period", which is 62 in this example
 
You can view the source of the page, but the relevant sections are:
 
For 1 and 2 

<table class="trafficSummary">
<tbody><tr>
<!-- Summary data -->
<td class="summaryData">
<p class="date">June 2017<span></span></p>
<table class="stat_table">
<tbody><tr class="at">
<td class="stat_label"> </td>
<td class="label">THIS PERIOD:</td>
<td class="label">MOST RECENT 12 MONTHS:</td>
</tr>
</tbody></table>
<table class="blue stat_table">
<tbody><tr>
<td class="stat_label"><h3>VISITS</h3></td>
<td class="num">136</td>
<td class="num">4,441</td>
</tr>
</tbody></table>
<hr>
<table class="green">
<tbody><tr>
<td class="stat_label"><h3>PAGE VIEWS</h3></td>
<td class="num">295</td>
<td class="num">8,923</td>
</tr>
</tbody></table>
<p>
Visits represent the number of potential clients who visited your 
website or blog. Page Views are the total pages they viewed.


<br><br><i>Current month not included in 12-month totals or graphs. Data current within 72 hours.</i>


</p>
</td>
<!-- End summary data -->
For #3 
<table class="trafficSummary">
<tbody><tr>
<td class="summaryData">
<p class="date">June 2017<span></span></p>
<table class="stat_table">
<tbody><tr class="at">
<td class="stat_label"> </td>
<td class="label">THIS PERIOD:</td>
<td class="label">MOST RECENT 12 MONTHS:</td>
</tr>
</tbody></table>
<table class="blue stat_table">
<tbody><tr>
<td class="stat_label"><h3>MOBILE VISITS</h3></td>
<td class="num">62</td>
<td class="num">1,244</td>
</tr>
</tbody></table>
<hr>
<table class="green">
<tbody><tr>
<td class="stat_label"><h3>PERCENT</h3></td>
<td class="num">46%</td>
<td class="num">28%</td>
</tr>
</tbody></table>
<p>
Mobile visits include visits from mobile phone and tablet devices.
The percent displayed represents how many total visits were from mobile devices.


<br><br><i>Current month not included in 12-month totals or graphs. Data current within 72 hours.</i>


</p>
</td>


<td class="chart">
<h3>MOBILE VISITS</h3>
<div class="chart_img">
<img src="index_files/mobileVisitsChart-3779360-201706.png" alt="Visits / Page Views">
</div>
</td>
</tr>
</tbody></table>

For the first step, I am trying this code to scrape "Visits" "This period". I tried using the class selector and then adding a wildcard.
 
navigate("http://bazoogle3.com/testscrape2/","Wait")
set(#var1,$scrape attribute(<class=w"*">,"innertext"),"Global")
 
But, as you can see in the debugger, it returns a lot more data than the targeted 136 :(
 
Can anyone provide some guidance, direction, or a solution to successfully scrape the 3 pieces of data that I seek? Note again that I am going to be loading multiple similar pages that have different data in each of those positions.
 
Note that I just purchased the Ex Browser plugin (but have not even opened it yet) so if there is a better solution using that please don't hesitate to offer the associated guidance.
 
Thanks very much!
Chris

Hi, for this kind of page, the easiest way is to use an XPath parser.

You can use the Free Xpath Plugin by Dan.

http://network.ubotstudio.com/forum/index.php/topic/19449-free-xpath-plugin/

 

Here is the code:

navigate("http://bazoogle3.com/testscrape2/","Wait")
set(#var1,$plugin function("XpathPlugin.dll", "$Generic Xpath Parser", $document text, "//td[@class=\'stat_label\']/h3[contains(text(),\'VISITS\')]/../../td[2]", "innertext", ""),"Global")
alert(#var1)

and here are the results: 

http://i.imgur.com/gjWkIua.png
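
For anyone following along without the plugin, here is a minimal Python sketch of the same row logic. It is not the plugin's method: it swaps XPath for the standard-library html.parser and keys on the stat_label and num classes from the markup in the question. The SAMPLE string is a trimmed copy of that markup, and the names StatParser and scrape_stats are made up for illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (an assumption, not the live page).
SAMPLE = """
<table class="stat_table"><tbody><tr class="at">
<td class="stat_label"> </td>
<td class="label">THIS PERIOD:</td>
<td class="label">MOST RECENT 12 MONTHS:</td>
</tr></tbody></table>
<table class="blue stat_table"><tbody><tr>
<td class="stat_label"><h3>VISITS</h3></td>
<td class="num">136</td>
<td class="num">4,441</td>
</tr></tbody></table>
<hr>
<table class="green"><tbody><tr>
<td class="stat_label"><h3>PAGE VIEWS</h3></td>
<td class="num">295</td>
<td class="num">8,923</td>
</tr></tbody></table>
<table class="blue stat_table"><tbody><tr>
<td class="stat_label"><h3>MOBILE VISITS</h3></td>
<td class="num">62</td>
<td class="num">1,244</td>
</tr></tbody></table>
"""


class StatParser(HTMLParser):
    """Hypothetical helper: maps each row label to its first 'num' cell."""

    def __init__(self):
        super().__init__()
        self.stats = {}       # label -> "this period" value
        self._label = ""      # text of the current row's stat_label cell
        self._nums = []       # text of the current row's num cells
        self._target = None   # which cell kind we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._label, self._nums = "", []
        elif tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            if "stat_label" in classes:
                self._target = "label"
            elif "num" in classes:
                self._nums.append("")
                self._target = "num"
            else:
                self._target = None

    def handle_endtag(self, tag):
        if tag == "td":
            self._target = None
        elif tag == "tr":
            label = self._label.strip()
            # The header row has an empty label and no num cells, so it is skipped.
            if label and self._nums:
                self.stats[label] = self._nums[0].strip()
            self._label, self._nums = "", []

    def handle_data(self, data):
        if self._target == "label":
            self._label += data
        elif self._target == "num":
            self._nums[-1] += data


def scrape_stats(html):
    parser = StatParser()
    parser.feed(html)
    parser.close()
    return parser.stats


stats = scrape_stats(SAMPLE)
print(stats["VISITS"], stats["PAGE VIEWS"], stats["MOBILE VISITS"])
```

Because the lookup is by label rather than by position, it should keep working on sibling pages as long as the labels and class names stay the same, which is an assumption about those pages.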

 

Hope it helps. 


Hey I just quickly ran it through x path pro...  is it possible for you to show me how I would isolate each of those?

e.g. if I just want the 136 (Visits), or just the 295 (Page Views), or just the 106 (Search Visits), etc.

https://www.screencast.com/t/zdV3g2RXD 

 

any additional help would be appreciated.

Thanks!
Chris


Yes you can isolate each of those. 

You can do it in two ways:

 

1. based on sequence 

results no 1
(//td[@class='stat_label']/h3[contains(text(),'VISITS')]/../../td[2])[1]

results no 2
(//td[@class='stat_label']/h3[contains(text(),'VISITS')]/../../td[2])[2]

results no 3
(//td[@class='stat_label']/h3[contains(text(),'VISITS')]/../../td[2])[3]

2. based on h3 text --> use //td[@class='chart']/h3 as the starting point, then continue the XPath to the destination element

http://i.imgur.com/3HsaAul.png
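
Both ideas can be tried outside UBot. The sketch below is a rough Python illustration under stated assumptions: SAMPLE is a trimmed, well-formed copy of the rows from the question (a real page would usually need an HTML parser first), and this_period and by_sequence are names invented here. Python's xml.etree.ElementTree supports only a subset of XPath and has no contains(), so the sequence-based variant filters the rows in Python, and the label-based variant matches the h3 text exactly in the stat_label cell rather than starting from td.chart.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the page (an assumption, not the live markup).
SAMPLE = """
<div>
  <table class="blue stat_table"><tr>
    <td class="stat_label"><h3>VISITS</h3></td>
    <td class="num">136</td><td class="num">4,441</td>
  </tr></table>
  <table class="green"><tr>
    <td class="stat_label"><h3>PAGE VIEWS</h3></td>
    <td class="num">295</td><td class="num">8,923</td>
  </tr></table>
  <table class="blue stat_table"><tr>
    <td class="stat_label"><h3>MOBILE VISITS</h3></td>
    <td class="num">62</td><td class="num">1,244</td>
  </tr></table>
</div>
"""

root = ET.fromstring(SAMPLE)


def this_period(label):
    """Label-based: second <td> of the row whose <h3> text equals label exactly."""
    cell = root.find(f".//td[h3='{label}']/../td[2]")
    return cell.text if cell is not None else None


# Sequence-based: every row whose <h3> text contains "VISITS", in document order.
# This mirrors contains(text(),'VISITS'), so "MOBILE VISITS" matches as well.
by_sequence = [
    row.findall("td")[1].text
    for row in root.iter("tr")
    if "VISITS" in (row.findtext("td/h3") or "")
]

print(by_sequence)                 # pick one by position, e.g. by_sequence[0]
print(this_period("PAGE VIEWS"))   # pick one by its exact label
```

The exact-label lookup does not depend on how many other "VISITS" rows a page has, so it is likely the safer choice when the set of rows varies from page to page. The positional lookup is shorter but breaks if a row is added or removed.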


You, my friend, are awesome. Before I saw your response, I went through Dan's XPath training, and your examples/solutions helped bring it all together. Thank you SO much for taking the time to help.

I sincerely appreciate it.

