mdc101 Posted March 8, 2014

Hi Guys,

I've been knocking my head against a wall for a few days and cannot figure this out. I have the following page I want to scrape. There are 10 <li></li> tags holding the ranking information I need. Each <li></li> tag is separated by a <hr> tag. I want to scrape the following from each <li></li> tag and add it to a table (highlighted in orange) so I can insert it into a database. How does one scrape this data so all rows stay precise and don't get mixed up?

Thanks
Matt

<li> <b>www.clipinhair.co.za</b> - creation date unknown <span title="Page Rank">(PR: 2)</span> <span title="Domain Authority">(DA: 19)</span> <ul> <li>Google ORV<sup>real</sup> - R 281,927</li> <li>Ranking Multiplier - 0.434</li> <li>Pages Indexed - 90</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.clipinhair.co.za/</td> <td>2</td> <td>32</td> <td> hair extensions (1)<br> </td> </tr> </tbody></table> </li>

Here is the complete file (I was getting errors when attaching it):

<ol> <li> <b>www.clipinhair.co.za</b> - creation date unknown <span title="Page Rank">(PR: 2)</span> <span title="Domain Authority">(DA: 19)</span> <ul> <li>Google ORV<sup>real</sup> - R 281,927</li> <li>Ranking Multiplier - 0.434</li> <li>Pages Indexed - 90</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.clipinhair.co.za/</td> <td>2</td> <td>32</td> <td> hair extensions (1)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.runwayhair.co.za</b> - creation date unknown <span title="Page Rank">(PR: 1)</span> <span title="Domain Authority">(DA: 18)</span> <ul> <li>Google ORV<sup>real</sup> - R
79,456</li> <li>Ranking Multiplier - 0.122</li> <li>Pages Indexed - 43</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.runwayhair.co.za/</td> <td>1</td> <td>31</td> <td> hair extensions (2)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.i-hairextensions.co.za</b> - creation date unknown <span title="Page Rank">(PR: 0)</span> <span title="Domain Authority">(DA: 18)</span> <ul> <li>Google ORV<sup>real</sup> - R 56,272</li> <li>Ranking Multiplier - 0.087</li> <li>Pages Indexed - 30</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.i-hairextensions.co.za/</td> <td>0</td> <td>31</td> <td> hair extensions (3)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.bidorbuy.co.za</b> - creation date unknown <span title="Page Rank">(PR: 5)</span> <span title="Domain Authority">(DA: 64)</span> <ul> <li>Google ORV<sup>real</sup> - R 40,183</li> <li>Ranking Multiplier - 0.062</li> <li>Pages Indexed - 45,300,000</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.bidorbuy.co.za/search/hair+extensions</td> <td>-1</td> <td>37</td> <td> hair extensions (4)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.glamorhair.co.za</b> - creation date unknown <span title="Page Rank">(PR: 0)</span> <span title="Domain Authority">(DA: 12)</span> <ul> <li>Google ORV<sup>real</sup> - R 32,365</li> <li>Ranking Multiplier - 0.050</li> <li>Pages Indexed - 3,050</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th 
title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.glamorhair.co.za/</td> <td>0</td> <td>22</td> <td> hair extensions (5)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.frontrow.co.za</b> - creation date unknown <span title="Page Rank">(PR: 0)</span> <span title="Domain Authority">(DA: 12)</span> <ul> <li>Google ORV<sup>real</sup> - R 26,570</li> <li>Ranking Multiplier - 0.041</li> <li>Pages Indexed - 124</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.frontrow.co.za/</td> <td>0</td> <td>26</td> <td> hair extensions (6)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>hairextensionsjhb.co.za</b> - creation date unknown <span title="Page Rank">(PR: 2)</span> <span title="Domain Authority">(DA: 10)</span> <ul> <li>Google ORV<sup>real</sup> - R 22,496</li> <li>Ranking Multiplier - 0.035</li> <li>Pages Indexed - 0</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://hairextensionsjhb.co.za/</td> <td>2</td> <td>24</td> <td> hair extensions (7)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.glamourize.co.za</b> - creation date unknown <span title="Page Rank">(PR: 1)</span> <span title="Domain Authority">(DA: 19)</span> <ul> <li>Google ORV<sup>real</sup> - R 19,865</li> <li>Ranking Multiplier - 0.031</li> <li>Pages Indexed - 21</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.glamourize.co.za/</td> <td>1</td> <td>28</td> <td> hair extensions (8)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.groupon.co.za</b> - creation date unknown <span 
title="Page Rank">(PR: 6)</span> <span title="Domain Authority">(DA: 41)</span> <ul> <li>Google ORV<sup>real</sup> - R 19,800</li> <li>Ranking Multiplier - 0.030</li> <li>Pages Indexed - 52,100</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.groupon.co.za/coupons/beauty/hairdresser/hair-extension</td> <td>1</td> <td>1</td> <td> hair extensions (10)<br> </td> </tr> </tbody></table> </li> <hr> <li> <b>www.divadivinehair.co.za</b> - creation date unknown <span title="Page Rank">(PR: 2)</span> <span title="Domain Authority">(DA: 30)</span> <ul> <li>Google ORV<sup>real</sup> - R 18,836</li> <li>Ranking Multiplier - 0.029</li> <li>Pages Indexed - 125</li> </ul> <table class="info_table" style="margin-left: 5%; width: 95%"> <tbody><tr> <th>Page</th> <th title="Page Rank">PR</th> <th title="Page Authority">PA</th> <th>Terms (Rank)</th> </tr> <tr> <td>http://www.divadivinehair.co.za/</td> <td>2</td> <td>41</td> <td> hair extensions (9)<br> </td> </tr> </tbody></table> </li> <hr></ol>
UBotBuddy Posted March 8, 2014

It's definitely doable, but I would rather see the website so I can run some tests than work with this HTML.
oricoun Posted March 8, 2014

Scrape the "li" tags locally, mark them with numbers 1 to 10, then do your "thing" and remove the numbers. I hope I understood the issue correctly.
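The "one li at a time" idea can be sketched outside UBot as well. The snippet below is only an illustration in Python, not UBot script and not anyone's posted solution: it splits the pasted HTML on the <hr> separators so that each chunk holds exactly one site, then reads every field from that chunk alone, which is what keeps the rows from getting mixed up. The function names (`first`, `parse_rankings`) and the dictionary keys are made up for this example, and the regular expressions assume the markup looks exactly like the sample in the first post.

```python
import re

def first(pattern, text):
    """Return the first capture group of pattern in text, or None if absent."""
    m = re.search(pattern, text, re.S)
    return m.group(1).strip() if m else None

def parse_rankings(html):
    """Split on <hr> so each chunk is one site, then read fields from that chunk only."""
    rows = []
    for chunk in re.split(r"<hr\s*/?>", html):
        domain = first(r"<b>(.*?)</b>", chunk)
        if domain is None:  # skip leftovers such as the trailing </ol>
            continue
        rows.append({
            "domain": domain,
            "pr": first(r"\(PR:\s*(-?\d+)\)", chunk),
            "da": first(r"\(DA:\s*(-?\d+)\)", chunk),
            "orv": first(r"Google ORV<sup>real</sup>\s*-\s*R\s*([\d,]+)", chunk),
            "multiplier": first(r"Ranking Multiplier\s*-\s*([\d.]+)", chunk),
            "pages_indexed": first(r"Pages Indexed\s*-\s*([\d,]+)", chunk),
            "url": first(r"<td>\s*(https?://[^<\s]+)", chunk),
            "term_rank": first(r"<td>\s*([^<]*?\(\d+\))\s*<br", chunk),
        })
    return rows

# Two entries copied from the first post, shortened only in whitespace and styling.
SAMPLE = """<ol>
<li> <b>www.clipinhair.co.za</b> - creation date unknown
<span title="Page Rank">(PR: 2)</span> <span title="Domain Authority">(DA: 19)</span>
<ul> <li>Google ORV<sup>real</sup> - R 281,927</li> <li>Ranking Multiplier - 0.434</li>
<li>Pages Indexed - 90</li> </ul>
<table class="info_table"> <tbody><tr> <th>Page</th> <th>PR</th> <th>PA</th> <th>Terms (Rank)</th> </tr>
<tr> <td>http://www.clipinhair.co.za/</td> <td>2</td> <td>32</td> <td> hair extensions (1)<br> </td> </tr>
</tbody></table> </li> <hr>
<li> <b>www.runwayhair.co.za</b> - creation date unknown
<span title="Page Rank">(PR: 1)</span> <span title="Domain Authority">(DA: 18)</span>
<ul> <li>Google ORV<sup>real</sup> - R 79,456</li> <li>Ranking Multiplier - 0.122</li>
<li>Pages Indexed - 43</li> </ul>
<table class="info_table"> <tbody><tr> <th>Page</th> <th>PR</th> <th>PA</th> <th>Terms (Rank)</th> </tr>
<tr> <td>http://www.runwayhair.co.za/</td> <td>1</td> <td>31</td> <td> hair extensions (2)<br> </td> </tr>
</tbody></table> </li> <hr></ol>"""

rows = parse_rankings(SAMPLE)
for row in rows:
    print(row)
```

Each dictionary is one table row, so the list can be written to a database one record at a time. Values are kept as strings (for example "281,927") so nothing is lost; converting them to numbers is left to the caller.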
mdc101 Posted March 9, 2014

Hi Guys, thanks for the response.

The application is SaaS-based and you need to log in to access the page, so I can't give a reference URL. If you wrap the <ol> in html, head, and body tags, you will basically get the page output.

Oricoun, I will try your suggestion. UBotBuddy, I would love to see what you do.

Regards
Matt
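For anyone who wants to test against a local copy rather than the login-protected site, the wrapping described above could look roughly like this. It is a minimal Python sketch; `wrap_fragment` and the page title are hypothetical names chosen for the example, and the result is only an approximation of the real page, since the site's own head content and styling are unknown.

```python
def wrap_fragment(fragment, title="ranking test page"):
    """Wrap an HTML fragment (such as the pasted <ol>) in a minimal html/head/body shell."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><meta charset=\"utf-8\"><title>{title}</title></head>\n"
        f"<body>\n{fragment}\n</body>\n"
        "</html>\n"
    )

# A tiny stand-in fragment; in practice, paste the full <ol> from the first post here.
page = wrap_fragment("<ol> <li> <b>www.clipinhair.co.za</b> </li> <hr></ol>")
print(page)
```

Saving the returned string to a local .html file gives a page that can be opened in a browser for testing.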