gtsitour, posted September 11, 2010

Hi everybody,

I'd like to know how you would handle a situation like the one described below. Imagine the following HTML structure:

<div id="content">
  <h2>First Group of Categories</h2>
  <ul class=grayul>
    <li><a href="/folder/category47-page1.html"><b>Building</b></a> <span class=black>(554)</SPAN> – Found in architecture and building, such as bricks, pavements, tiles etc.
    <li><a href="/folder/category55-page1.html"><b>Frames</b></a> <span class=black>(86)</SPAN> – Picture frame products.
    <li><a href="/folder/category14-page1.html"><b>Misc</b></a> <span class=black>(1113)</SPAN> – Products that do not fit into any other category.
  </ul>
  <h2>Second Group of Categories</h2>
  <ul class=grayul>
    <li><a href="/folder/category44-page1.html"><b>Creative</b></a> <span class=black>(663)</SPAN> – Artistic and Creative products.
    <li><a href="/folder/category45-page1.html"><b>Misc</b></a> <span class=black>(452)</SPAN> – Products that do not fit into any other category.
  </ul>
</div>

Let's say that we want to scrape the href, the title, the count and the description of the lis in the first ul only. As you can see, both uls have the same class name and follow exactly the same structure. The only thing that makes them different is the h2 heading before each of them.

How would you handle such a situation?

Thanks in advance

JohnB, posted September 11, 2010

It looks like the number in the class name can be used to tell the items apart.

gtsitour, posted September 11, 2010

Unfortunately that is not a number; it's a lowercase L. The class name is the same for both uls: "grayul". The only thing that differs between the uls is the leading <h2> tag. Any ideas?

IRobot, posted September 11, 2010

gtsitour wrote: "Let's say that we want to scrape the href, the title, the count and the description of the lis of only the first ul. As you can see both uls have the same class name and follow the exact same structure. The only thing that makes them different is the h2 heading before each of them. How would you handle such a situation?"

Use $page scrape and add the items to a $list. Then process the first item in the list (i.e. the first URL).

gtsitour, posted September 11, 2010

I tried that, but $page scrape gets all the lis from both uls, not just the lis from the first ul that we need. I also tried choosing the first ul by attribute and then using the $scrape chosen command, but no luck. I can't find a way to process/scrape just the first ul.

Seth Turin, posted September 11, 2010

Here's a piece of UBot ninjitsu to try:

choose the first ul by position
scrape the outer html to a variable; we'll call it 'u'
choose the page's body
change the outer html attribute to the variable u

This will isolate only the items you want, so that you can choose them and scrape them the way you normally would.
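
For readers who want to see the same idea outside UBot, here is a minimal Python sketch. It is not UBot script and not what Seth's steps literally do in the browser. It only illustrates the principle of picking the first ul by position and ignoring everything else. The names FirstListScraper and scrape_first_list are made up for this example, and it uses only the standard library's html.parser. It assumes the lists are not nested and that each li follows the a / span / text pattern from the original post.

```python
from html.parser import HTMLParser

# Sample markup copied from the original post (unclosed <li>, unquoted attributes).
HTML = """<div id="content">
<h2>First Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category47-page1.html"><b>Building</b></a> <span class=black>(554)</SPAN> – Found in architecture and building, such as bricks, pavements, tiles etc.
<li><a href="/folder/category55-page1.html"><b>Frames</b></a> <span class=black>(86)</SPAN> – Picture frame products.
<li><a href="/folder/category14-page1.html"><b>Misc</b></a> <span class=black>(1113)</SPAN> – Products that do not fit into any other category.
</ul>
<h2>Second Group of Categories</h2>
<ul class=grayul>
<li><a href="/folder/category44-page1.html"><b>Creative</b></a> <span class=black>(663)</SPAN> – Artistic and Creative products.
<li><a href="/folder/category45-page1.html"><b>Misc</b></a> <span class=black>(452)</SPAN> – Products that do not fit into any other category.
</ul>
</div>"""


class FirstListScraper(HTMLParser):
    """Hypothetical helper: collects fields from the first <ul> only."""

    def __init__(self):
        super().__init__()
        self.ul_seen = 0          # number of <ul> start tags seen so far
        self.in_first_ul = False  # True only while inside the first <ul>
        self.field = None         # field that incoming text is appended to
        self.item = None          # the <li> currently being built
        self.items = []

    def _flush(self):
        # Finish the current <li>, if any, and tidy up its text fields.
        if self.item is not None:
            self.item["count"] = self.item["count"].strip("() \n")
            self.item["description"] = self.item["description"].strip(" \n\t–-")
            self.items.append(self.item)
            self.item = None

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_seen += 1
            self.in_first_ul = self.ul_seen == 1
        elif not self.in_first_ul:
            return
        elif tag == "li":
            # <li> is never closed in the sample, so a new <li> ends the old one.
            self._flush()
            self.item = {"href": "", "title": "", "count": "", "description": ""}
            self.field = "description"
        elif tag == "a" and self.item is not None:
            self.item["href"] = dict(attrs).get("href", "")
            self.field = "title"
        elif tag == "span" and self.item is not None:
            self.field = "count"

    def handle_endtag(self, tag):
        if tag == "ul" and self.in_first_ul:
            self._flush()
            self.in_first_ul = False
        elif tag in ("a", "span"):
            # Text after the link or the count belongs to the description.
            self.field = "description"

    def handle_data(self, data):
        if self.in_first_ul and self.item is not None and self.field:
            self.item[self.field] += data


def scrape_first_list(html):
    """Return a list of dicts for the <li> items of the first <ul> in html."""
    scraper = FirstListScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.items


if __name__ == "__main__":
    for row in scrape_first_list(HTML):
        print(row)
```

Each returned item is a dict with the keys href, title, count and description. The count is kept as a string with the parentheses stripped. To target a different list, change the position check in handle_starttag.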

gtsitour, posted September 11, 2010

This is indeed ninjitsu! :-) LOL ...but it sounds absolutely logical! I can't wait to try it out. I'll report back with the results.

Thanks