UBot Underground

Newbie Question



I am almost embarrassed to ask this question, but I am getting nowhere after searching and testing ideas of my own.

 

I'm dipping my toe into UBot with the Community edition, having had some rudimentary experience with iMacros. In other words, I'm not a total newbie to scripting, but I'm far from skilled.

 

I set myself the challenge of making a simple Yell.co.uk scraper, and I am stuck on scraping elements that may or may not be present. Every entry on the Yell results page has a company title, URL and phone number, and scraping those is no problem. But what happens when an entry has an optional field like a website (some companies list their website, some don't)? The add list to list approach doesn't work. I'm obviously missing something very basic, but I can't find a tutorial that explains how to handle it. Can someone point me to a video or some other resource I can use to teach myself this?

 

Thanks in anticipation.


You almost never want to scrape multiple fields into separate lists with something like add list to list, for this very reason. One missing field shifts every later value out of line with its record, so your data is invalid. In this case you want to scrape the outer container of each result and then extract the information from inside it. I think regex is the best way to achieve this, because once you have the container's HTML you can easily pull out just what you need.

 

Here is a basic example of that, which gets the business name and phone number from each search result. It was written against yell.com, so the selectors may need changing for yell.co.uk.

clear list(%containers)
add list to list(%containers,$scrape attribute(<(tagname="div" AND class=w"*js-LocalBusiness")>,"outerhtml"),"Delete","Global")
clear table(&data)
set(#row,0,"Global")
loop($list total(%containers)) {
    set(#container,$next list item(%containers),"Global")
    set(#businessName,$find regular expression(#container,"(?<=itemprop=\\\"name\\\"\\>).*?(?=\\<\\/h2)"),"Global")
    set table cell(&data,#row,0,#businessName)
    set(#telephone,$find regular expression(#container,"(?<=itemprop=\\\"telephone\\\"\\>).*?(?=\\<\\/)"),"Global")
    set table cell(&data,#row,1,#telephone)
    increment(#row)
}
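
To show why this matters for an optional field like the website, here is a minimal sketch of the same idea in Python rather than UBot script. The HTML is made up for illustration (the class name and itemprop attributes are assumptions, not Yell's real markup), and `first_match`, `scrape_flat` and `scrape_by_container` are hypothetical helper names. It compares scraping each field into its own list with scraping per container.

```python
import re

# Hypothetical, simplified markup: the second business has no website link.
RESULTS_HTML = """
<div class="result js-LocalBusiness"><h2 itemprop="name">Acme Plumbing</h2><span itemprop="telephone">020 7946 0001</span><a itemprop="url" href="http://acme.example">Website</a></div>
<div class="result js-LocalBusiness"><h2 itemprop="name">Bob's Boilers</h2><span itemprop="telephone">020 7946 0002</span></div>
<div class="result js-LocalBusiness"><h2 itemprop="name">City Drains</h2><span itemprop="telephone">020 7946 0003</span><a itemprop="url" href="http://citydrains.example">Website</a></div>
"""


def first_match(pattern, text):
    """Return the first captured group, or an empty string if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


def scrape_flat(html):
    """Scrape each field into its own list over the whole page (the fragile way)."""
    names = re.findall(r'itemprop="name">(.*?)</h2', html)
    websites = re.findall(r'itemprop="url" href="(.*?)"', html)
    return names, websites


def scrape_by_container(html):
    """Scrape one container per result, then extract the fields inside each one."""
    # Assumes result divs are not nested, so the first </div> closes the container.
    containers = re.findall(r'<div class="[^"]*js-LocalBusiness">.*?</div>', html, re.S)
    rows = []
    for container in containers:
        name = first_match(r'itemprop="name">(.*?)</h2', container)
        telephone = first_match(r'itemprop="telephone">(.*?)<', container)
        website = first_match(r'itemprop="url" href="(.*?)"', container)
        rows.append((name, telephone, website))
    return rows


if __name__ == "__main__":
    names, websites = scrape_flat(RESULTS_HTML)
    print(len(names), "names but only", len(websites), "websites")
    for row in scrape_by_container(RESULTS_HTML):
        print(row)
```

With the flat lists, the second website in the list belongs to the third business, so pairing by position gives the wrong answer. With containers, the business that has no website just gets an empty cell and every other row stays lined up.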

Follow up comment: that answer was so very helpful. It took me a while to understand what was going on (no exposure to regex before now) but with some research and plodding through I finally got it, and now have a working bot. More importantly, I get the principle and have been able to make a few other bots with the same idea. Thanks again!

