UBot Underground

Is it possible to loop with regular expressions?



Hi,

I'm trying to set up a regular expression to scrape a chunk of HTML between two set points. However, between these two points is an irregular number of lines. I've watched the Regex video and I can see a way to do it, but I'm looking for a foolproof way.

 

I tried (.*\n)* followed by a lookahead assertion, but it didn't work.

Without the lookahead assertion it did work, but it obviously pulled in everything after the initial string I specified.

 

Can anyone give me any suggestions on how to do this please?

 

Best regards Steve
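
A minimal Python sketch of why the greedy attempt above can misbehave. It assumes the lookahead was placed directly after (.*\n)*; the START and END markers and the sample text are made up for illustration. A greedy repeat runs to the last occurrence of the end marker, while a lazy repeat stops at the first one.

```python
import re

# Hypothetical sample text; START and END stand in for the two set points.
text = "START\nline a\nline b\nEND\nline c\nEND\n"

# Greedy: (?:.*\n)* grabs as many lines as it can, then backtracks only
# until the lookahead succeeds, so it stops before the LAST "END".
greedy = re.search(r"START\n((?:.*\n)*)(?=END)", text).group(1)

# Lazy: *? takes as few lines as possible, so it stops before the FIRST "END".
lazy = re.search(r"START\n((?:.*\n)*?)(?=END)", text).group(1)

print(repr(greedy))
print(repr(lazy))
```

The only difference between the two patterns is the extra ? after the *, which is usually the simplest fix when a capture overshoots.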


If RegEx proves to be too difficult to write the way you want it, why don't you split the logic into two or more steps and apply Regex sequentially?

 

What I mean is.. scrape everything FROM one point till the end first and assign the text to a variable.
Then use $replace regular expression on that variable, matching from the END point you seek onward, and replace the match with $nothing. That clips your text where you want it.
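
The two-step idea above can be sketched in Python (not UBot script). The START and END markers and the sample page are hypothetical; re.sub with an empty replacement plays the role of $replace with $nothing.

```python
import re

# Hypothetical page text with made-up START and END markers.
page = "header\nSTART\nrow 1\nrow 2\nEND\nfooter\n"

# Step 1: keep everything from the start marker to the end of the page.
from_start = re.search(r"START.*", page, flags=re.DOTALL).group(0)

# Step 2: replace everything from the first end marker onward with nothing.
clipped = re.sub(r"END.*", "", from_start, flags=re.DOTALL)

print(repr(clipped))
```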


Okay thanks for the responses folks. 

I tried to attach the file, but I can't add it to the original post and I can't see any way to attach a file to this response, so here's a section of it below. Basically, I'm trying to pull each record from the Yellow Pages using <div class="parentListing, which seems to be the only way to guarantee you get all the data for a record. I'm then planning to slice and dice it in UBot to get the relevant fields I need. This also seems to be the way to ensure I get the correct data for each record, as a simple scrape puts all the data out of sync where records are missing some details (e.g. a company with no web address).

 

However the data between the two points has a different number of lines depending on the record. 

 

I've tried setting up an "if then" regex as well like this  <div class="parentListing.*\n(?(?=<div class="parentListing)|(.*\n{0,5}))

 

However it doesn't work. I'm guessing that it doesn't terminate when the "if" comes back as true but just keeps on going. Not sure if there is a way to stop it doing this?
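
One way to avoid the conditional altogether, sketched in Python under the assumption that every record starts with the same <div class="parentListing text: match lazily from one record start up to the next record start or the end of the input. Python's re module does not accept a lookahead as the condition in (?(...)...), so this sketch uses a plain lookahead instead. The sample HTML is made up.

```python
import re

# Hypothetical, heavily shortened stand-in for the scraped page.
html = (
    'junk <div class="parentListing a">one\nline</div>\n'
    '<div class="parentListing b">two</div>\n'
    '<div class="parentListing c">three\nmore\nlines</div> tail'
)

# Lazy match from a record start up to (not including) the next record
# start, or the end of the string for the last record.
record = re.compile(
    r'<div class="parentListing.*?(?=<div class="parentListing|\Z)',
    flags=re.DOTALL,
)
records = record.findall(html)

for r in records:
    print(repr(r))
```

Because the terminator is a lookahead, it is not consumed, so the next match can start on the very same text.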

 

 

 

 <div class="parentListing ui-draggable" id="ad6435397__FLE" data-shortlistid="nat6435397" data-natid="6435397">                            <div class=" vcard padding clearfix"> <div class="vcard-header clearfix">   <h2 class="coName  
   inline-nostars
     sl"> <a class="fn org" data-omniture="LIST:COMPANYNAME" title="View Timeless Elegance Bridal Salon Ltd" href="/b/Timeless+Elegance+Bridal+Salon+Ltd-Bridal+Shops-cardiff-CF101BD-6435397/index.html" id="omnitureID2_ad6435397__FLE"> Timeless Elegance Bridal Salon Ltd </a>    </h2>  <div class="rating-container">   <a class="add-review" rel="nofollow" href="http://www.yell.com/reviews/places/addreview/id/6435397" data-omniture="LIST:WRITEREVIEW"> Be the first to add a review <span class="offscreen">Timeless Elegance Bridal Salon Ltd</span> </a>  </div>  </div>                               <div class="vcard-footer clearfix ">                   <div class="address address_with_pin g_pin"><span class="address_pin pin_12" title="Map position 12"></span>                                               <span class="adr">  <a title="View map of Timeless Elegance Bridal Salon Ltd" data-omniture="LIST:ADDRESS" class="tabLink expandLink" href="/b/Timeless+Elegance+Bridal+Salon+Ltd-Bridal+Shops-Cardiff-CF101BD-6435397/map-directions.html">   <span class="street-address">  first floor,  St. Johns Chambers, High St Arcade,     </span> <span class="locality"><strong>Cardiff</strong></span>    , <span class="postal-code">CF10  1BD</span>     </a>  </span>                             <div title="Distance: < 0.1 miles SE" class="direction"> < 0.1   miles   SE </div>   </div>                 <div class="tel-single">                           <span class="tel-nowrap">Tel: <span class="tel">029 2022 9915</span></span>          </div> </div> <div class="business-info clearfix ">         <div class="keywords snippet"><div class="snippet"><strong>“</strong>...and colours,  <strong>bridal</strong>  and bridesmaid... ...size and informal  <strong>bridal</strong>  wear at... ...an in house  <strong>Bridal</strong>  alteration service.... 
...you into our <strong>”</strong><a onclick="s.tl(this, 'o', 'SNIPPET:BIP');" href="/b/Timeless+Elegance+Bridal+Salon+Ltd-Bridal+Shops-cardiff-CF101BD-6435397/index.html" class="bipSnippet">More from profile</a></div>          <p>              <a href="/ucs/UcsSearchAction.do?M=&filterDistance=&keywords=bridal+shops&companyName=&searchType=refinedclassification&ooa=&location=cardiff&layout=&auto=&boost=&scrambleSeed=13781029&broaderLocation=&selectedClassification=Bridal+Shops&advanced=true&filterOH=Anyday&useOH=&startTime=09%3A00&endTime=17%3A00&intCam=capclass&atoz=" title="Show only those results in Bridal Shops">Bridal Shops</a>      </p>     </div> <ul class="cta clearfix">    <li class="website"> <a class="url" target="_blank" href="http://www.timelesselegancebridal.co.uk" id="omnitureID2_ad6435397__FLE" title="Visit Timeless Elegance Bridal Salon Ltd's website" data-omniture="FLE:WL"> <span class="cta-icon cta-only"></span>Visit<span class="accessibleHide"> Timeless Elegance Bridal Salon Ltd's</span> website
</a> </li>         <li class="shortlist"><a class="saveListing" title="Add to shortlist" href="javascript:;"><span class="cta-icon cta-only"></span>Add to shortlist</a></li></ul></div> </div>     <div id="notNatad6435397__FLE" class="notNat clearfix">                                                                                                                                                                                      <ul class="tabbed">  <li class="summaryTL"><a class="tabLink" href="/b/Timeless+Elegance+Bridal+Salon+Ltd-Bridal+Shops-Cardiff-CF101BD-6435397/index.html" data-omniture="TAB:ABOUTUS">About us<span class="accessibleHide"> (Timeless Elegance Bridal Salon Ltd)</span></a></li>       <li class="mapTL"><a class="tabLink" href="/b/Timeless+Elegance+Bridal+Salon+Ltd-Bridal+Shops-Cardiff-CF101BD-6435397/map-directions.html" data-omniture="TAB:MAP">Map  & Directions</a><span class="accessibleHide"> (Timeless Elegance Bridal Salon Ltd)</span></li>     <li class="contactTL"> <a class="tabLink" href="/b/Timeless+Elegance+Bridal+Salon+Ltd-Bridal+Shops-Cardiff-CF101BD-6435397/contactus.html" data-omniture="TAB:CONTACTUS">Opening hours<span class="accessibleHide"> (Timeless Elegance Bridal Salon Ltd)</span></a> </li>   </ul>     <div class="ajaxPod" id="ajaxPodad6435397__FLE"></div> </div>  <div class="listing-footer-container"> <div class="listing-footer clearfix offscreen">           <a title="More information" id="omnitureID_ad6435397__FLE" class="tabLink expandLink MoreLink doAjax" href="/b/Timeless+Elegance+Bridal+Salon+Ltd-Bridal+Shops-cardiff-CF101BD-6435397/index.html" target="_top" data-omniture="LIST">More information<span class="accessibleHide">Timeless Elegance Bridal Salon Ltd</span><span class="moreCloseArrow"></span></a>   <ul class="share clearfix"><li><a href="javascript:;" class="email shareEmail"><span></span>Email</a></li></ul></div> </div>    </div> <div class="pusherDiv"></div>                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           <div class="parentListing ui-draggable" id="ad3509373__FLE" data-shortlistid="nat3509373" data-natid="3509373">                            <div class=" vcard padding clearfix"> <div class="vcard-header clearfix">   <h2 class="coName  
Edited by smb1970
  1. First, scrape ALL the relevant data you want. Don't overthink or over-complicate the scraping process itself.
  2. Next, manipulate the string within UBS, getting rid of the unwanted pieces of content.

Good luck!


Thanks for the suggestion. It still doesn't help unless you are implying that I scrape the entire page of html code from the first occurrence of my string to the bottom of the page - as that's the only way given my current understanding of regular expressions. 

Best regards Steve


Okay, after much head scratching and reading of regex pages I finally managed to construct a string

 

(?<=<div class="parentListing ui-draggable")(?s)(.*?)(?=</div> <div class="pusherDiv">)

 

This gets my data out into a list of 15 elements.
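
For comparison, a Python sketch of the same pattern. One difference to be aware of: recent Python versions reject an inline (?s) that is not at the very start of the pattern, so the sketch passes re.DOTALL as a flag instead. The two-record sample is invented and much shorter than the real page.

```python
import re

# Hypothetical two-record sample in the same shape as the real page.
html = (
    '<div class="parentListing ui-draggable" id="a1"> first </div> <div class="pusherDiv"></div>\n'
    '<div class="parentListing ui-draggable" id="a2"> second\nrecord </div> <div class="pusherDiv"></div>'
)

# Lookbehind for the record start, lazy capture, lookahead for the record end.
# re.DOTALL lets "." cross line breaks, like (?s) in the original pattern.
pattern = re.compile(
    r'(?<=<div class="parentListing ui-draggable")(.*?)(?=</div> <div class="pusherDiv">)',
    flags=re.DOTALL,
)
records = pattern.findall(html)

print(len(records))
```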

 

My next head scratch is how do I parse through each of these 15 strings in order, checking for the relevant fields and setting null values if a field is not present in that element so that at the end I have a nice file with 15 rows with all the data lined up.

 

My main problem is that I have no idea how to deal with each element of the array individually. I can't see any commands to take, for example, mylist[0] and parse it out to address[0], country[0], etc.

 

Any suggestions on which commands I should start looking at to do this please? 

 

Best regards Steve
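
A rough Python sketch of the per-record parsing asked about above (not UBot script). The field patterns are assumptions based on the class names visible in the sample HTML earlier in the thread, and the two records and their values are invented. Each field is searched for independently, and a missing field becomes an empty string so every row has the same columns.

```python
import re

# Assumed field patterns, guessed from class names in the sample HTML.
FIELDS = {
    "name": r'class="fn org"[^>]*>\s*(.*?)\s*</a>',
    "postcode": r'class="postal-code">\s*(.*?)\s*</span>',
    "tel": r'class="tel">\s*(.*?)\s*</span>',
    "website": r'class="url"[^>]*?href="([^"]*)"',
}

# Hypothetical records; the second one has no postcode and no website.
records = [
    '<a class="fn org" href="/b/x"> Acme Bridal </a> '
    '<span class="postal-code">CF10 1BD</span> '
    '<span class="tel">029 0000 0000</span> '
    '<a class="url" target="_blank" href="http://example.com">',
    '<a class="fn org" href="/b/y"> No Web Ltd </a> '
    '<span class="tel">029 1111 1111</span>',
]


def parse_record(record):
    """Return one row with every field present; missing fields are empty."""
    row = {}
    for field, pattern in FIELDS.items():
        match = re.search(pattern, record, flags=re.DOTALL)
        row[field] = match.group(1) if match else ""
    return row


rows = [parse_record(r) for r in records]
for row in rows:
    print(row)
```

Because each row always has the same keys, the rows stay lined up even when a listing has no website or postcode.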


It looks like you KNOW programming, but you didn't take the time to watch the UBS tutorial and example videos or read the Help files?

 

Your questions suggest that you would know what to do if you had watched these first and learned about UBS's available commands and functions.

 

I understand how exciting it can be to dive head-first into the seemingly Caribbean-blue pool, but beware: the water may be too shallow and you could get hurt.

 

Take your time, sacrifice a working day and spend it looking at what other people have already done (and how), then start again from there.

 

Good luck to you!

 

P.S.

On topic, per se... You need to manipulate the data you scraped.

  • If you keep the scrape as it is (you probably added it to a variable, I guess...) then you need to transform that data into a list using $list from text and the $new line as delimiter.
  • If you change the scraping method, just use ADD LIST TO LIST command instead of the SET command, with the same regex you already used and automa(t/g)ically you will have the list directly.

Once you have the list, loop through it with the LOOP (or LOOP WHILE) command and further manipulate your data, using $replace with $nothing for unwanted pieces of text you want to trim, for EACH list element.

 

And from here... you'll probably know what to do...
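
The list-then-loop flow described above, sketched in Python rather than UBot commands. The scraped text and the tag-stripping pattern are invented for illustration: splitting on new lines stands in for $list from text, and re.sub with an empty string stands in for $replace with $nothing.

```python
import re

# Hypothetical scraped text, one item per line.
scraped = "<b>Alpha</b>\n<b>Beta</b>\n\n<i>Gamma</i>"

# Turn the text into a list using the new line as delimiter, skipping blanks.
items = [line for line in scraped.split("\n") if line]

# Loop through the list and strip the unwanted pieces from EACH element.
cleaned = []
for item in items:
    cleaned.append(re.sub(r"<[^>]+>", "", item))

print(cleaned)
```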


Thanks for the tips.

 

Yes I can program and yes I did just jump straight in, I learn better that way, which I agree can be frustrating for the people I ask for support. My apologies if it's annoying.

 

I have watched some of the videos but I don't consume video tuition particularly well, especially not when I would describe it as verbose. If there was a ubot book I would happily sit down and read through it in an hour or two. However watching a 20 minute video to glean information that can be conveyed in a couple of diagrams and a few lines of explanatory text isn't my idea of time well spent. It's one of my pet hates of the internet that more and more information is being shared via video - which delivers information at the speed of the teacher and not the speed of the student.

 

I'll try playing around with the suggestions you've given me.

Best regards Steve


Excellent, much appreciated. It would be useful if that was offered as an alternative to the video tutorials (for example, at the top of the video tutorial page).

 

Best regards Steve


Thanks to everyone who provided assistance on this thread. I've subsequently managed to write my first UBot, a Yellow Pages scraper, which works nicely... until you get to page 10 and realise they limit the number of results returned.

 

Looks like I'll have to learn how to read a file and narrow down the search areas in order to scrape more data.

Best regards Steve
