UBot Underground

Conditional Expression Help Pulling out a Text Block



I have a chunk of HTML where I am trying to pull out the introduction paragraph blocks.  Or exclude the travel details if you want to look at it that way.  The problem is that the number of product detail lines will vary.  There could be 2 lines or there could be 10 lines depending on the travel page.

 

Sample code:

 

        <p class="CenterBodyText">
            <strong>Duration:</strong> 9 days<br>
            <strong>3 Travelers:</strong> $4,250 per person
            <br>
            <strong>2 Travelers:</strong> $4,550 per person
            <br>
            At sed graece putant, et altera maluisset ius. Ea inani viris his, eu vide altera vim, ut pro veri mucius. Inani tation graeco id qui, te has mazim volumus. Agam dictas vel no, habeo soluta suavitate qui an, cu mea omnis tollit definiebas. Et nec quaeque volutpat iudicabit, vel ei fierent scripserit. Id usu quod corrumpit definiebas, qui case latine in, eu vis eius habeo animal. In his quod laoreet, elitr accusata mel in. Homero munere conceptam quo ne. Ne usu agam vitae, at suas argumentum per. Dicit errem accusam sit ne, his saepe veniam laboramus ne, atqui recusabo accommodare an eum.
            <br>
            <strong style="display:block;padding-top:10px;">Private departures for at suas argumentum per. Dicit errem accusam<script src="/js/RightMarginContact.js" type="text/javascript"></script><a href="/contact.htm" onclick="this.blur();" style="color:#330000;">Contact Us</a> to find out more.</strong>
        </p>

 

This is the part I am trying to remove:

 

 <strong>Duration:</strong> 9 days<br>
            <strong>3 Travelers:</strong> $4,250 per person
            <br>
            <strong>2 Travelers:</strong> $4,550 per person
            <br>

 

The "Duration" line should always follow the same format apart from the number of days (or range of days), and it is always on a single line.

 

There can be any number of "Travelers" pricing lines before getting to the meat of the text block.

 

I have tried several approaches using conditional regular expressions but can't get it to work.  Not sure how to approach this.

 

I am using the Standard Edition of Ubot.

 

Thanks for your help.

 

Joe

 

PS: In a perfect world, I would also like to pull out the Duration line and the Travelers pricing lines as separate lists.
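One way the PS could be approached is with two small patterns, one per list. This is only a sketch in Python, not UBot script, and the names `DURATION`, `PRICING`, `durations` and `prices` are made up for illustration. It assumes the labels always look like the sample above (`Duration:` and `N Travelers:` wrapped in `<strong>` tags) and that each value ends at the next tag or line break.

```python
import re

# Shortened copy of the sample markup from the post.
html = """<strong>Duration:</strong> 9 days<br>
<strong>3 Travelers:</strong> $4,250 per person
<br>
<strong>2 Travelers:</strong> $4,550 per person
<br>
Intro text."""

# Assumption: the value runs from the closing </strong> to the next tag or newline.
DURATION = re.compile(r'<strong>Duration:</strong>\s*([^<\r\n]+)')
# Assumption: a singular "1 Traveler:" label might also occur, hence the optional "s".
PRICING = re.compile(r'<strong>(\d+) Travelers?:</strong>\s*([^<\r\n]+)')

# One list for the duration text, one list of (traveler count, price text) pairs.
durations = [d.strip() for d in DURATION.findall(html)]
prices = [(int(count), price.strip()) for count, price in PRICING.findall(html)]
```

Keeping the two lists in separate patterns means a page with 2 pricing lines and a page with 10 are handled the same way.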


Do you need to use a regular expression for this?  It looks like each travel information paragraph itself contains no newlines, so is each separate item delimited by a newline?

 

You could create a list based on everything between the outside paragraph tags with a newline as the delimiter, then remove any line that starts with <br> or <strong>.
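The same logic can be sketched outside UBot. The version below is in Python purely to illustrate the steps (grab the paragraph, split on newlines, drop lines that start with `<br>` or `<strong>`); the function name `intro_lines` and the shortened sample text are assumptions, not anything from the original page.

```python
import re

# Shortened copy of the sample markup from the first post.
html = """<p class="CenterBodyText">
    <strong>Duration:</strong> 9 days<br>
    <strong>3 Travelers:</strong> $4,250 per person
    <br>
    <strong>2 Travelers:</strong> $4,550 per person
    <br>
    At sed graece putant, et altera maluisset ius.
    <br>
</p>"""

def intro_lines(block):
    """Return the lines inside the paragraph that do not start with <br> or <strong>."""
    match = re.search(r'<p class="CenterBodyText">(.*?)</p>', block, re.DOTALL)
    if match is None:
        return []
    kept = []
    for line in match.group(1).splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.lower().startswith(("<br", "<strong")):
            continue  # skip detail lines and bare line breaks
        kept.append(stripped)
    return kept

intro = intro_lines(html)
```

Note that this also drops a trailing line such as the "Private departures" one in the full sample, because it starts with a `<strong>` tag too.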


Awesome Kevin! Thanks! That works.

 

For what it's worth to the community and those who follow the dark art of RegEx, I have been hammering on this expression based on JDJ's recommendation.

 

This selects everything between the opening and closing tags of the primary paragraph.

 

(?<=<p class="CenterBodyText">)([\s\S]*?)(?=</p>)
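For anyone who wants to try that expression outside UBot, here is a rough Python check. It is only a sketch: UBot's regex flavour may differ from Python's `re` module, and the `page` string is a made-up stand-in for a real travel page.

```python
import re

# The expression from the post, unchanged. The lookbehind has a fixed width,
# which Python's re module requires.
PATTERN = re.compile(r'(?<=<p class="CenterBodyText">)([\s\S]*?)(?=</p>)')

# Hypothetical page fragment used only for this test.
page = (
    '<div><p class="CenterBodyText">\n'
    '<strong>Duration:</strong> 9 days<br>\n'
    'Intro text here.\n'
    '</p><p>Other</p></div>'
)

match = PATTERN.search(page)
inner = match.group(1) if match else ""
```

The lazy `*?` is what stops the match at the first `</p>` rather than the last one on the page.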

 

This was a concept to exclude lines with <strong> or <br>.  I couldn't find an expression that worked for the exclusion.

 

(?<=<p class="CenterBodyText">)([\s\S]*?)(?!SOMETHING MAGICAL HERE TO EXCLUDE LINES WITH <STRONG> OR <BR>)(?=</p>)
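One possible shape for the exclusion is to stop trying to do it in a single pass. Take the text captured between the paragraph tags first, then run a second, per-line pattern with a negative lookahead at the start of each line. The sketch below is in Python and relies on its multiline flag; the names `KEEP_LINE`, `inner` and `kept` are illustrative, and whether UBot's regex engine behaves the same way is an assumption that would need checking.

```python
import re

# Text already captured from between the paragraph tags (shortened sample).
inner = """
    <strong>Duration:</strong> 9 days<br>
    <strong>3 Travelers:</strong> $4,250 per person
    <br>
    At sed graece putant, et altera maluisset ius.
    <br>
"""

# ^ ... $ match per line because of re.MULTILINE.
# (?! ... ) rejects any line whose first non-blank text is <strong or <br.
# (\S.*) captures the line without its leading indentation and skips blank lines.
KEEP_LINE = re.compile(
    r'^(?![ \t]*<(?:strong|br)\b)[ \t]*(\S.*)$',
    re.MULTILINE | re.IGNORECASE,
)

kept = [m.group(1).rstrip() for m in KEEP_LINE.finditer(inner)]
```

Splitting the job into two patterns keeps each one short enough to read and maintain.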

 

RegEx is like a really pretty woman that wants nothing to do with you. You see her walking across the room but you'll never touch her.

 

Thanks again,

 

Joe


Glad you got the answer, and sorry if I confused you further. I was suggesting some logic that could be used to do it without bothering with regular expressions at all, but I didn't have time to come up with the code last night.


JDJ - Thanks.  Kevin's solution is pretty good and very appreciated.  I'm not very strong with RegEx so it's hard to know when to abandon that path. I am thinking that if a RegEx has too many assertions, excludes and conditions then you're just getting into the weeds.  Probably too hard to support over time as well.
