cujo56 2 Posted January 16, 2014

I have a chunk of HTML where I am trying to pull out the introduction paragraph blocks. Or exclude the travel details if you want to look at it that way. The problem is that the number of product detail lines will vary. There could be 2 lines or there could be 10 lines depending on the travel page.

Sample code:

<p class="CenterBodyText">
<strong>Duration:</strong> 9 days<br>
<strong>3 Travelers:</strong> $4,250 per person <br>
<strong>2 Travelers:</strong> $4,550 per person <br>
At sed graece putant, et altera maluisset ius. Ea inani viris his, eu vide altera vim, ut pro veri mucius. Inani tation graeco id qui, te has mazim volumus. Agam dictas vel no, habeo soluta suavitate qui an, cu mea omnis tollit definiebas. Et nec quaeque volutpat iudicabit, vel ei fierent scripserit. Id usu quod corrumpit definiebas, qui case latine in, eu vis eius habeo animal. In his quod laoreet, elitr accusata mel in. Homero munere conceptam quo ne. Ne usu agam vitae, at suas argumentum per. Dicit errem accusam sit ne, his saepe veniam laboramus ne, atqui recusabo accommodare an eum.
<br>
<strong style="display:block;padding-top:10px;">Private departures for at suas argumentum per. Dicit errem accusam<script src="/js/RightMarginContact.js" type="text/javascript"></script><a href="/contact.htm" onclick="this.blur();" style="color:#330000;">Contact Us</a> to find out more.</strong>
</p>

This is the part I am trying to remove:

<strong>Duration:</strong> 9 days<br>
<strong>3 Travelers:</strong> $4,250 per person <br>
<strong>2 Travelers:</strong> $4,550 per person <br>

The "Duration" line should be typical except for the number of days or range of days but all on the same line. There can be any number of lines for "Travelers" and pricing before getting to the meat of the text block. I have tried several approaches using conditional regular expressions but can't get it to work. Not sure how to approach this.
I am using the Standard Edition of Ubot. Thanks for your help.

Joe

PS: In a perfect world I want to pull out the Duration line and the Travel Pricing lines as separate lists as well.
JDJ 1 Posted January 17, 2014

Do you need to use a regular expression for this? It looks like each travel information paragraph itself contains no newlines, so each separate item is delimited by a newline? You could create a list based on everything between the outside paragraph tags with a newline as the delimiter, then remove any line that starts with <br> or <strong>.
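The no-regex idea above can be sketched roughly as follows. This is Python rather than UBot script, so treat it as an outline of the logic only. The function name `intro_lines`, the shortened sample text, and the assumption that the page source really does put each item on its own line are all mine, not something confirmed in the thread.

```python
# Hypothetical sample: the HTML from the first post, shortened, and assuming
# the page source puts each item on its own line.
sample = """<p class="CenterBodyText">
<strong>Duration:</strong> 9 days<br>
<strong>3 Travelers:</strong> $4,250 per person <br>
<strong>2 Travelers:</strong> $4,550 per person <br>
At sed graece putant, et altera maluisset ius.
<br>
<strong style="display:block;padding-top:10px;">Private departures</strong>
</p>"""


def intro_lines(page_html):
    """Return the lines inside the paragraph that do not start with <br> or <strong>."""
    start_tag = '<p class="CenterBodyText">'
    start = page_html.find(start_tag)
    end = page_html.find("</p>", start)
    if start == -1 or end == -1:
        return []  # paragraph not found
    inner = page_html[start + len(start_tag):end]

    kept = []
    for line in inner.split("\n"):  # newline as the delimiter
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("<br>") or line.startswith("<strong>"):
            continue  # detail line or bare line break
        kept.append(line)
    return kept


for line in intro_lines(sample):
    print(line)
```

Note that the closing "Private departures" line survives the filter, because it starts with `<strong style=` rather than `<strong>`. If that line should go too, testing for the prefix `<strong` instead would drop it.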
k1lv9h 76 Posted January 17, 2014

Hi,

Sample code: sample-travel-extract-001.ubot

Kevin
cujo56 2 Posted January 17, 2014

Awesome Kevin! Thanks! That works.

For what it's worth to the community and those that follow the dark art of RegEx, I have been hammering on this expression based on JDJ's recommendation. This selects everything inside the primary tag:

(?<=<p class="CenterBodyText">)([\s\S]*?)(?=</p>)

This was a concept to exclude lines with <strong> or <br>. I couldn't find an expression that worked for the exclusion:

(?<=<p class="CenterBodyText">)([\s\S]*?)(?!SOMETHING MAGICAL HERE TO EXCLUDE LINES WITH <STRONG> OR <BR>)(?=</p>)

RegEx is like a really pretty woman that wants nothing to do with you. You see her walking across the room but you'll never touch her.

Thanks again,

Joe
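One way around the missing "exclude" piece is to stop trying to do it in a single expression: grab the paragraph with the first expression above, then peel the detail lines off the front one at a time. Here is a minimal sketch in Python (not UBot script, and not Kevin's attached solution, which isn't shown in the thread). The names `BLOCK_RE`, `DETAIL_RE` and `split_details` are made up for illustration, the sample text is shortened, and it assumes every detail line has the form `<strong>Label:</strong> value <br>`.

```python
import re

# The thread's own expression for everything inside the paragraph tag.
BLOCK_RE = re.compile(r'(?<=<p class="CenterBodyText">)([\s\S]*?)(?=</p>)')

# One detail line, assumed to look like: <strong>Label:</strong> value <br>
DETAIL_RE = re.compile(r'\s*<strong>([^<]+?):</strong>\s*([^<]*?)\s*<br\s*/?>')


def split_details(page_html):
    """Return (details, intro): leading (label, value) pairs and the remaining text."""
    block = BLOCK_RE.search(page_html)
    if block is None:
        return [], ""
    inner = block.group(1)

    details = []
    pos = 0
    while True:
        # match() only succeeds if a detail line starts exactly at pos,
        # so the loop stops at the first thing that is not a detail line.
        m = DETAIL_RE.match(inner, pos)
        if m is None:
            break
        details.append((m.group(1).strip(), m.group(2).strip()))
        pos = m.end()
    return details, inner[pos:].strip()


# Hypothetical, shortened version of the sample from the first post.
page_html = (
    '<p class="CenterBodyText"> '
    '<strong>Duration:</strong> 9 days<br> '
    '<strong>3 Travelers:</strong> $4,250 per person <br> '
    '<strong>2 Travelers:</strong> $4,550 per person <br> '
    'At sed graece putant, et altera maluisset ius. <br> '
    '<strong style="display:block;padding-top:10px;">Private departures</strong> '
    '</p>'
)

details, intro = split_details(page_html)
durations = [value for label, value in details if label == "Duration"]
pricing = [(label, value) for label, value in details if label.endswith("Travelers")]

print(durations)
print(pricing)
print(intro)
```

Because the loop consumes detail lines until one fails to match, it doesn't matter whether a page has 2 pricing lines or 10, and the Duration and Travelers values come out as separate lists along the way.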
JDJ 1 Posted January 18, 2014

Glad you got the answer, and sorry if I confused you further. I was suggesting some logic that could be used to do it without bothering with regular expressions at all, but didn't have time to come up with the code last night.
cujo56 2 Posted January 18, 2014

JDJ - Thanks. Kevin's solution is pretty good and very appreciated. I'm not very strong with RegEx, so it's hard to know when to abandon that path. I am thinking that if a RegEx has too many assertions, excludes and conditions, then you're just getting into the weeds. Probably too hard to support over time as well.