Having Trouble With This Multiline Html Extraction

dyvel · January 2, 2015

Hi

I'm looking for advice on solving this little problem.

I want to extract "Firstname Lastname" from the HTML example below. I'm no regex expert, but I tried this:

(?<=<h3>).*?(?=</h3>) - I also tried using a wildcard, but I think it's the line breaks/whitespace that's throwing me off...

<h3>
											
		<a href='/person/Firstname+Lastname/1234-City+A?what=12345678&n=1&page=1&sid=a*%5D%5DJEM%25%5CT0%22'>
			
			
			Firstname Lastname													
		</a>
												
</h3>

Thanks

Bill · January 2, 2015

This worked
set(#a,"<h3>

<a href=\'/person/Firstname+Lastname/1234-City+A?what=12345678&n=1&page=1&sid=a*%5D%5DJEM%25%5CT0%22\'>


Firstname Lastname
</a>

</h3>","Global")
alert($replace($find regular expression(#a,"(?<=\\w+\\/).*?(?=\\/\\d)"),"+",$new line))
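For readers following along outside UBot, here is a rough Python sketch of the same idea: pull the name out of the href path and turn the "+" into a space. It is only an illustration, not the UBot code above. Python's re module does not accept a variable-width lookbehind like (?<=\w+\/), so a capture group stands in for it, and the shortened href string is a made-up sample based on the one in the question.

```python
import re

# Sample href, shortened from the one in the question (assumption: the name
# always sits between the first path segment and a segment starting with a digit).
href = "/person/Firstname+Lastname/1234-City+A?what=12345678&n=1&page=1"

# \w+/   : the leading path segment, e.g. "person/"
# (.*?)  : lazily capture the name part
# /\d    : stop at the next slash that is followed by a digit
match = re.search(r"\w+/(.*?)/\d", href)

# Replace the URL-style "+" with a space to get a readable name.
name = match.group(1).replace("+", " ") if match else None
print(name)
```

Because the pattern only looks at the href, every matching link on the page yields a result, so it helps to narrow the input text first.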

dyvel · January 2, 2015

Hi Bill

Thank you! I appreciate your help.

It works if I just grab the H3 tag with an offset and use that to extract the name. I'm trying to make it more robust now by using an ID tag and scraping the inner HTML from that. But that results in 3 instances of a href that match your regex, so in return I get the name 4 times. I can't wrap my head around how to narrow it to only look within the H3 tag - is it possible?

If it's any help - here's the code I'm working with

The first set(#phone,20302030,"Global") is just to have a valid number to work with.

set(#phone,20302030,"Global")
ui text box("Phone",#phone)
ui stat monitor("Name: ",#name)
ui stat monitor("Address: ",#address)
ui button("Check") {
    navigate("http://118.tdc.dk/search/go?what={#phone}","Wait")
    wait for browser event("DOM Ready","")
    set(#listing0,$scrape attribute(<id="listing0">,"innerhtml"),"Global")
    set(#name,$replace($find regular expression(#listing0,"(?<=\\w+\\/).*?(?=\\/\\d)"),"+",$new line),"Global")
    set(#address,$scrape attribute(<tagname=r"address">,"innertext"),"Global")
}

Bill · January 3, 2015

See if this works for you

ui text box("Phone",#phone)
ui stat monitor("Name: ","{%first} {%lsname}")
ui stat monitor("Address: ",%address)
ui button("Check") {
    navigate("http://118.tdc.dk/search/go?what={#phone}","Wait")
    wait for browser event("DOM Ready","")
    clear list(%first)
    clear list(%lsname)
    clear list(%address)
    set(#listing0,$scrape attribute(<href=w"/person/*">,"fullhref"),"Global")
    set(#lastname,$find regular expression(#listing0,"(?<=\\+).*?(?=\\/)"),"Global")
    add list to list(%first,$find regular expression(#listing0,"(?<=person\\/).*?(?=\\+)"),"Delete","Global")
    add list to list(%lsname,$list from text($replace(#lastname,"+"," "),"
"),"Delete","Global")
    add list to list(%address,$scrape attribute(<tagname="address">,"innertext"),"Delete","Global")
}

dyvel · January 3, 2015

Thank you Bill for taking the time helping me out. I really appreciate it!

Your code actually worked, but I had some edge cases that gave me problems - e.g. if the number belongs to a company, the href is different, or the result list has more than one entry.

I ended up with this code, which seems to do the trick in every case:

ui text box("Phone",#phone)
ui stat monitor("Name: ",#name)
ui stat monitor("Address: ",#address)
ui button("Check") {
    navigate("http://118.tdc.dk/search/go?what={#phone}","Wait")
    wait for browser event("Page Loaded","")
    set(#listing0,$scrape attribute(<id="listing0">,"innerhtml"),"Global")
    set(#name,$find regular expression(#listing0,"(?<=<h3>)(?s).*?(?=</h3>)"),"Global")
    set(#name,$trim($find regular expression(#name,"(?<=<a .*?>)(?s).*?(?=</a>)")),"Global")
    set(#address,$find regular expression(#listing0,"(?<=<address .*?>)(?s).*?(?=</address>)"),"Global")
}

So I ended up with a solution that scrapes id listing0, runs regex on it to get the information out, and trims the name. I'm a total rookie when it comes to regex, so it's like learning Latin to me.
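The same narrowing steps (grab the h3 block first, then the link text inside it, then trim) can be sketched in Python for comparison. This is a hedged illustration rather than a translation of the UBot script: the listing0 string is an invented stand-in for the scraped innerhtml, the extract function is a hypothetical helper, and capture groups replace the lookbehinds because Python's re module does not allow variable-width lookbehind such as (?<=<a .*?>).

```python
import re

# Invented stand-in for the innerhtml of id="listing0" (assumption).
listing0 = """<div><a href='/map'>Map</a>
<h3>
    <a href='/person/Firstname+Lastname/1234-City+A?what=12345678'>

        Firstname Lastname
    </a>
</h3>
<address class='adr'>Some Street 1
1234 City A</address></div>"""


def extract(listing_html):
    """Return (name, address) from one listing, or None for a missing part."""
    # Step 1: limit the search to the h3 block; (?s) lets the dot cross lines.
    h3 = re.search(r"(?s)<h3>(.*?)</h3>", listing_html)
    if h3 is None:
        return None, None
    # Step 2: take the link text inside the h3 only, so other links are ignored.
    link = re.search(r"(?s)<a\s[^>]*>(.*?)</a>", h3.group(1))
    # Step 3: trim the surrounding whitespace and line breaks.
    name = link.group(1).strip() if link else h3.group(1).strip()
    # The address is taken from the whole listing, as in the script above.
    addr = re.search(r"(?s)<address\b[^>]*>(.*?)</address>", listing_html)
    address = addr.group(1).strip() if addr else None
    return name, address


name, address = extract(listing0)
print(name)
print(address)
```

Searching inside the h3 match instead of the whole listing is what keeps the unrelated links (the '/map' one in this sample) from producing extra results.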

deliter · February 1, 2015

(?s)

Put this in your regex. It makes the dot match line breaks as well, so the pattern can match across multiple lines.
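A small Python check of that flag, using the original pattern from the first post. The html string is a shortened, made-up copy of the sample in the question. One difference to be aware of: Python 3.11 and later require an inline flag like (?s) to sit at the very start of the pattern, so it is written there in this sketch.

```python
import re

# Shortened copy of the sample HTML from the question (assumption).
html = (
    "<h3>\n"
    "\t\t<a href='/person/Firstname+Lastname/1234-City+A'>\n"
    "\t\t\tFirstname Lastname\n"
    "\t\t</a>\n"
    "</h3>"
)

# Without (?s) the dot stops at every line break, so nothing can span
# from <h3> to </h3> and the search finds no match.
no_flag = re.search(r"(?<=<h3>).*?(?=</h3>)", html)

# With (?s) the dot also matches "\n", so the lazy .*? can cross lines.
with_flag = re.search(r"(?s)(?<=<h3>).*?(?=</h3>)", html)

print(no_flag)
print(with_flag is not None)
```

The matched text still contains the link markup and whitespace, so a second pass and a trim are needed to get just the name.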
