UBot Underground

Having Trouble With This Multiline HTML Extraction



Hi

 

I'm looking for advice on solving this little problem.

I want to extract "Firstname Lastname" from the HTML example below. I'm no regex expert, but I tried this:

 

(?<=<h3>).*?(?=</h3>) - I also tried using a wildcard, but I think it's the line breaks/whitespace that's throwing me off...

<h3>
											
		<a href='/person/Firstname+Lastname/1234-City+A?what=12345678&n=1&page=1&sid=a*%5D%5DJEM%25%5CT0%22'>
			
			
			Firstname Lastname													
		</a>
												
</h3>

Thanks


This worked:
set(#a,"<h3>
                                            
        <a href=\'/person/Firstname+Lastname/1234-City+A?what=12345678&n=1&page=1&sid=a*%5D%5DJEM%25%5CT0%22\'>
            
            
            Firstname Lastname                                                    
        </a>
                                                
</h3>","Global")
alert($replace($find regular expression(#a,"(?<=\\w+\\/).*?(?=\\/\\d)"),"+",$new line))
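For anyone who wants to test the idea outside UBot, here is a minimal Python sketch of the same approach: take the name from the href path instead of from the link text. It is not the UBot code above. Python's re module only allows fixed-width lookbehind, so a capture group stands in for the (?<=\w+\/) part, and the literal "/person/" prefix is an assumption based on the example href. The function name name_from_href is made up for illustration, and the "+" is replaced with a space here rather than a new line.

```python
import re
from typing import Optional

# Shortened copy of the example markup from the first post.
html = """<h3>
        <a href='/person/Firstname+Lastname/1234-City+A?what=12345678&n=1&page=1'>
            Firstname Lastname
        </a>
</h3>"""


def name_from_href(fragment: str) -> Optional[str]:
    """Return the name encoded in the first /person/<name>/<digit...> href, if any."""
    # Capture the path segment between "/person/" and the "/<digit>" that follows it.
    match = re.search(r"/person/([^/]+)/\d", fragment)
    if match is None:
        return None
    # The href encodes spaces as "+", so turn them back into spaces.
    return match.group(1).replace("+", " ")


print(name_from_href(html))
```

Because the pattern never has to cross a line break, the whitespace around the link text does not matter with this approach.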


Hi Bill

 

Thank you! I appreciate your help.

 

It works if I just grab the H3 tag with an offset and use that to extract the name. I'm now trying to make it more robust by using an ID tag and scraping the inner HTML from that. But that results in 3 instances of a href that match your regex, so I get the name back 4 times. I can't wrap my head around how to narrow it down so it only looks within the H3 tag. Is that possible?

 

If it's any help, here's the code I'm working with:

 

The first set(#phone,20302030,"Global") is just to have a valid number to work with.

set(#phone,20302030,"Global")
ui text box("Phone",#phone)
ui stat monitor("Name: ",#name)
ui stat monitor("Address: ",#address)
ui button("Check") {
    navigate("http://118.tdc.dk/search/go?what={#phone}","Wait")
    wait for browser event("DOM Ready","")
    set(#listing0,$scrape attribute(<id="listing0">,"innerhtml"),"Global")
    set(#name,$replace($find regular expression(#listing0,"(?<=\\w+\\/).*?(?=\\/\\d)"),"+",$new line),"Global")
    set(#address,$scrape attribute(<tagname=r"address">,"innertext"),"Global")
}

See if this works for you

 

ui text box("Phone",#phone)
ui stat monitor("Name: ","{%first} {%lsname}")
ui stat monitor("Address: ",%address)
ui button("Check") {
    navigate("http://118.tdc.dk/search/go?what={#phone}","Wait")
    wait for browser event("DOM Ready","")
    clear list(%first)
    clear list(%lsname)
    clear list(%address)
    set(#listing0,$scrape attribute(<href=w"/person/*">,"fullhref"),"Global")
    set(#lastname,$find regular expression(#listing0,"(?<=\\+).*?(?=\\/)"),"Global")
    add list to list(%first,$find regular expression(#listing0,"(?<=person\\/).*?(?=\\+)"),"Delete","Global")
    add list to list(%lsname,$list from text($replace(#lastname,"+"," "),"
"),"Delete","Global")
    add list to list(%address,$scrape attribute(<tagname="address">,"innertext"),"Delete","Global")
}
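The splitting logic in the script above (first name is everything between "person/" and the first "+", the rest up to the next "/" is the last name with "+" turned into spaces) can be sketched in Python roughly like this. It is only an illustration under the assumption that every href has the /person/First+Last/... shape from the example. The helper name split_name and the second sample URL are hypothetical.

```python
import re
from typing import Optional, Tuple


def split_name(href: str) -> Optional[Tuple[str, str]]:
    """Split the /person/<First>+<Rest>/ segment of an href into (first, last)."""
    match = re.search(r"/person/([^/]+)/", href)
    if match is None:
        # e.g. a company listing whose href does not contain /person/
        return None
    # Split on the first "+" only; any further "+" belongs to the last name.
    first, _, rest = match.group(1).partition("+")
    return first, rest.replace("+", " ")


print(split_name("http://118.tdc.dk/person/Firstname+Lastname/1234-City+A?what=12345678"))
print(split_name("http://118.tdc.dk/person/Anna+Maria+Jensen/5678-City+B"))
```

An href without a /person/ segment returns None here, so the caller has to decide what to do with those.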


Thank you, Bill, for taking the time to help me out. I really appreciate it!

Your code actually worked, but I had some borderline cases that gave me problems, e.g. if the number belongs to a company, then the href is different, or the result list has more than 1 entry.

 

I ended up with this code, which seems to do the trick in every case:

ui text box("Phone",#phone)
ui stat monitor("Name: ",#name)
ui stat monitor("Address: ",#address)
ui button("Check") {
    navigate("http://118.tdc.dk/search/go?what={#phone}","Wait")
    wait for browser event("Page Loaded","")
    set(#listing0,$scrape attribute(<id="listing0">,"innerhtml"),"Global")
    set(#name,$find regular expression(#listing0,"(?<=<h3>)(?s).*?(?=</h3>)"),"Global")
    set(#name,$trim($find regular expression(#name,"(?<=<a .*?>)(?s).*?(?=</a>)")),"Global")
    set(#address,$find regular expression(#listing0,"(?<=<address .*?>)(?s).*?(?=</address>)"),"Global")
}

So I ended up scraping the element with id listing0 and running regex on its inner HTML to get the information out, then trimming the name. The (?s) is what makes the difference: it lets . match line breaks too. I'm a total rookie when it comes to regex, so it's like learning Latin to me :)
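For reference, the two-step idea (cut out the h3 block first, then take the link text inside it and trim it) can be tried outside UBot with a small Python sketch. Some assumptions: re.DOTALL plays the role of (?s), capture groups replace the variable-width lookbehinds because Python's re does not support them, and the sample markup, including the class attribute on the address tag and the extra "More" link, is invented for the example. The helper names are hypothetical too.

```python
import re
from typing import Optional, Tuple

# Made-up stand-in for the inner HTML of the listing0 element.
listing = """<h3>
    <a href='/person/Firstname+Lastname/1234-City+A'>


        Firstname Lastname
    </a>
</h3>
<address class="adr">Example Street 1, 1234 City A</address>
<a href='/person/Firstname+Lastname/1234-City+A'>More</a>"""


def first_group(pattern: str, text: str) -> Optional[str]:
    """Return the trimmed first capture group, letting '.' match line breaks."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_name_and_address(listing_html: str) -> Tuple[Optional[str], Optional[str]]:
    # Step 1: narrow the search to the contents of the first <h3> block.
    h3_body = first_group(r"<h3>(.*?)</h3>", listing_html)
    # Step 2: take the link text inside that block only.
    name = first_group(r"<a\b[^>]*>(.*?)</a>", h3_body) if h3_body else None
    address = first_group(r"<address\b[^>]*>(.*?)</address>", listing_html)
    return name, address


# Without DOTALL, '.' stops at the line break right after <h3>, so nothing matches.
print(re.search(r"(?<=<h3>).*?(?=</h3>)", listing))
print(extract_name_and_address(listing))
```

Narrowing to the h3 block first is what keeps the other links in the listing from producing duplicate names.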
