UBot Underground

Help isolating data in an address



I'm trying to scrape some addresses and I'm running into problems b/c the target site doesn't really have any detailed labeling in the code. For example:

 

</strong></a></div>
<div>
    100 Broadway<br />
    Everett, MA 02149<br />
    (617)381-9000<br /><strong>
    2.7 miles away</strong>
</div>
</div>

 

So, I can get the street address (100 Broadway), but I can't figure out how to isolate City, State, Zip, & Phone. I really want to pull all of these elements separately (as opposed to chucking the entire address into 1 column of my CSV). Any tips are greatly appreciated! Thanks.


I've run across this numerous times. Your first step is to change the attribute type and see where the different variables are isolated. For example, the default type might be A, but if you change it to Table, you will get different attribute selections in the parameters popup. I hope that makes sense.

 

When you click in the browser, just above "Choose by attribute" you will see the A tag, Div tag, TD tag, etc. THAT is what you want to change in order to see different variations of the attribute. Nine times out of ten I have found a solution this way.

 

John


Thanks for the reply. I stepped back a couple of DIVs and this is what I get:

<DIV class=dealer><DIV class=dealerorder>1</DIV>
<DIV class=dealerinfo>
<DIV class=dealerdetail><A onmouseover="window.status=''; return true;" onmouseout="window.status=''; return true;" href="results.aspx?cs=2&dealer=206944"><STRONG>Acme Cars of Boston </STRONG></A></DIV>
<DIV>100 Broadway<BR>Everett, MA 02149<BR>(617)381-9000<BR><STRONG>2.7 miles away</STRONG> </DIV></DIV>
<DIV class=action>
<DIV class=dealerlinklist><A id=dealerinfolink onmouseover="window.status=''; return true;" onmouseout="window.status=''; return true;" href="javascript:gotolink('results.aspx?cs=2&dealer=206944&position=1');"><IMG id=btn_dealerinfo206944 onmouseover="javascript:document.getElementById('btn_dealerinfo206944').src='/images/tools/dealer-locator/btn_info_on.gif';" onmouseout="javascript:document.getElementById('btn_dealerinfo206944').src='/images/tools/dealer-locator/btn_info.gif';" border=0 src="/images/tools/dealer-locator/btn_info.gif"></A></DIV>
<DIV class=dealerlinklist><A onmouseover="window.status=''; return true;" onmouseout="window.status=''; return true;" onclick="TrackDealerResultRAQClick('206944')" href="/tools/price-quote.aspx?Dealernumber=206944"><IMG id=btn_request206944 onmouseover="javascript:document.getElementById('btn_request206944').src='/images/tools/dealer-locator/btn_requestquote_on.gif';" onmouseout="javascript:document.getElementById('btn_request206944').src='/images/tools/dealer-locator/btn_requestquote.gif';" border=0 src="/images/tools/dealer-locator/btn_requestquote.gif"></A></DIV></DIV>
<DIV class=attribute>
<DIV class=Attroff>Express Service</DIV>
<DIV class=Attroff>Certified Used Dealer</DIV>
<DIV class=Attroff>Internet Certified</DIV></DIV></DIV>

 

However, I'm still not sure what I can leverage to isolate the City, State, Zip, & Phone.

Thanks.


For those parts, try $replacing <BR> with $new line. Then you can do $list from text, with $new line as the delimiter, and you'll have each item as a separate list item.
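
As a rough illustration of the same replace-then-split idea outside UBot, here is a minimal Python sketch. It is not UBot script: the function name `split_on_br` is made up for this example, and the sample string is the unlabeled address DIV from the post above.

```python
import re

# Inner HTML of the unlabeled address <DIV>, copied from the post above.
raw = "100 Broadway<BR>Everett, MA 02149<BR>(617)381-9000<BR><STRONG>2.7 miles away</STRONG> "

def split_on_br(html):
    """Swap each <BR> (any case, optional slash) for a newline, then split into a list."""
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # Drop any remaining tags, such as <STRONG>, so only the text is left.
    text = re.sub(r"<[^>]+>", "", text)
    # Keep non-empty lines, trimmed of surrounding whitespace.
    return [line.strip() for line in text.split("\n") if line.strip()]

parts = split_on_br(raw)
# Assumes every listing has exactly these four lines, which may not hold site-wide.
street, city_state_zip, phone, distance = parts
```

Each list item can then go into its own CSV column. The city/state/zip line still arrives as a single item at this point.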

 

Seth,

 

Thanks for the tips. I "think" I follow what you're saying. I'm still a big newb, but will play around w/ what you suggested. One question though: how can I separate the City/State/Zip (ex. Everett, MA 02149), since there aren't any <BR>s in between those? I really need to have that data in separate columns of my final DB. I guess if worst comes to worst I could do some creative Excel cleanup (text to columns) after, but I was hoping to avoid that.

 

Thanks.
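
For the city/state/zip line, one possible approach is a regular expression that keys on the comma and the two-letter state code. The Python sketch below assumes US-style lines of the form "City, ST 12345" (optionally with a ZIP+4 suffix). The name `split_city_state_zip` and the pattern are illustrative only, not a UBot function.

```python
import re

# Assumption: lines look like "City, ST 12345" or "City, ST 12345-6789".
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)

def split_city_state_zip(line):
    """Return (city, state, zip) for a matching line, or None if it does not fit the pattern."""
    m = CITY_STATE_ZIP.match(line)
    if m is None:
        return None
    # The ZIP stays a string so a leading zero (as in 02149) is not lost.
    return m.group("city"), m.group("state").upper(), m.group("zip")

result = split_city_state_zip("Everett, MA 02149")
```

Lines that do not match come back as None, so odd addresses can be flagged for manual cleanup instead of being split incorrectly.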
