HarryPotter 9 Posted July 6, 2011

Hi, I tried searching for this problem, but I don't really know what to search for and so couldn't find a solution. Please help if it's not too much trouble.

I navigate to a page where I want to scrape a URL. In the ubot browser the URL appears in this format:

http://domain.com/v1.php?pid=1&hid=12345&t=0(0)&zz=123&jump=<sub_tracking%20id>&par=a1234-q.w.e.r%09a.sdfghj%097

But what I scrape is:

http://domain.com/v1.php?pid=1&hid=12345&t=0(0)&zz=123&jump=<sub_tracking[space]id>&par=a1234-q.w.e.r[tab]a.sdfghj[tab]7

I wrote [space] and [tab] because the ubot forum automatically deletes the space and tab within the quote, but they are there.

Because I want to navigate to my scraped URL, I simply do navigate with the scraped URL variable. However, some substitutions occur:

&amp; becomes &
&lt; becomes <
&gt; becomes >
%20 becomes [space]
%09 becomes [tab]

I think these changes are breaking my URL, so I can't navigate to it anymore. What can be done to make the URL work again?
Kreatus (Ubot Ninja) 422 Posted July 6, 2011

You can set the URL in a variable, replace all the modified characters with their originals, and then navigate.
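For readers following along outside ubot, the replace-then-navigate idea can be sketched in Python. This is only an illustration, not ubot script: the helper name `restore_url` and the two-entry replacement table are assumptions, based on the two substitutions the original post says break the URL (space and tab).

```python
# Hypothetical helper: the name and the replacement table are illustrative
# assumptions, covering only the two substitutions named in the thread.
REPLACEMENTS = {
    " ": "%20",   # literal space back to its percent-encoded form
    "\t": "%09",  # literal tab back to its percent-encoded form
}


def restore_url(scraped: str) -> str:
    """Swap each decoded character for its percent-encoded original."""
    for decoded, encoded in REPLACEMENTS.items():
        scraped = scraped.replace(decoded, encoded)
    return scraped


# The scraped URL from the first post, with a real space and real tabs.
scraped_url = (
    "http://domain.com/v1.php?pid=1&hid=12345&t=0(0)&zz=123"
    "&jump=<sub_tracking id>&par=a1234-q.w.e.r\ta.sdfghj\t7"
)
print(restore_url(scraped_url))
```

The table has to list every character that might be decoded, which is the limitation raised in the next reply.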
HarryPotter 9 Posted July 6, 2011

Thanks for that. I did replace the %09 with nothing, but then I realized the URLs will vary so greatly that I don't know how many characters I need to replace. I wonder if there is a more universal method to replace all ASCII characters.
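One general-purpose way to do this outside ubot is to percent-encode everything except a whitelist of characters, rather than listing each bad character. The sketch below uses Python's standard `urllib.parse.quote`. The `SAFE_CHARS` set and the function name `reencode_url` are assumptions chosen to reproduce the URL form shown in the first post; a different site may need a different whitelist.

```python
from urllib.parse import quote

# Assumption: characters to leave untouched so the result matches the URL
# form shown in the first post. "%" is kept safe so escapes that are already
# present are not encoded a second time.
SAFE_CHARS = ":/?&=()<>%"


def reencode_url(scraped: str) -> str:
    """Percent-encode every character that is not a letter, a digit,
    one of "_.-~", or listed in SAFE_CHARS."""
    return quote(scraped, safe=SAFE_CHARS)


# The scraped URL from the first post, with a real space and real tabs.
scraped_url = (
    "http://domain.com/v1.php?pid=1&hid=12345&t=0(0)&zz=123"
    "&jump=<sub_tracking id>&par=a1234-q.w.e.r\ta.sdfghj\t7"
)
print(reencode_url(scraped_url))
```

Because anything outside the whitelist is encoded, this also handles whitespace or control characters that were not anticipated in advance.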
Kreatus (Ubot Ninja) 422 Posted July 6, 2011

Hi, can you share the link where you are trying to scrape these URLs?
HarryPotter 9 Posted July 6, 2011

Unfortunately that is not my call to make. I don't think my biz partner would be comfortable with sharing it. Is there any way to replace these characters easily? Thanks for the help!
Kreatus (Ubot Ninja) 422 Posted July 6, 2011

I asked because there may be another way to scrape the links without ubot automatically converting certain characters. If [space] is converted to codes like these:

%20 %09 %05

you can use this regex to replace them with [space]:

%[0-9]{2}
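To see what that pattern does and does not match, here is a small Python sketch using the standard `re` module. The sample string and the names `DIGITS_ONLY`, `HEX_AWARE`, and `escapes_to_space` are made up for illustration. Percent escapes are two hexadecimal digits, so a digits-only pattern misses codes such as %2F; the hex-aware variant is an addition of this sketch, not something proposed in the thread.

```python
import re

# The pattern from the post above: "%" followed by two decimal digits.
DIGITS_ONLY = re.compile(r"%[0-9]{2}")

# Assumed variant: "%" followed by two hexadecimal digits.
HEX_AWARE = re.compile(r"%[0-9A-Fa-f]{2}")


def escapes_to_space(text: str, pattern: re.Pattern = DIGITS_ONLY) -> str:
    """Replace every match of the pattern with a single space."""
    return pattern.sub(" ", text)


sample = "a%20b%09c%2Fd"  # made-up input for illustration
print(escapes_to_space(sample))             # %2F is left in place
print(escapes_to_space(sample, HEX_AWARE))  # %2F is replaced as well
```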
JohnB 255 Posted July 6, 2011

Are you scraping these URLs from the page itself, or is this the URL in the address bar? If they are clickable links, click on them and then grab the $url.
HarryPotter 9 Posted July 6, 2011

Thanks for the response. These links are from $page scrape. I tried choosing it by attribute and getting the link, but the result is the same as what I get from $page scrape.

Kreatus wrote: "I asked because there may be another way to scrape the links without ubot automatically converting certain characters."

So it is ubot that converts them into characters? Weird, as I have not come across this when scraping other pages or links.
Kreatus (Ubot Ninja) 422 Posted July 6, 2011

HarryPotter wrote: "So it is ubot that converts them into characters? Weird, as I have not come across this when scraping other pages or links."

I think so, when you use $page scrape. Try "choose attribute" and then $scrape chosen attribute > href. It should scrape the correct URL.

Edit: I just realized that you already tried "choose attribute" and then $scrape chosen attribute. In that case I think the last resort will be replacing every wrong character.
HarryPotter 9 Posted July 6, 2011

Alright, will start doing that. Thanks, man!