url string is messed up after scraping

HarryPotter · July 6, 2011

hi,

i tried searching for this problem but don't really know what to search for and so couldn't find a solution. please help if it's not too much trouble for you

so i navigate to this page where i want to scrape a url. the url is in this format in the ubot browser visually:

http://domain.com/v1.php?pid=1&amp;hid=12345&amp;t=0(0)&amp;zz=123&amp;jump=&lt;sub_tracking%20id&gt;&amp;par=a1234-q.w.e.r%09a.sdfghj%097

but what i scrape is:

http://domain.com/v1.php?pid=1&hid=12345&t=0(0)&zz=123&jump=<sub_tracking[space]id>&par=a1234-q.w.e.r[tab]a.sdfghj[tab]7

[space] and [tab] are added because the ubot forum automatically deletes the space and tab within the quote, but they're there

then because i want to navigate to my scraped url, i simply do navigate, scraped url variable. however, some substitutions occur, like:

&amp; becomes &

&lt; becomes <

&gt; becomes >

%20 becomes [space]

%09 becomes [tab]

i think these changes are breaking my url, so i can't navigate to it anymore... what can be done to make the url work again?
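For readers following along, the substitutions listed above look like two ordinary decoding steps: HTML entities turned into characters, then percent-escapes turned into characters. The Python sketch below reproduces that effect under an assumption: the raw href in the page source (hypothetical here, since the real page was never shared) contains `&amp;`, `&lt;`, `&gt;`, `%20` and `%09`. It illustrates the symptom only and is not a description of what ubot does internally.

```python
import html
from urllib.parse import unquote

# Hypothetical href as it might appear in the page source (an assumption).
raw_href = (
    "http://domain.com/v1.php?pid=1&amp;hid=12345&amp;t=0(0)&amp;zz=123"
    "&amp;jump=&lt;sub_tracking%20id&gt;&amp;par=a1234-q.w.e.r%09a.sdfghj%097"
)

# Step 1: HTML entities become literal characters (&amp; -> &, &lt; -> <, &gt; -> >).
entity_decoded = html.unescape(raw_href)

# Step 2: percent-escapes become literal characters (%20 -> space, %09 -> tab).
fully_decoded = unquote(entity_decoded)

print(repr(fully_decoded))
```

The entity step is usually harmless for navigation, since `&` is what the browser requests anyway. The literal space and tab from the second step are the likelier cause of the broken URL.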

Kreatus (Ubot Ninja) · July 6, 2011

You can set the url in a variable, replace all the modified characters with their originals, and then navigate.
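As a rough illustration of that idea outside ubot, here is a minimal Python sketch. The replacement table only covers the two characters reported in this thread (space and tab), and the names `RESTORE`, `scraped` and `restored` are made up for the example.

```python
# Hypothetical replacement table: only the characters reported in this thread.
RESTORE = str.maketrans({" ": "%20", "\t": "%09"})

# The scraped URL as described above, with a literal space and literal tabs.
scraped = (
    "http://domain.com/v1.php?pid=1&hid=12345&t=0(0)&zz=123"
    "&jump=<sub_tracking id>&par=a1234-q.w.e.r\ta.sdfghj\t7"
)

# Put the escapes back before navigating.
restored = scraped.translate(RESTORE)
print(restored)
```

A fixed table like this only works while the set of affected characters is known in advance.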

HarryPotter · July 6, 2011

thanks for that. i did replace the %09 by nothing

then i realized the urls will vary so greatly that i don't know how many characters i need to replace... i wonder if there is a more universal method to replace all ASCII characters
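One general approach, sketched here in Python rather than ubot, is to percent-encode every character that is not a URL delimiter or an unreserved character, instead of listing replacements one by one. The function name `reencode_url` and the choice of `safe` characters are assumptions for this example. It also assumes the scraped string is fully decoded, because a leftover `%20` would be double-encoded to `%2520`.

```python
from urllib.parse import quote

def reencode_url(scraped: str) -> str:
    """Percent-encode everything except URL delimiters and unreserved characters."""
    # Letters, digits and "_.-~" are never encoded by quote();
    # the safe list keeps the URL structure (scheme, path, query separators) intact.
    return quote(scraped, safe=":/?&=()")

scraped = (
    "http://domain.com/v1.php?pid=1&hid=12345&t=0(0)&zz=123"
    "&jump=<sub_tracking id>&par=a1234-q.w.e.r\ta.sdfghj\t7"
)
print(reencode_url(scraped))
```

Note that `<` and `>` come out as `%3C` and `%3E` here. Servers normally treat those as equivalent to the literal characters, but that is worth checking against the target site.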

Kreatus (Ubot Ninja) · July 6, 2011

Hi, can you share the link where you're trying to scrape these urls?

HarryPotter · July 6, 2011

unfortunately that is not my call to make... i don't think my biz partner would be comfortable with sharing it.

any way to replace these characters easily?

thanks for the help!

Kreatus (Ubot Ninja) · July 6, 2011

I asked because there may be another way to scrape the links without ubot automatically converting certain characters.

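If [space] was converted to any of these:

%20
%09
%05

you can use this regex:

%[0-9A-Fa-f]{2}

to replace them with [space]. (Percent-escapes are two hex digits, so the pattern has to allow A-F as well as 0-9.)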
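To show the regex idea in runnable form, here is a small Python sketch. It departs from the suggestion above in one respect: instead of turning every match into a space, it decodes each escape to its own character, since `%09` is a tab and not a space. The names `PERCENT_ESCAPE` and `decode_escapes` are hypothetical, and the callback only handles single-byte ASCII escapes.

```python
import re

# Two hex digits after a percent sign; A-F must be allowed (e.g. %3C).
PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_escapes(text: str) -> str:
    """Replace each %XX escape with the character it stands for (ASCII only)."""
    return PERCENT_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(PERCENT_ESCAPE.findall("%20 %09 %05 %3C"))
print(repr(decode_escapes("sub_tracking%20id%09x%3C")))
```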

JohnB · July 6, 2011

Are you scraping these urls from the page itself, or is this the url in the address bar? If they are clickable links, click on them and then grab the $url.

HarryPotter · July 6, 2011

thanks for the response

these links are from $page scrape

i tried choosing it by attribute and getting the link, but the result is the same as what i get from $page scrape

I asked because there may be another way to scrape the links without ubot automatically converting certain characters.

so it is ubot that converts them into characters?

weird as i have not come across this scraping other pages/ links.

Kreatus (Ubot Ninja) · July 6, 2011

so it is ubot that converts them into characters?

weird as i have not come across this scraping other pages/ links.

I think so, when you use $page scrape.

Try to "choose attribute" then $scrape chosen attribute > href. It should scrape the correct url.

Edit: I just realized that you already tried "choose attribute" then $scrape chosen attribute. In that case I think the last resort will be replacing every wrong character.

HarryPotter · July 6, 2011

alright... will start doing that

thanks man!
