Is it possible to remove part of a scraped url

Abs* · June 26, 2010

Hi

I'm scraping URLs from a certain site. The site has a number of duplicate listings with different URLs. I have noticed that the primary URL for a listing ends with .html, while the other versions have a ? followed by other characters after it.

What I would like to remove is everything after the .html, so everything from the ? to the end.

Is there a way I can do this? When I use add to file with the URL, I want to remove the ? and everything after it before adding it.

I have a feeling I would need to set the URL as a variable first, then replace everything after .html with nothing, then add the variable to the file, but I'm not 100% sure.

If anyone has any experience with this, I would love to hear from you.

thanks

webautomationlab · June 26, 2010

You need to use regex, since I don't think the replace function supports doing this without regex.

One way is to strip \? and everything after it. The \ escapes the ? because ? is a special character in regex. Putting \ before ? tells the function to treat ? as a regular character and not as a special character.
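As a rough illustration of that regex idea outside uBot, here is a minimal Python sketch. The function name `strip_query` and the example.com URL are made up for the example, and the pattern assumes the unwanted part always starts at the first ?.

```python
import re

def strip_query(url):
    # \? matches a literal question mark (unescaped, ? is a quantifier);
    # .* then matches everything that follows it on the line.
    return re.sub(r"\?.*", "", url)

# Hypothetical URL, for illustration only.
print(strip_query("http://example.com/listing/123.html?ref=abc&page=2"))
```

A URL without a ? passes through unchanged, so it should be safe to run on every scraped URL before adding it to the file.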

Another thing you might be able to do is to strip html and everything after it (if html is always the file type) and then add html back on. This ensures you get all trailing variables however they are attached, including anchors like index.html#contact.
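The "strip html and add it back" approach might look something like this in Python. This is only a sketch: `trim_after_html` is a hypothetical helper, and it assumes the first ".html" in the URL is the end of the part you want to keep.

```python
def trim_after_html(url):
    # partition splits at the first ".html" into (before, ".html", after).
    # If ".html" is not found, sep is empty and head is the whole URL.
    head, sep, _tail = url.partition(".html")
    # Put ".html" back on; everything after it is dropped.
    return head + sep

print(trim_after_html("index.html#contact"))
print(trim_after_html("index.html?id=5"))
```

Because the cut is made at ".html" and not at a particular separator, it does not matter whether the extra part is attached with ?, # or something else.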

I'm not a pro at this but there are a couple guys around who are, so they may be able to help.

JohnB · June 26, 2010

I have no experience in doing this, but have had a need a couple of times before. In theory I would think that simply adding scraped content to a list and then using $replace against the list itself (or even possibly a saved file of the list's content) should work.

The caveat to this is I believe the asterisk would have to be recognized as a wildcard outside of the "choose by attribute" function (i.e., in all other parameter settings), and I have no idea whether or not it is.

search-->?*

replace-->[nothing]

It would be a nice, clean "Seek and Destroy" function if it were that simple. I'll have to try it out.

John
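For comparison, the idea of running the replacement against the whole list could be sketched in Python as below. This is not uBot code. The wildcard search ?* is written as the regex \?.* here, and `clean_urls` is a made-up name. Since the goal was to get rid of duplicate listings, the sketch also drops URLs that become identical after cleaning.

```python
import re

def clean_urls(urls):
    seen = set()
    cleaned = []
    for url in urls:
        # Regex version of "search ?* and replace with nothing".
        base = re.sub(r"\?.*", "", url)
        # Keep only the first occurrence of each cleaned URL.
        if base not in seen:
            seen.add(base)
            cleaned.append(base)
    return cleaned

# Hypothetical scraped list, for illustration only.
scraped = ["a.html?ref=1", "a.html", "b.html?sort=asc", "a.html?ref=2"]
print(clean_urls(scraped))
```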

PsychoDad · July 1, 2010

Hi!

The version from John doesn't work afaik but this one will work:

You can also choose the "?" as a delimiter and not add the ".html" to #url

But that's the version I use because some sites use structures like

"site.com/file.html/SESSIONID=x/othervariable=y/"

url.ubot
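The contents of the attachment are not shown here, but the delimiter idea can be sketched in Python to see why splitting on "?" alone is not always enough. `before_delimiter` is a hypothetical helper, not a uBot function.

```python
def before_delimiter(url, delimiter="?"):
    # Split once at the first delimiter and keep the part before it.
    # If the delimiter is absent, the whole URL is returned.
    return url.split(delimiter, 1)[0]

# A plain query string is removed as expected.
print(before_delimiter("site.com/file.html?id=5"))

# The session-style structure has no "?", so nothing is removed.
print(before_delimiter("site.com/file.html/SESSIONID=x/othervariable=y/"))

# Using ".html" as the delimiter and adding it back handles both cases,
# assuming every wanted URL really contains ".html".
print(before_delimiter("site.com/file.html/SESSIONID=x/othervariable=y/", ".html") + ".html")
```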
