UBot Underground

Is it possible to remove part of a scraped url



Hi

I'm scraping URLs from a certain site. The site has a number of duplicate listings, each with a different URL. I have noticed that the primary URL for a listing ends with .html, while the other versions have a ? followed by more characters after it.

What I would like to remove is everything after the .html, that is, everything from the ? to the end.

Is there a way I can do this? When I add the URL to a file, I want to remove the ? and everything after it first.

I have a feeling I would need to set the URL as a variable first, then replace everything after .html with nothing, then add the variable to the file, but I'm not 100% sure.

If anyone has any experience with this, I would love to hear from you.

Thanks


You need to use regex, since I don't think the replace function supports doing this without regex.

 

One way is to strip \? and everything after it. The \ escapes the ? because ? is a special character in regex. Putting \ before ? tells the function to treat ? as a literal character and not as a special character.
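As a rough illustration of that pattern, here is a minimal sketch in Python rather than UBot script (the thread does not show the UBot syntax, so this is only a stand-in). The function name and the example URL are made up for the demonstration.

```python
import re

def strip_query(url: str) -> str:
    """Remove the first literal '?' and everything after it."""
    # \? matches a literal question mark; .* matches the rest of the string.
    return re.sub(r"\?.*$", "", url)

# Hypothetical URL, not taken from the thread.
print(strip_query("http://example.com/listing-123.html?ref=search&page=2"))
```

A URL without a ? passes through unchanged, so the same call can be applied to every scraped URL before it is added to the file.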

 

Another thing you might be able to do is to strip html and everything after it (if html is always the file type) and then add html back on. This ensures you remove all trailing variables however they are attached, including anchors like index.html#contact.
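The "cut at html" idea could look something like the sketch below. It is written in Python as a stand-in for UBot script, and instead of stripping and re-adding html it captures everything up to and including the first ".html", which gives the same result. The function name is hypothetical, and the sketch assumes ".html" appears in lowercase.

```python
import re

def keep_through_html(url: str) -> str:
    """Keep the URL up to and including the first '.html'."""
    # .*? is lazy, so the match stops at the first ".html" it finds.
    match = re.match(r".*?\.html", url)
    # Assumption: URLs without ".html" are left untouched.
    return match.group(0) if match else url

print(keep_through_html("index.html#contact"))
```

Unlike a pattern that only looks for ?, this also drops anchors and anything else appended after the file name.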

 

I'm not a pro at this but there are a couple guys around who are, so they may be able to help.


I have no experience in doing this, but have had a need a couple of times before. In theory I would think that simply adding scraped content to a list and then using $replace against the list itself (or even possibly a saved file of the list's content) should work.

 

The caveat is that the asterisk would have to be recognized as a wildcard outside of the "choose by attribute" function (i.e., in all other parameter settings), and I have no idea whether it is.

 

search-->?*

 

replace-->[nothing]
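For comparison, here is how that search string behaves if it is handed to a regex engine instead of a wildcard matcher. This is a sketch using Python's re module as a stand-in; whether UBot treats the search field as a wildcard, a plain string, or a regex is not confirmed anywhere in this thread. The helper name and the URL are made up.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

WILDCARD_STYLE = "?*"   # the search string as written above
REGEX_STYLE = r"\?.*"   # regex spelling of "a ? and everything after it"

url = "http://example.com/listing.html?sort=price"  # hypothetical URL

print(compiles(WILDCARD_STYLE))      # rejected: ? has nothing to repeat
print(re.sub(REGEX_STYLE, "", url))  # the escaped form does the stripping
```

In regex, both ? and * are quantifiers, so the wildcard spelling has to be rewritten as an escaped question mark followed by .* to mean the same thing.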

 

It would be a nice, clean "Seek and Destroy" function if it were that simple. I'll have to try it out.

 

John


Hi!

 

The version from John doesn't work as far as I know, but the attached one will.

You can also choose "?" as the delimiter and skip adding ".html" back to #url.

I use the attached version, though, because some sites use structures like

site.com/file.html/SESSIONID=x/othervariable=y/

url.ubot
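The contents of the attachment are not shown here, so the following is only a Python sketch of the two delimiter choices described in this post, not a transcription of the UBot script. Both function names are hypothetical.

```python
def cut_at_question_mark(url: str) -> str:
    """Split on '?' and keep the first piece."""
    return url.split("?", 1)[0]

def cut_at_html(url: str) -> str:
    """Split on '.html', keep the first piece, and put '.html' back."""
    head, sep, _rest = url.partition(".html")
    # sep is empty when ".html" is absent, so such URLs come back unchanged.
    return head + sep

session_url = "site.com/file.html/SESSIONID=x/othervariable=y/"
print(cut_at_question_mark(session_url))  # no '?', so nothing is removed
print(cut_at_html(session_url))
```

The "?" delimiter is enough for ordinary query strings, but only the ".html" delimiter cleans up URLs where the extra variables are appended as path segments.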
