Abs* 12 Posted June 26, 2010

Hi,

I'm scraping URLs from a certain site. The site has a number of duplicate listings, each with a different URL. I have noticed that the primary URL for a listing ends with .html, while the other versions have a ? followed by other characters after it.

What I would like to do is remove everything after the .html, that is, everything from the ? to the end. Is there a way to do this, so that when I add the URL to a file, the ? and everything after it has already been removed?

I have a feeling I would need to set the URL as a variable first, then replace everything after .html with nothing, then add the variable to the file, but I'm not 100% sure. If anyone has any experience with this, I would love to hear from you.

Thanks
webautomationlab 21 Posted June 26, 2010

You need to use regex, since I don't think the replace function supports doing this without regex.

One way is to strip \? and everything after it. The \ escapes the ?, because ? is a special character in regex. Putting \ before ? tells the function to treat ? as a regular character and not as a special character.

Another thing you might be able to do is strip html and everything after it (if html is always the file type) and then add html back on. This ensures you remove all trailing variables however they are attached, including anchors like index.html#contact.

I'm not a pro at this, but there are a couple of guys around who are, so they may be able to help.
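The two regex approaches described above can be sketched as follows. This is only an illustration in Python, not UBot script, and the function names and example URLs are made up for the sketch. It assumes the escape character is a backslash, as in most regex flavors.

```python
import re


def strip_query(url: str) -> str:
    """Remove the first '?' and everything after it.

    The backslash makes '?' a literal character instead of a quantifier.
    """
    return re.sub(r"\?.*$", "", url)


def strip_after_html(url: str) -> str:
    """Keep everything up to and including the first '.html'.

    The '.html' part is captured and put back, so only the trailing
    text (query strings, anchors, and so on) is dropped.
    """
    return re.sub(r"(\.html).*$", r"\1", url)


if __name__ == "__main__":
    # Hypothetical URLs, used only to illustrate the two patterns.
    print(strip_query("http://example.com/listing-1.html?ref=abc&x=1"))
    print(strip_after_html("http://example.com/index.html#contact"))
```

The second function is the more aggressive of the two: it also removes anchors and anything else that follows .html, but it only helps when every primary URL really does end in .html.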
JohnB 255 Posted June 26, 2010

I have no experience in doing this, but I have had a need for it a couple of times before.

In theory, I would think that simply adding the scraped content to a list and then using $replace against the list itself (or possibly even a saved file of the list's content) should work. The caveat is that the asterisk would have to be recognized as a wildcard outside of the "choose by attribute" function (i.e. in all other parameter settings), and I have no idea whether it is.

search-->?*
replace-->[nothing]

It would be a nice, clean "Seek and Destroy" function if it were that simple. I'll have to try it out.

John
PsychoDad 9 Posted July 1, 2010

Hi!

The version from John doesn't work afaik, but this one will work:

You can also choose "?" as the delimiter and not add ".html" back to #url. But that's the version I use, because some sites use structures like "site.com/file.html/SESSIONID=x/othervariable=y/".

url.ubot
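The delimiter idea can also be sketched without regex. The following is a rough Python illustration, not the contents of the UBot script referred to above. The names canonical_url and dedupe are hypothetical, and the sketch assumes every primary URL contains ".html".

```python
def canonical_url(url: str, marker: str = ".html") -> str:
    """Cut the URL at the first occurrence of marker and keep the marker.

    If the marker is not present, the URL is returned unchanged.
    """
    head, sep, _tail = url.partition(marker)
    return head + sep


def dedupe(urls):
    """Return cleaned URLs in their original order, without duplicates."""
    seen = set()
    result = []
    for url in urls:
        clean = canonical_url(url)
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result


if __name__ == "__main__":
    # Hypothetical scraped URLs, including the path-style session variant.
    scraped = [
        "site.com/file.html",
        "site.com/file.html?ref=1",
        "site.com/file.html/SESSIONID=x/othervariable=y/",
    ]
    print(dedupe(scraped))
```

Splitting on ".html" instead of "?" is what lets the path-style variant collapse to the same URL as the others; splitting on "?" alone would leave the session segments in place.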