beatngu 65 Posted April 16, 2013

I'm having trouble getting a regex right. Is there a way to make this regex I created scrape only one picture URL instead of all of them?

(http://)\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}/([0-9-]+)/([A-Za-z0-9-]+).jpg

When I scrape using this, I get a list that looks like this:

http://www.digitale-weegschalen.nl/10-51/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-53/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-52/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-50/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/11-57/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-55/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-60/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-58/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-56/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-54/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-61/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-59/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/12-62/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-67/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-65/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-64/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-63/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-68/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-66/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/13-71/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-69/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-72/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-70/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-73/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-76/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-74/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-75/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-78/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-80/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-79/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-77/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-83/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-84/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-81/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-82/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-87/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-85/iphone-weegschaal.jpg

So is there a way with regex to get only one .jpg for each product? Or is there any other way to clean up my list?
Legend 181 Posted April 16, 2013

You could try something like this:

([/])([A-Za-z _-]|%20|[0-9])+\w+(.jpg)

or this:

[a-zA-Z0-9-_()%20\.]+.(\.jpg)
beatngu 65 Posted April 17, 2013 Author

Hi Duane,

I tried this, and it gives me only the last part of the URL. What I need is the whole URL, but only once per product. The regex I posted gives me the whole URL, but I would need something like this:

http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg

instead of this:

http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-87/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-85/iphone-weegschaal.jpg

I suck at regex, and I'm not sure whether what I want is even possible. I am scraping from the sitemap of the website shown above, but maybe there is a simpler way to do this? What I need from the sitemap is the full picture URL, but only once per product.
VaultBoss 310 Posted April 17, 2013

While each link may point to the same image, the links themselves ARE different, so any suitable regex would return them all. Even if you focus the regex on the last part of the URL (the part shared by several links), it would still select them all.

Probably, when you added them to a list in UBS with the advanced option to "Delete" dupes, the duplicates were deleted and only the first instance was kept, but at the expense of not storing the first part of the URL...

You cannot construct a regex that does not follow the same rules for each row of data! In other words, you cannot have the regex select the link the first time and 'forget' about the rest... unless the rows were identical (in which case you could specify how many times you would like the regex to repeat the search). BUT, I'll say it again: your URLs are all different, so it is only normal that the regex returns all the results.

However, once you have them loaded into a list in UBS, you could LOOP through the list and save to another list each element that is different from the one before it, using only the last part of the URL for the comparison. I cannot imagine any other method to obtain the results that you want.

If anyone can prove me wrong, I'd be happy to learn how, but afaik you won't be able to achieve what you seek in a single pass with regex, directly, without extra manipulation.

Hope this helps...
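The loop described above can be sketched outside UBot Studio. What follows is a minimal Python illustration, not UBot script. The names `filename_of` and `keep_first_of_each_run` are hypothetical, and the sketch assumes that the duplicates of one product sit next to each other in the list, as they do in the scraped output earlier in the thread.

```python
from urllib.parse import urlsplit


def filename_of(url: str) -> str:
    """Return the last path segment, e.g. 'iphone-weegschaal.jpg'."""
    return urlsplit(url).path.rsplit("/", 1)[-1]


def keep_first_of_each_run(urls: list[str]) -> list[str]:
    """Keep a URL only when its file name differs from the previous URL's."""
    kept: list[str] = []
    previous_name = None
    for url in urls:
        name = filename_of(url)
        if name != previous_name:
            kept.append(url)  # first URL of a new run of identical file names
        previous_name = name
    return kept


# A short excerpt of the scraped list, used here only as sample data.
scraped = [
    "http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/16-83/pak-sigaretten-digitale-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/16-84/pak-sigaretten-digitale-weegschaal.jpg",
]

for url in keep_first_of_each_run(scraped):
    print(url)
```

Because only neighbouring entries are compared, a file name that reappears later in the list, after a different product, would be kept a second time. Sorting the list first, or tracking every name seen so far, avoids that.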
Anonym 53 Posted April 17, 2013

Why not use a combination of regexps and lists where duplicates are deleted? You could start with something like this (I put the links in a file for test purposes):

ui open file("Links", #linksFile)
clear list(%fileLinks)
add list to list(%fileLinks, $find regular expression($list from text($list from file(#linksFile), "||"), "(?<=http:\\/\\/www\\.digitale-weegschalen\\.nl/[0-9-]+/).+\\.jpg"), "Delete", "Global")

If you need an even more generic regexp, you could use:

(?<=http:\/\/.*/[0-9-]+/).+\.jpg

Then you could loop over a list containing all the links and pull out the complete link for each unique image stored in the list %fileLinks. Or why not add the full list of links to a table and do a table search for the occurrences of the entries in the %fileLinks list? It's not optimal, but either method will work.

As mentioned by VaultBoss, a single strike with a regexp is impossible AFAIK.
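The two-step idea above (pull out the unique image names, then look up one complete link for each) can be collapsed into a single pass over the list. Here is a rough Python sketch of that idea, not UBot script. The function name `first_url_per_image` and the pattern `IMAGE_URL` are hypothetical. A capture group is used instead of the look-behind because Python's `re` module rejects look-behinds of variable width, such as the one containing `.*` above.

```python
import re

# Group 1: scheme, host and numeric folder; group 2: the image file name.
IMAGE_URL = re.compile(r"^(https?://[^/]+/[0-9-]+/)([^/]+\.jpg)$")


def first_url_per_image(urls):
    """Return the first full URL seen for each distinct image file name."""
    first_seen = {}
    for url in urls:
        match = IMAGE_URL.match(url)
        if match is None:
            continue  # not an image URL of the expected shape; skip it
        # setdefault stores the URL only if this file name is new.
        first_seen.setdefault(match.group(2), url)
    # Dicts preserve insertion order, so results follow first appearance.
    return list(first_seen.values())


# Sample data: duplicates are deliberately not adjacent.
links = [
    "http://www.digitale-weegschalen.nl/14-73/digitale-sieraden-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/15-78/opvouwbare-pocket-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/14-76/digitale-sieraden-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/sitemap.xml",
]

for url in first_url_per_image(links):
    print(url)
```

Keying on the file name mirrors the comparison suggested in the thread. If two different products ever shared a file name, the numeric folder prefix in group 1 would be a safer key, but that depends on how the site names its folders.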