Help with regex scrape

beatngu · April 16, 2013

I'm having trouble getting a regex right.

Is there a way to make this regex I created scrape only one picture URL per product instead of all of them?

(http://)\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}/([0-9-]+)/([A-Za-z0-9-]+).jpg

When I scrape using this, I get a list that looks like this:

http://www.digitale-weegschalen.nl/10-51/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-53/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-52/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-50/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/11-57/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-55/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-60/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-58/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-56/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-54/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-61/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-59/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/12-62/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-67/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-65/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-64/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-63/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-68/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-66/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/13-71/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-69/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-72/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-70/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-73/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-76/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-74/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-75/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-78/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-80/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-79/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-77/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-83/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-84/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-81/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-82/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-87/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-85/iphone-weegschaal.jpg

So is there a way with regex to get only one .jpg for each product?

Or is there another way to clean up my list?

Legend · April 16, 2013

You could try something like this:

([/])([A-Za-z _-]|%20|[0-9])+\w+(.jpg)

or this:

[a-zA-Z0-9-_()%20\.]+.(\.jpg)


beatngu · April 17, 2013

Hi Duane,

I tried this, but it only gives me the last part of the URL.

What I need is the whole URL, but only once per product.

The regex I posted gives me the whole URL, but I need something like this:

http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg

instead of this:

http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-87/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-85/iphone-weegschaal.jpg

I suck at regex and I'm not sure if it's even possible to get what I want.

I am scraping from the sitemap of the website shown above, but maybe there is a simpler way to do what I want?

What I need from the sitemap is the full picture URL, but only once per product.

VaultBoss · April 17, 2013

While each link may point to the same image, the links themselves ARE different, so any suitable regex would return them all.

Even if you focus the regex on the last part of the URL (the part common to more than one link), it would still select them all.

Probably, when you added them to a list in UBS with the advanced option to "Delete" dupes, the duplicates got deleted and only the first instance was kept, but at the expense of not storing the first part of the URL...

You cannot construct a regex that would not follow the same rules for each row of data!

In other words, you cannot have the regex select the link first time and 'forget' about the rest... unless they would be identical (if so, you could specify how many times you would like the regex to repeat the search).

BUT, I'll say again, your URLs are different, all of them.

So it is just normal that the regex would return all the results.

However, once you got them loaded into a list in UBS, you could LOOP through the list and save to another list each element that is different from the one before it, using only the last part of the URL for comparison.
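For anyone who wants to see that loop outside UBS, here is a minimal sketch in Python. The function name `first_url_per_image` is made up for illustration, and it is a small variation on the idea above: instead of comparing each element only with the one before it, it remembers every file name already seen, so it should also cope with a list where the links of one product are not next to each other.

```python
from urllib.parse import urlsplit


def first_url_per_image(urls):
    """Keep the first URL seen for each image file name (hypothetical helper)."""
    seen = set()
    kept = []
    for url in urls:
        # Compare only the last path segment, e.g. "iphone-weegschaal.jpg".
        name = urlsplit(url).path.rsplit("/", 1)[-1]
        if name not in seen:
            seen.add(name)
            kept.append(url)
    return kept


# A few links taken from the list posted earlier in the thread.
urls = [
    "http://www.digitale-weegschalen.nl/16-83/pak-sigaretten-digitale-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/16-84/pak-sigaretten-digitale-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg",
]

for url in first_url_per_image(urls):
    print(url)
```

The regex still returns every link. The filtering happens afterwards, which matches the point that a single regex pass alone will not do it.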

I cannot imagine any other method to obtain the results that you want.

If anyone could prove me wrong, I'd be happy to learn how, but afaik you won't be able to achieve what you seek from a single pass with regex, directly, w/o extra manipulations.

Hope this helps...

Anonym · April 17, 2013

Why not a combination of regexps and lists where duplicates are deleted?

You could start with something like this:

(I put the links in a file for test purposes)

ui open file("Links", #linksFile)
clear list(%fileLinks)
add list to list(%fileLinks, $find regular expression($list from text($list from file(#linksFile), "||"), "(?<=http:\\/\\/www\\.digitale-weegschalen\\.nl/[0-9-]+/).+\\.jpg"), "Delete", "Global")

If you need an even more generic regexp, then you could use: (?<=http:\/\/.*/[0-9-]+/).+\.jpg

Then you could loop through a list containing all the links and pull out the complete link for each unique image stored in the list %fileLinks. Or you could add the full list of links to a table and do a table search for the occurrences of each entry in the %fileLinks list.

It's not optimal, but either method will work.
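As a rough illustration of the two steps described above (unique image names first, then the full link for each), here is a sketch in Python rather than UBS. The names `IMAGE_URL`, `unique_names` and `full_link_for` are hypothetical. One assumption to flag: Python's `re` module only accepts fixed-width lookbehinds, so the sketch uses a capture group for the file name in place of the `(?<=...)` construct from the regexps above.

```python
import re

# Group 0 is the whole URL, group 1 is only the file name.
IMAGE_URL = re.compile(r"http://[^/\s]+/[0-9-]+/([^/\s]+\.jpg)")


def unique_names(text):
    """Step 1: collect each image file name once, in order of appearance."""
    names = []
    for match in IMAGE_URL.finditer(text):
        if match.group(1) not in names:
            names.append(match.group(1))
    return names


def full_link_for(name, text):
    """Step 2: return the first complete link that ends in the given name."""
    for match in IMAGE_URL.finditer(text):
        if match.group(1) == name:
            return match.group(0)
    return None


# Sample input built from links posted earlier in the thread.
sitemap_text = """
http://www.digitale-weegschalen.nl/14-73/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-76/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-78/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-80/opvouwbare-pocket-weegschaal.jpg
"""

links = [full_link_for(name, sitemap_text) for name in unique_names(sitemap_text)]
for link in links:
    print(link)
```

Like the list and table variants, this scans the links more than once, so it is not optimal either, but it keeps the complete URL for each unique image.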

As mentioned by VaultBoss, one single strike with a regexp is impossible AFAIK.
