UBot Underground

I am having trouble getting a regex right.

Is there a way to make this regex I created scrape only one picture URL per product instead of all of them?

(http://)\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}/([0-9-]+)/([A-Za-z0-9-]+)\.jpg

 

When I scrape using this, I get a list that looks like this:

http://www.digitale-weegschalen.nl/10-51/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-53/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-52/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/10-50/ultradunne-mini-lcd-display-digitale-weegschaal-met-mp3-speler-ontwerp.jpg
http://www.digitale-weegschalen.nl/11-57/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-55/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-60/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-58/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-56/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-54/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-61/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/11-59/2-inch-touch-screen-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/12-62/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-67/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-65/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-64/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-63/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-68/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/12-66/mini-digitale-elektronische-weegschaal-zippo-style-box.jpg
http://www.digitale-weegschalen.nl/13-71/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-69/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-72/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/13-70/draagbare-digitale-keuken-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-73/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-76/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-74/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/14-75/digitale-sieraden-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-78/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-80/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-79/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/15-77/opvouwbare-pocket-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-83/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-84/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-81/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-82/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-87/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-85/iphone-weegschaal.jpg

 

So is there a way with regex to get only one .jpg from each product?

Or is there any other way to clean up my list?


Hi Duane,

I tried this, but it only gives me the last part of the URL.

What I need is the whole URL, but only once per product.

The regex I posted gives me the whole URL, but I would need something like this:

http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg

 

instead of this:

http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-87/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-85/iphone-weegschaal.jpg

 

I suck at regex and I'm not sure whether what I want is even possible.

I am scraping from the sitemap of the website shown above, but maybe there is a simpler way to do what I want?

What I need from the sitemap is the full picture URL, but only once per product.


While each link may point to the same image, the links themselves ARE different, so any suitable regex would return them all.

 

Even if you focus the regex on the last part of the URL (the part shared by more than one link), it would still select them all.

When you added the matches to a list in UBS with the advanced option set to "Delete" duplicates, the duplicates were probably removed and only the first instance was kept, but at the cost of losing the first part of each URL...

 

A regex applies the same rules to every row of data; you cannot construct one that treats some rows differently!

In other words, you cannot have the regex select the link first time and 'forget' about the rest... unless they would be identical (if so, you could specify how many times you would like the regex to repeat the search).

 

BUT, I'll say again, your URLs are different, all of them.

So it is just normal that the regex would return all the results.

 

However, once you have them loaded into a list in UBS, you could LOOP through the list and copy to a second list each element that differs from the one before it, comparing only the last part of the URL (the file name).
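To make the loop idea concrete, here is a minimal sketch in Python rather than UBot script (that language choice, and the function name `keep_first_per_run`, are mine and purely illustrative). It assumes the scraped list keeps all URLs of one product next to each other, as in the list posted above, and compares only the part after the last slash.

```python
def keep_first_per_run(urls):
    """Keep a URL only when its file name differs from the previous one."""
    kept = []
    previous_name = None
    for url in urls:
        # The last part of the URL, e.g. "iphone-weegschaal.jpg".
        name = url.rsplit("/", 1)[-1]
        if name != previous_name:
            kept.append(url)
        previous_name = name
    return kept


# A few URLs taken from the list in the first post.
sample = [
    "http://www.digitale-weegschalen.nl/16-81/pak-sigaretten-digitale-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/16-82/pak-sigaretten-digitale-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg",
    "http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg",
]

for url in keep_first_per_run(sample):
    print(url)
```

If the same product could show up again later in the list, comparing against a set of all names seen so far would be the safer variant.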

 

I cannot imagine any other method to obtain the results that you want.

 

If anyone could prove me wrong, I'd be happy to learn how, but afaik you won't be able to achieve what you seek from a single pass with regex, directly, w/o extra manipulations.

 

Hope this helps...


Why not a combination of regexps and lists where duplicates are deleted?

 

You could start with something like this:

(I put the links in a file for test purposes)

 

 

ui open file("Links", #linksFile)
clear list(%fileLinks)
add list to list(%fileLinks, $find regular expression($list from text($list from file(#linksFile), "||"), "(?<=http:\\/\\/www\\.digitale-weegschalen\\.nl/[0-9-]+/).+\\.jpg"), "Delete", "Global")

 

 

If you need an even more generic regexp, then you could use:  (?<=http:\/\/.*/[0-9-]+/).+\.jpg

 

Then you could loop through a list containing all the links and pull out the complete link for each unique image name stored in %fileLinks. Alternatively, add the full list of links to a table and search the table for occurrences of each entry in %fileLinks.
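As a rough illustration of that two-step idea (find the unique image names, then recover one full link for each), here is a sketch in Python instead of UBot script. Python's `re` module does not accept the variable-length lookbehind used in the regexps above, so this version uses a capturing group for the file name instead; the names `IMAGE_RE` and `unique_images` are hypothetical.

```python
import re

# Group 1: everything up to and including the numeric folder.
# Group 2: the image file name shared by all pictures of one product.
IMAGE_RE = re.compile(r"(http://[^\s/]+/[0-9-]+/)([A-Za-z0-9-]+\.jpg)")


def unique_images(text):
    """Return the first full URL found for each distinct image file name."""
    first_seen = {}
    for match in IMAGE_RE.finditer(text):
        name = match.group(2)
        # setdefault stores the full URL only the first time a name appears.
        first_seen.setdefault(name, match.group(0))
    # Dicts keep insertion order (Python 3.7+), so page order is preserved.
    return list(first_seen.values())


# Stand-in for the sitemap text, using URLs from the first post.
sitemap_text = """
http://www.digitale-weegschalen.nl/16-83/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/16-84/pak-sigaretten-digitale-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-86/iphone-weegschaal.jpg
http://www.digitale-weegschalen.nl/17-88/iphone-weegschaal.jpg
"""

for url in unique_images(sitemap_text):
    print(url)
```

This still uses the regex plus a second pass over the matches, so it is in line with the point that the regex alone does not do the deduplication.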

It's not optimal, but either method will work.

 

As mentioned by VaultBoss, one single strike with a regexp is impossible AFAIK.
