UBot Underground

Need some guidance for cleaning scraped urls



Hi guys,

I'm having a little difficulty scraping URLs and saving them in the format I would like.

 

An example url that I am working with is hxxp://www.warezforum.info/music/1592425-world-dance-hits-2010-a.html

 

If you go to the URL, you will see that all the URLs I need are inside the Code: box. I can scrape each URL on the page with no issues. The problem is cleaning them: when I scrape and then save to file, it also picks up the word Code: followed by a blank line.

 

Another URL is hxxp://www.warezforum.info/tv-shows/1591961-friday-night-lights-s04e10-hdtv-xvid-lol.html

 

This page is a little more tricky. Inside the Code: box there are also line breaks and a heading for each type of link. What I want to do is scrape all of the links and nothing else, so basically everything that starts with http://
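
To make the "everything that starts with http://" idea concrete, here is a minimal sketch in Python rather than UBot script. The sample text, the example hosts, and the name `extract_links` are all made up for illustration; the only assumption is that the contents of the Code: box have already been scraped into a single string.

```python
import re

# Hypothetical sample of what a scraped Code: box might contain:
# the word "Code:", blank lines, headings, and the links themselves.
scraped = """Code:

Rapidshare
http://rapidshare.example/files/1/part1.rar
http://rapidshare.example/files/1/part2.rar

Hotfile
http://hotfile.example/dl/2/part1.rar
"""

# Match http:// or https:// followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_links(text):
    """Return every http(s) link found in text, in order of appearance."""
    return URL_PATTERN.findall(text)


links = extract_links(scraped)
for link in links:
    print(link)
```

Because the pattern only matches text beginning with http:// or https://, the word Code:, the headings, and the blank lines are dropped without any separate cleanup step. A pattern this simple will also pick up trailing punctuation if a link is immediately followed by a comma or bracket, so it may need tightening for other pages.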

 

Just wondering if anyone could give me a hand. Because of the line breaks and spacing, I am not able to control where each link ends up in the CSV file.

 

I have also tried using "{1}", where {1} is the links URL file, when adding it to the CSV, but it still does not keep its format.

 

I have given an example image below to show what I mean. Any help would be great. The CSV file has two columns: the first is for the thread URL, which is scraped when I navigate to the page, and the second is for the links. However, because of the spaces and line breaks, the links appear in the first column with the thread URL, and normally just the word Code: appears in the links column.
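
For reference, the usual reason a value with line breaks spills into the wrong column is that the field is written to the CSV without proper quoting. Below is a minimal Python sketch (not UBot script) of writing a thread URL and a multi-line block of links as one two-column row. The function name `write_row` and the example URLs are hypothetical.

```python
import csv
import io


def write_row(out, thread_url, links_text):
    """Write one CSV row: thread URL in column 1, links in column 2.

    QUOTE_ALL wraps every field in double quotes, so line breaks inside
    links_text stay inside the second column instead of starting a new row.
    """
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow([thread_url, links_text])


# In-memory buffer standing in for the real CSV file.
buffer = io.StringIO()
write_row(
    buffer,
    "http://forum.example/thread-1.html",
    "http://a.example/1\nhttp://a.example/2",
)

# Read the data back to confirm it is still one row with two columns.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

Reading the data back gives a single row with two fields, which suggests the line breaks themselves are not the problem as long as the writer quotes the field. Whether the spreadsheet program then shows the links in one cell depends on how it imports quoted fields.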

 

thanks

 

http://www.bigseotechniques.com/scraperimage.GIF


OK, try this. It works fine for the first URL you gave in the post. If you need more help, PM me (if it works, of course).

 

Thx

 

Pftg4Urls.ubot

 

Hi, thanks a lot. It worked great for the first one but not the second. You have given me a great idea with scrape page, though. There is so much more control using it, and it didn't even strike me to use it. Instead I've been using scrape chosen attribute.

 

I really like the way you have managed to remove the lines. I'm going through it, but I really can't figure the entire process out.

 

Would you mind walking me through the coding you have done, especially the part where you are using the set and replace commands?

 

thanks


Hi,

I've managed to get it working so that it scrapes all the URLs and leaves out the word Code:

 

One issue I still have is with saving to CSV. I have two columns: one for the thread URL and the other for the download links. The problem now is that when I save to CSV, it leaves too many line breaks.
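
One way to avoid stray line breaks in the output is to drop the blank lines before saving and write one row per link. The Python sketch below shows the idea. It is not UBot script, and `rows_for_thread`, the sample text, and the example URLs are made up for illustration.

```python
import csv
import io


def rows_for_thread(thread_url, scraped_text):
    """Build one [thread_url, link] row per link, skipping everything else."""
    rows = []
    for line in scraped_text.splitlines():
        line = line.strip()  # remove surrounding spaces and stray \r characters
        if line.startswith(("http://", "https://")):
            rows.append([thread_url, line])
    return rows


# Hypothetical scraped text with the kind of extra blank lines described above.
sample = "Code:\n\nhttp://host.example/a.rar\n\n\nhttp://host.example/b.rar\n"
rows = rows_for_thread("http://forum.example/thread-1.html", sample)

# newline="" stops Python from adding its own line-ending translation on top
# of the "\r\n" the csv module already writes after each row.
out = io.StringIO(newline="")
csv.writer(out).writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

With this layout the thread URL is repeated on every row, which keeps each link in its own cell of the second column. If the file is written from Python on Windows, opening it with `newline=""` is also what prevents an empty row from appearing after every record.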

 

I have attached two images: one showing how I am scraping, and the other showing the populated CSV file.

 

thanks

 

http://www.bigseotechniques.com/scraperimage1.GIF

http://www.bigseotechniques.com/csvscreenshot.GIF
