Need some guidance for cleaning scraped urls

Abs* · July 10, 2010

Hi guys,

I'm having a little difficulty scraping URLs and saving them in the format I want.

An example URL that I am working with is hxxp://www.warezforum.info/music/1592425-world-dance-hits-2010-a.html

If you go to that URL you will see that all the URLs I need are inside the Code: box. I can scrape each URL on the page with no issues. The problem is cleaning them: when I scrape and then save to file, it also picks up the word Code: followed by a blank line.

Another URL is hxxp://www.warezforum.info/tv-shows/1591961-friday-night-lights-s04e10-hdtv-xvid-lol.html

This page is a little trickier. Inside the Code: box there are also line breaks and a heading for each type of link. What I want is to scrape all of the links and nothing else, so basically everything that starts with http://
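For anyone reading along outside UBot, the "keep only what starts with http://" idea can be sketched in Python with a regular expression. This is only an illustration, not the UBot solution from this thread: the sample text, the host names and the `extract_links` helper are all made up for the example.

```python
import re

# Hypothetical sample of text scraped from a Code: box
# (a "Code:" label, blank lines, headings and links mixed together).
scraped_text = """Code:

Rapidshare
http://rapidshare.example/files/1/part1.rar
http://rapidshare.example/files/1/part2.rar

Hotfile
http://hotfile.example/dl/2/part1.rar
"""

# Match http:// or https:// followed by a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_links(text):
    """Return every URL-like token in text, in order of appearance."""
    return URL_PATTERN.findall(text)


links = extract_links(scraped_text)
for link in links:
    print(link)
```

Because the pattern only matches tokens beginning with http:// or https://, the Code: label, the headings and the blank lines are simply never matched, so nothing has to be removed afterwards.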

Just wondering if anyone could give me a hand. Because of the line breaks and spacing, I can't control where each link ends up in the CSV file.

I have also tried using "{1}" (where {1} is the links URL file) when adding it to the CSV, but it still does not keep its form.
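The quoting idea is sound as far as the CSV format itself goes: a field may contain line breaks as long as the whole field is wrapped in double quotes. Whether UBot's CSV command honours that is not something this sketch can show. It only demonstrates the format's behaviour using Python's standard `csv` module, with a made-up thread URL and made-up links.

```python
import csv
import io

# Hypothetical values: a thread URL and a links field that contains a line break.
thread_url = "http://forum.example/thread-1.html"
links_field = "http://host.example/a.rar\nhttp://host.example/b.rar"

# Write one row with every field quoted, into an in-memory "file".
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow([thread_url, links_field])

# Read the data back: the quoted field keeps its line break and stays in column 2.
buffer.seek(0)
rows = list(csv.reader(buffer))

print(len(rows))      # number of rows read back
print(rows[0][0])     # first column: thread URL
print(rows[0][1])     # second column: both links, still one field
```

If the quotes are missing, or the tool writing the file does not treat them as field delimiters, each line break starts a new row, which would produce exactly the symptom of links landing in the first column.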

I have attached an example image below to show what I mean; any help would be great. The CSV file has two columns: the first is for the thread URL, which is scraped when I navigate to the page, and the second is for the links. However, because of the spaces and line breaks, the links appear in the first column with the thread URL, and usually just the word Code: appears in the links column.

Thanks

http://www.bigseotechniques.com/scraperimage.GIF

pftg4 · July 11, 2010

OK, try this. It works for the first URL you gave in the post. If you need more help, PM me (if it works, of course).

Thx

Pftg4Urls.ubot

Abs* · July 11, 2010

Hi, thanks a lot. It worked great for the first one but not the second. You have given me a great idea with scrape page: there is so much more control using it, and it didn't even strike me to use it. Instead I've been using scrape chosen attribute.

I really like the way you managed to remove the lines. I'm going through it but really can't figure out the entire process.

Would you mind walking me through the coding you have done, especially the part where you use the set and replace commands?

Thanks

Abs* · July 11, 2010

Hi,

I've managed to get it to work so that it scrapes all the URLs and leaves out the word Code:

One issue I still have is with saving to CSV. I have two columns, one for the thread URL and the other for the download links, and when I save to CSV it leaves too many line breaks.
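One way to avoid stray line breaks, sketched here in Python rather than UBot, is to split the scraped text into lines, trim each one, drop anything that is not a link, and write one CSV row per link. The function names `clean_links` and `write_rows` and the sample data are hypothetical, chosen only for this example.

```python
import csv
import io


def clean_links(raw_text):
    """Split scraped text into lines, trim whitespace, and keep only links."""
    cleaned = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://")):
            cleaned.append(line)
    return cleaned


def write_rows(out_file, thread_url, raw_text):
    """Write one (thread URL, link) row per cleaned link."""
    writer = csv.writer(out_file)
    for link in clean_links(raw_text):
        writer.writerow([thread_url, link])


# Hypothetical scraped text with a Code: label, padding and extra blank lines.
raw = "Code:\n\n  http://host.example/a.rar  \n\n\nhttp://host.example/b.rar\n\n"

out = io.StringIO(newline="")
write_rows(out, "http://forum.example/thread-1.html", raw)
csv_text = out.getvalue()
print(csv_text)
```

Repeating the thread URL on every row keeps both columns aligned, and since blank lines are filtered out before writing, none of them reach the file. To write to disk instead of memory, the standard approach is `open(path, "w", newline="")` in place of the `io.StringIO` object.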

I have attached two images: one showing how I am scraping and the other showing the populated CSV file.

Thanks

http://www.bigseotechniques.com/scraperimage1.GIF

http://www.bigseotechniques.com/csvscreenshot.GIF

musiclover2010 · September 9, 2010

Thanks a lot for sharing the instructions.

I adore your help.
