UBot Underground

Scraping top-level URL from Yahoo Site Explorer



Hi,

 

I'm trying to make a bot that works out how many different domains link to a particular site, rather than the total number of backlinks. I'm having a hard time figuring out how to scrape a URL from Yahoo Site Explorer and then trim it to root (e.g. bbc.co.uk/sport would end up as bbc.co.uk/).

 

I then need to remove any duplicates from the list - is there any way to do this?

 

I've been searching for a solution for a while so sorry if this has already been posted.

 

Thanks

Link to post
Share on other sites

Exact duplicates? Loop through them, and on each pass use add to list -> $next list item with "Remove duplicates?" set to Yes.

 

But a JavaScript eval function would be needed for that. I don't know any JavaScript, though, so you'd have to wait for someone who does.
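
For anyone who wants to see the duplicate-removal logic spelled out, here is a minimal sketch in Python. It is not UBot code. The function name `remove_duplicates` and the sample list are made up for illustration. It only drops exact duplicates and keeps the first occurrence of each entry.

```python
def remove_duplicates(items):
    """Return items with exact duplicates dropped, keeping first-seen order."""
    seen = set()      # entries we have already kept
    unique = []       # output list, in original order
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


# Hypothetical sample data: trimmed domains scraped from a backlink report.
domains = ["bbc.co.uk/", "example.com/", "bbc.co.uk/", "example.org/", "example.com/"]
unique_domains = remove_duplicates(domains)
print(len(unique_domains), unique_domains)
```

The length of the de-duplicated list is the number of distinct linking domains, provided every URL was trimmed to the same form first.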

Link to post
Share on other sites

Thanks, will try that. The main problem now is scraping just the domain name rather than the entire URL, any suggestions for how to do that?

 

I've tried using wildcards and replace to try and strip the rest of the URL leaving just the domain name but can't get it to work unfortunately.

Link to post
Share on other sites

Take a look at the fix I used on turbolapp's bot here: http://ubotstudio.com/forum/index.php?/topic/3147-keyword-ranking-bot/page__view__findpost__p__10399

 

You should be able to modify the JavaScript to do what you want, but I think you would need a list of every TLD, since it only works for single-dot TLDs (.com, .net, .info, .ca, etc.) and not for international TLDs with multiple dots, like .co.uk.
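
To illustrate the multi-dot TLD problem, here is a rough Python sketch (not the JavaScript from the linked post, and not UBot code). The function name `registrable_domain` and the tiny `MULTI_PART_SUFFIXES` set are assumptions for the example. A real bot would need a far longer suffix list, such as the Public Suffix List.

```python
from urllib.parse import urlsplit

# Illustrative only: a real list of multi-dot suffixes is much longer.
MULTI_PART_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def registrable_domain(url):
    """Reduce a URL to its domain, allowing for suffixes like .co.uk."""
    # urlsplit only finds the host when the URL has a scheme or leading "//".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Keep three labels when the last two form a known multi-dot suffix.
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_PART_SUFFIXES:
        return ".".join(labels[-3:])
    # Otherwise assume a single-dot TLD and keep the last two labels.
    return ".".join(labels[-2:])


print(registrable_domain("http://www.bbc.co.uk/sport"))
print(registrable_domain("https://news.example.com/a/b"))
```

Any suffix missing from the set is treated as a single-dot TLD, which is exactly the failure described above for .co.uk.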

Link to post
Share on other sites

When I trim to root, I have done the following.

 

http://www.domain.co.au/monkeyfaces.cfm

 


  1. First, change // to ::
  2. Then strip /.*
  3. Then change ::: to ://


done.

 

That will also work for http and https, because it doesn't touch that portion of the url if it exists.
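
Here is the same three-step idea as a small Python sketch, for anyone who wants to check the logic outside UBot. The function name `trim_to_root` is made up. Replacing only the first // is my own small tweak, so that a double slash later in the path is left for step 2 to strip.

```python
import re


def trim_to_root(url):
    """Trim a URL to scheme + host using the three replace steps."""
    # Step 1: change the first // to :: so the scheme's slashes survive step 2.
    protected = url.replace("//", "::", 1)
    # Step 2: strip the first remaining / and everything after it.
    stripped = re.sub(r"/.*", "", protected)
    # Step 3: change ::: back to :// (only present if there was a scheme).
    return stripped.replace(":::", "://")


print(trim_to_root("http://www.domain.co.au/monkeyfaces.cfm"))
print(trim_to_root("https://www.domain.co.au/a/b?c=d"))
print(trim_to_root("www.domain.co.au/monkeyfaces.cfm"))
```

Note that this keeps the full host name, so www.domain.co.au and domain.co.au would still count as two different entries when removing duplicates.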

 

Then once you have your domains stripped down, I think UBot has a delete dupes function.

Link to post
Share on other sites

Thanks webautomationlab,

 

I just made a bot with your example... worked great.

 

Here it is:

trim to root.ubot

Link to post
Share on other sites
  • 1 month later...

Hi Bluegoat

 

I've looked over your code and tried to implement it in a bot created and shared by Turbolapp.

 

For the life of me I can't get it to work. The domain is a .co.uk, so I need the code you used, but for some reason it doesn't recognize it, or I'm adding it incorrectly. I'm not sure which.

 

I have added the file below hoping that maybe you or someone else could take a look and then advise.

 

Thanks a lot

googlerankchecker.ubot

Link to post
Share on other sites

Guys, I'm still trying to get this to work and have played with it on a number of occasions. I could really do with some help if anyone wouldn't mind taking some time.

 

thanks

 

abs

Link to post
Share on other sites
