UBot Underground

scrape IP:PORT notation from page, store in list



What is the best way to handle this?

 

I have an account with a provider who posts elite Google proxies twice a day, but he only delivers them as posts in a private forum and will not provide a text file or anything else.

 

I wrote a bot that logs in, finds all posts in the last 12 hours and displays them.

 

Now I want to scan the page and scrape every occurrence of IP:PORT notation.

 

For example: 10.10.10.10:8080

 

How do I scrape each occurrence one by one into a list variable from the current loaded page?

 

This bot will run automatically every 12 hours.


Using regex along with the document text, you can scrape pretty much any website. Add sockets and threads and you can scrape a ton of proxies in no time. I added an example below. Good luck!
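
For anyone who wants to see the idea spelled out: the original attachment was a UBot script and isn't reproduced here, so below is a minimal Python sketch of the same regex-over-page-text approach. The sample page text and the exact pattern are illustrative assumptions, not the poster's code.

import re

# Hypothetical sample of the loaded page text; in UBot this would come
# from the document text of the current page.
page_text = """
Fresh proxies for today:
10.10.10.10:8080 and 192.168.1.5:3128
Duplicate entry: 10.10.10.10:8080
"""

# Match IP:PORT pairs such as 10.10.10.10:8080.
# The \b anchors keep the match from starting or ending mid-number.
proxy_pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{2,5}\b")

proxies = proxy_pattern.findall(page_text)
print(proxies)  # ['10.10.10.10:8080', '192.168.1.5:3128', '10.10.10.10:8080']

Each match can then be appended to a list (and deduplicated) before saving to a file.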

 

Thanks! I will check this out and see if I can make it work.


Thanks, this worked (to some extent, see below) in v3, but in v4 (which is where I am coding this script) it doesn't work. I think find regular expression doesn't work yet or something; I posted about it in the v4 section.

 

I also removed the list functionality completely and just put the find regular expression into the save to file command as the content, and that worked.

 

But I am having very weird results.

 

If I use your code as is, it finds five proxies on this test site I am using:

http://atomintersoft...oxy/proxy-list/

 

With my modified code, where I put your find regular expression node directly into the save to file "content" field, I find 15.

 

I am not sure what is going on.


nm, it looks like it just isn't removing duplicates in the file.
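
For anyone hitting the same count mismatch: the list version deduplicates while the straight save-to-file version does not, so deduplicating the matches before saving should make both counts agree. A minimal Python sketch of that step (not the actual UBot script; the sample data is made up):

# Hypothetical matches scraped from the page; note the repeat.
matches = [
    "10.10.10.10:8080",
    "192.168.1.5:3128",
    "10.10.10.10:8080",
]

# dict.fromkeys removes duplicates while preserving the original order.
unique_proxies = list(dict.fromkeys(matches))

print(len(matches), "raw matches ->", len(unique_proxies), "unique proxies")
# 3 raw matches -> 2 unique proxies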

