
Website Crawler 3 - An Efficient Web Crawler For Ubot



Website Crawler 3 is a powerful application capable of crawling hundreds of thousands of pages looking for information.

 

 


 

V1 users – how to upgrade.

 

Case Study #1 Powering A Search Engine:

 

 


 

Case Study #2 Domain Hunting For SEO:

 

 

Besides the search engine idea, you could rank the recipes and build a site of only the best ones (say, 4.5 stars or higher). You could create your own rating system based on the star rating, how many people made the recipe, how many reviews it has, and so on. You could also rate difficulty by the number of ingredients and how long the recipe takes to make, which would let you build a site of only the easiest recipes (or the highest-rated easy ones).
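
If I were sketching that rating idea in code, it might look like the following Python (purely illustrative – the field names, weights, and thresholds are my own assumptions, not anything the crawler outputs):

    # Illustrative scoring sketch - field names and weights are assumptions.
    def quality_score(recipe):
        # Weight the star rating by review count so a 5.0 with 3 reviews
        # doesn't outrank a 4.7 with 900 reviews.
        confidence = min(recipe["review_count"] / 100.0, 1.0)
        popularity = min(recipe["times_made"] / 1000.0, 1.0)
        return recipe["stars"] * confidence + 0.5 * popularity

    def difficulty_score(recipe):
        # Fewer ingredients and less total time = easier.
        return recipe["ingredient_count"] + recipe["total_minutes"] / 15.0

    recipes = [
        {"stars": 4.8, "review_count": 900, "times_made": 2500,
         "ingredient_count": 6, "total_minutes": 30},
        {"stars": 5.0, "review_count": 3, "times_made": 4,
         "ingredient_count": 14, "total_minutes": 120},
    ]

    # Easiest recipes first; quality breaks ties among the 4.5+ set.
    easy_and_good = sorted(
        (r for r in recipes if r["stars"] >= 4.5),
        key=lambda r: (difficulty_score(r), -quality_score(r)),
    )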

This is source code. Once you purchase Website Crawler 3, it’s yours to modify, skin, sell – do whatever you want with it.

You’ll need the Pro or Dev version of Ubot Studio, plus the Heopas Plugin and the File Management Plugin. Both plugins are free; links are included in the “Read Me First” file.

  • Capable of scraping 100k+ pages
  • Multithreaded
  • Extremely memory efficient
  • Powerful internal URL detection
  • Easy to modify to scrape whatever you want!
  • Ability to save huge amounts of results without Ubot memory spiking
  • Ability to crawl multiple sites back to back
  • Add in proxies (supports username:password proxies too!)
  • Whitelist filter (must contain) for URL, Title, or Innertext
  • Blacklist filter (must not contain) for URL, Title, or Innertext
  • Ability to use both filters at once! (see the sketch after this list)
  • Scrape by regex – you don’t even have to modify the code to use this
  • Built-in regular expressions that you can modify for Emails, Phone Numbers, Files, Proxies
  • Ability to strip the query string if you want (the ?this_is=the_query part of http://website.com/?this_is=the_query)
  • Powerful built-in extension filter that skips pages which can cause problems; you can modify it right in the UI
  • Set a custom user agent if you want
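
To give a feel for what the filters and query stripping do, here is a rough Python equivalent (a sketch only – the crawler itself is Ubot source code, and these function names are made up):

    from urllib.parse import urlsplit, urlunsplit

    def passes_filters(url, title, innertext, whitelist, blacklist):
        haystacks = (url, title, innertext)
        # Whitelist: at least one required term must appear somewhere.
        if whitelist and not any(t in h for t in whitelist for h in haystacks):
            return False
        # Blacklist: no banned term may appear anywhere.
        return not any(t in h for t in blacklist for h in haystacks)

    def strip_query(url):
        # Drops the ?query (and any #fragment) from a URL.
        s = urlsplit(url)
        return urlunsplit((s.scheme, s.netloc, s.path, "", ""))

    print(strip_query("http://website.com/?this_is=the_query"))  # http://website.com/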



Excellent work as always. As a v2 user, I’ve tested v3 and it’s a massive upgrade: smoother, more efficient, and I love the switch to a database instead of flat files for storing data.

 

If you don't have it already, it's definitely worth picking up – the potential is unlimited, plus you can learn a lot just from reading the source code.

 

Highly recommended! Thanks for putting out such a stellar engine, it's a huge time saver for new projects.

 

 

 

A feature request: would it be possible to add stop/pause/resume, so you can temporarily pause it and resume, or shut it down and open it back up later and pick up where it left off? I'm sure that would be tricky, but it would be a powerful feature to add. If not, no problem – thanks either way!

 



A feature request: would it be possible to add stop/pause/resume, so you can temporarily pause it and resume, or shut it down and open it back up later and pick up where it left off? I'm sure that would be tricky, but it would be a powerful feature to add. If not, no problem – thanks either way!

 

Yes, it's possible. I knew this was going to be a feature request, so I tried to make sure it would be something that could be done. Right now it's not set up for it, but I'll try to get it into one of the first couple of updates.
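
For the curious, the rough idea would be to persist the crawl state (the URL queue plus the visited set) and reload it on startup. A minimal Python sketch of that approach – not the actual Ubot implementation, and the filename is made up:

    import json, os

    STATE_FILE = "crawl_state.json"  # hypothetical filename

    def save_state(frontier, visited):
        # Called periodically, e.g. after each batch of pages.
        with open(STATE_FILE, "w") as f:
            json.dump({"frontier": list(frontier), "visited": list(visited)}, f)

    def load_state(start_url):
        # On startup: resume if a saved state exists, otherwise start fresh.
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                state = json.load(f)
            return state["frontier"], set(state["visited"])
        return [start_url], set()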


Update:

 

V 3.1
  • Added: "Exclude Subdomains" option allows you to ignore subdomains while crawling (see the sketch below)
  • Fixed: Bug where a redirect to a foreign domain could cause crawling of that domain
  • Fixed: Bug where commas would create extra columns in output CSV file
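
For anyone wondering what "Exclude Subdomains" means in practice, the check is along these lines (a Python sketch of the idea, not the shipped Ubot code); the redirect fix amounts to re-running this kind of check on the final URL after redirects:

    from urllib.parse import urlsplit

    def same_site(url, root_host, exclude_subdomains):
        host = urlsplit(url).hostname or ""
        if exclude_subdomains:
            return host == root_host  # exact host only
        return host == root_host or host.endswith("." + root_host)

    print(same_site("http://blog.example.com/x", "example.com", True))   # False
    print(same_site("http://blog.example.com/x", "example.com", False))  # True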

Login here to download: https://elitebotters.com/my-account/


Update:


 


V 3.2

  • Added: Proxy checker (separate program)
  • Added: Images regex for popular image formats
  • Updated: Files regex to include many file formats
  • Fixed: Email regex would sometimes pick up images (see the sketch below)
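
To illustrate the email regex fix: image URLs like logo@2x.png look like email addresses to a naive pattern, so one way to fix it is to reject matches whose "domain" ends in an image extension. A Python sketch of that kind of pattern – a guess at the approach, not the exact regex shipped:

    import re

    EMAIL_RE = re.compile(
        r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(?!png\b|jpe?g\b|gif\b|webp\b)[A-Za-z]{2,}"
    )

    text = "write to me@example.com, not to logo@2x.png"
    print(EMAIL_RE.findall(text))  # ['me@example.com']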

Login here to download: https://elitebotters.com/my-account/


 


The proxy checker is a small script that reports your public IP address as seen through each proxy and whether the proxy returns an error, so you can verify that your proxies will work with the program. It comes as both source and a compiled build, since it uses the Datagrid plugin.
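
In other words, it does the equivalent of this Python sketch (the bundled tool is an Ubot script, not this code): request a what's-my-IP service through each proxy and report either the IP or the error.

    import requests

    def check_proxy(proxy):
        # proxy is "host:port" or "user:pass@host:port"
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            ip = requests.get("https://api.ipify.org",
                              proxies=proxies, timeout=10).text
            return f"{proxy} -> OK, public IP {ip}"
        except requests.RequestException as e:
            return f"{proxy} -> ERROR: {e}"

    for p in ["1.2.3.4:8080"]:  # replace with your proxy list
        print(check_proxy(p))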



Update:


 


V 3.3

  • Added: HTML decoding of pages
  • Removed: URLs to lowercase
  • Removed: Decode internal URLs

I removed those bits of URL normalization because most sites don't throw in random capitalization, and in some edge cases capitalization actually matters – so lowercasing every URL was making (some) sites harder to scrape. Decoding internal URLs doesn't offer enough benefit either, and it can cause errors.
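
For clarity, "HTML decoding of pages" means entity references in the fetched HTML are turned back into plain characters before the scrape runs. The Python equivalent of the idea:

    import html

    raw = "Fish &amp; Chips &#8211; 30&nbsp;min"
    print(html.unescape(raw))  # Fish & Chips – 30 min (with a non-breaking space)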


 


Login here to download: https://elitebotters.com/my-account/


