HelloInsomnia 1103 Posted December 5, 2018

Website Crawler 3 is a powerful application capable of crawling hundreds of thousands of pages looking for information. V1 users – how to upgrade.

Case Study #1: Powering A Search Engine
Case Study #2: Domain Hunting For SEO

Besides the search engine idea, you could rank the recipes and create a site of only the best recipes (like 4.5 stars or higher). You could create your own rating system based on the star rating, how many people made it, how many reviews, etc. You could also rate the difficulty based on the number of ingredients and how long a recipe takes to make. This would allow you to make a site with only the easiest recipes (and maybe the highest-rated easy recipes).

This is source code. Once you purchase Website Crawler 3 it's yours to modify, skin, sell – do whatever you want with it.

You should have the Pro or Dev version of Ubot Studio. You should also have the Heopas Plugin and the File Management Plugin. Both are free; links are included in the "Read Me First" file.

Features:
- Capable of scraping 100k+ pages
- Multithreaded
- Extremely memory efficient
- Powerful internal URL detection
- Easy to modify to scrape whatever you want
- Saves huge amounts of results without Ubot memory spiking
- Can crawl multiple sites back to back
- Proxy support (including username:password proxies)
- Whitelist filter (must contain) for URL, Title, or Innertext
- Blacklist filter (must not contain) for URL, Title, or Innertext
- Both filters can be used at once
- Scrape by regex – no code modification needed
- Built-in regular expressions for emails, phone numbers, files, and proxies, all modifiable
- Option to remove the query string (the http://website.com/?this_is=the_query part)
- Powerful built-in extension filter that skips pages that can cause problems; you can modify it right in the UI
- Set a custom user agent if you want
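The query-removal option in the feature list boils down to splitting a URL and dropping everything from the ? onward. A minimal Python sketch of that idea (the crawler itself is Ubot Studio code; the function name strip_query here is purely illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string (and fragment) from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

strip_query("http://website.com/?this_is=the_query")
# -> "http://website.com/"
```

Stripping the query is mainly useful for deduplication: many sites serve the same page under dozens of tracking-parameter variations.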
drewness 26 Posted December 5, 2018

Excellent work as always. As a user of v2, I've tested v3 and it's a massive upgrade. It's smoother and more efficient, and I love the switch to a DB vs. files for storing data. If you don't have it already, it's definitely worth picking up – the potential you have with it is unlimited, plus you can learn a lot just from looking at the source code. Highly recommended! Thanks for putting out such a stellar engine; it's a huge time saver for new projects.

A feature request: would it be possible to add stop/pause/resume, so you can temporarily pause the crawler and resume, or shut it down and open it back up later and pick up where it left off? I'm sure that would be tricky, but it would be a powerful feature to add. If not, no problem – thanks either way!
HelloInsomnia 1103 Posted December 5, 2018 Author

Quoting drewness: "Would it be possible to add stop/pause/resume so you can temporarily pause it and resume, or shut it down and open it back up later and pick up where it left off?"

Yes, it's possible. I knew this was going to be a feature request, so I tried to make sure it would be something that could be done. Right now it's not set up for it, but I will try to get this into one of the first couple of updates.
HelloInsomnia 1103 Posted December 8, 2018 Author

Update: V 3.1
Added: "Exclude Subdomains" option, which lets you ignore subdomains while crawling
Fixed: Bug where a redirect to a foreign domain could cause crawling of that domain
Fixed: Bug where commas would create extra columns in the output CSV file

Login here to download: https://elitebotters.com/my-account/
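The "Exclude Subdomains" option and the foreign-domain redirect fix both come down to checking whether a crawled URL's host still belongs to the site being crawled. A rough Python sketch of that check (the actual crawler is Ubot code; same_site is a hypothetical name, and a production version would also need a public-suffix list to handle TLDs like .co.uk correctly):

```python
def same_site(host, root, include_subdomains=True):
    """Return True if host belongs to the site rooted at root.

    With include_subdomains=False (the "Exclude Subdomains" behavior),
    only an exact host match counts. The "." prefix in the suffix check
    prevents look-alike hosts such as evil-example.com from matching
    example.com.
    """
    if host == root:
        return True
    return include_subdomains and host.endswith("." + root)
```

Running this check on the host of every redirect target, not just every extracted link, is what stops a redirect to a foreign domain from pulling the crawler off-site.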
HelloInsomnia 1103 Posted December 14, 2018 Author

Update: V 3.2
Added: Proxy checker (separate program)
Added: Images regex for popular image formats
Updated: Files regex to include many more file formats
Fixed: Email regex would sometimes pick up images

Login here to download: https://elitebotters.com/my-account/

The proxy checker is just a little script that tells you what your public IP address is and whether the proxy returns an error – useful for seeing if your proxies will work with the program. It comes as both source and a compiled version, since it uses the Datagrid plugin.
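The checker described above can be approximated in a few lines of standard-library Python. This is a hypothetical sketch, not the shipped Ubot script; api.ipify.org stands in for whatever IP-echo service the real tool uses, and proxy_url/public_ip are illustrative names:

```python
import urllib.request

def proxy_url(proxy):
    """Turn 'host:port' or 'user:pass@host:port' into a proxy URL string."""
    return "http://" + proxy

def public_ip(proxy=None, timeout=10):
    """Fetch the public IP seen by an echo service, optionally via a proxy.

    Raises urllib.error.URLError if the proxy is dead or refuses the
    request, which is exactly the "does this proxy work" signal you want.
    """
    handlers = []
    if proxy:
        p = proxy_url(proxy)
        handlers.append(urllib.request.ProxyHandler({"http": p, "https": p}))
    opener = urllib.request.build_opener(*handlers)
    with opener.open("https://api.ipify.org", timeout=timeout) as resp:
        return resp.read().decode()
```

If the IP returned through the proxy differs from your own, the proxy is anonymizing as expected; an exception means it won't work with the crawler either.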
HelloInsomnia 1103 Posted December 19, 2018 Author

Update: V 3.3
Added: HTML decoding of pages
Removed: URLs to lowercase
Removed: Decode internal URLs

The reason I removed those two bits of URL normalization is because most sites do not throw in any kind of random capitalization, and in some edge cases capitalization actually matters. So making the URLs all lowercase was making it harder to scrape (some) sites. Decoding URLs also doesn't offer enough benefit, I feel, and it can cause some errors.

Login here to download: https://elitebotters.com/my-account/
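To illustrate why blanket lowercasing can break crawling: the scheme and host of a URL are case-insensitive, but the path often is not (many servers treat /Recipes and /recipes as different pages). A hypothetical normalize function in Python that lowercases only the safe parts (illustrative only, not the crawler's actual code):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase only the scheme and host; leave path/query untouched,
    since paths can be case-sensitive on the server."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

normalize("HTTP://Example.com/Recipes/Apple-Pie")
# -> "http://example.com/Recipes/Apple-Pie"
```

Lowercasing the whole URL would have rewritten that path to /recipes/apple-pie, which on a case-sensitive site is a 404 – the edge case described above.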
afkratien 12 Posted December 20, 2018

What is "HTML decoding of pages"? Thanks.
HelloInsomnia 1103 Posted December 20, 2018 Author

Quoting afkratien: "What is HTML decoding of pages?"

It just takes the page and turns HTML-encoded characters into normal-looking ones. An example would be turning &amp;amp; into just & or turning &amp;nbsp; into a space.
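A quick illustration of the same idea using Python's standard library (the crawler itself does this in Ubot code; this just shows what "HTML decoding" means in practice):

```python
import html

# HTML entities in scraped text become their literal characters:
print(html.unescape("Fish &amp; Chips"))  # -> Fish & Chips
print(html.unescape("4.5&nbsp;stars"))    # &nbsp; becomes a non-breaking space
```

Without this step, scraped titles and innertext end up littered with raw entities, which also breaks the whitelist/blacklist "must contain" filters.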