HelloInsomnia 1103 Posted December 5, 2018

Website Crawler 3 is a powerful application capable of crawling hundreds of thousands of pages looking for information. V1 users – how to upgrade.

Case Study #1: Powering A Search Engine
Case Study #2: Domain Hunting For SEO

Besides the search engine idea, you could rank the recipes and create a site of only the best recipes (like 4.5 stars or higher). You could create your own rating system based on the star rating, how many people made it, how many reviews, etc. You could also rate the difficulty based on the number of ingredients and how long a recipe takes to make. This would allow you to make a site with only the easiest recipes (and maybe the highest-rated easy recipes).

This is source code. Once you purchase Website Crawler 3 it's yours to modify, skin, sell – do whatever you want with it.

You should have the Pro or Dev version of Ubot Studio. You should also have the Heopas Plugin and the File Management Plugin. Both are free; links are included in the "Read Me First" file.

Features:
- Capable of scraping 100k+ pages
- Multithreaded
- Extremely memory efficient
- Powerful internal URL detection
- Easy to modify to scrape whatever you want
- Saves huge amounts of results without Ubot memory spiking
- Can crawl multiple sites back to back
- Proxy support (including username:password proxies)
- Whitelist filter (must contain) for URL, Title, or Innertext
- Blacklist filter (must not contain) for URL, Title, or Innertext
- Both filters can be used at once
- Scrape by regex – no code modification needed
- Built-in regular expressions for emails, phone numbers, files, and proxies, all modifiable
- Option to remove the query string (the http://website.com/?this_is=the_query part)
- Powerful built-in extension filter that skips pages that can cause problems; you can modify it right in the UI
- Set a custom user agent if you want
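The query-removal option in the feature list boils down to splitting a URL and dropping everything from the ? onward. A minimal Python sketch of that idea (the crawler itself is Ubot Studio code; the function name strip_query here is purely illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop the query string (and fragment) from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

strip_query("http://website.com/?this_is=the_query")
# -> "http://website.com/"
```

Stripping the query is mainly useful for deduplication: many sites serve the same page under dozens of tracking-parameter variations.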
drewness 26 Posted December 5, 2018

Excellent work as always. As a user of v2, I've tested v3 and it's a massive upgrade. It's smoother and more efficient, and I love the switch to a DB vs. files for storing data. If you don't have it already, it's definitely worth picking up – the potential you have with it is unlimited, plus you can learn a lot just from looking at the source code. Highly recommended! Thanks for putting out such a stellar engine; it's a huge time saver for new projects.

A feature request: would it be possible to add stop/pause/resume, so you can temporarily pause the crawler and resume, or shut it down and open it back up later and pick up where it left off? I'm sure that would be tricky, but it would be a powerful feature to add. If not, no problem – thanks either way!
HelloInsomnia 1103 Posted December 5, 2018 Author

Quoting drewness: "Would it be possible to add stop/pause/resume so you can temporarily pause it and resume, or shut it down and open it back up later and pick up where it left off?"

Yes, it's possible. I knew this was going to be a feature request, so I tried to make sure it would be something that could be done. Right now it's not set up for it, but I will try to get this into one of the first couple of updates.
HelloInsomnia 1103 Posted December 8, 2018 Author

Update: V 3.1
Added: "Exclude Subdomains" option, which lets you ignore subdomains while crawling
Fixed: Bug where a redirect to a foreign domain could cause crawling of that domain
Fixed: Bug where commas would create extra columns in the output CSV file

Login here to download: https://elitebotters.com/my-account/
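The "Exclude Subdomains" option and the foreign-domain redirect fix both come down to checking whether a crawled URL's host still belongs to the site being crawled. A rough Python sketch of that check (the actual crawler is Ubot code; same_site is a hypothetical name, and a production version would also need a public-suffix list to handle TLDs like .co.uk correctly):

```python
def same_site(host, root, include_subdomains=True):
    """Return True if host belongs to the site rooted at root.

    With include_subdomains=False (the "Exclude Subdomains" behavior),
    only an exact host match counts. The "." prefix in the suffix check
    prevents look-alike hosts such as evil-example.com from matching
    example.com.
    """
    if host == root:
        return True
    return include_subdomains and host.endswith("." + root)
```

Running this check on the host of every redirect target, not just every extracted link, is what stops a redirect to a foreign domain from pulling the crawler off-site.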
HelloInsomnia 1103 Posted December 14, 2018 Author

Update: V 3.2
Added: Proxy checker (separate program)
Added: Images regex for popular image formats
Updated: Files regex to include many more file formats
Fixed: Email regex would sometimes pick up images

Login here to download: https://elitebotters.com/my-account/

The proxy checker is just a little script that tells you what your public IP address is and whether the proxy returns an error – useful for seeing if your proxies will work with the program. It comes as both source and a compiled version, since it uses the Datagrid plugin.
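The checker described above can be approximated in a few lines of standard-library Python. This is a hypothetical sketch, not the shipped Ubot script; api.ipify.org stands in for whatever IP-echo service the real tool uses, and proxy_url/public_ip are illustrative names:

```python
import urllib.request

def proxy_url(proxy):
    """Turn 'host:port' or 'user:pass@host:port' into a proxy URL string."""
    return "http://" + proxy

def public_ip(proxy=None, timeout=10):
    """Fetch the public IP seen by an echo service, optionally via a proxy.

    Raises urllib.error.URLError if the proxy is dead or refuses the
    request, which is exactly the "does this proxy work" signal you want.
    """
    handlers = []
    if proxy:
        p = proxy_url(proxy)
        handlers.append(urllib.request.ProxyHandler({"http": p, "https": p}))
    opener = urllib.request.build_opener(*handlers)
    with opener.open("https://api.ipify.org", timeout=timeout) as resp:
        return resp.read().decode()
```

If the IP returned through the proxy differs from your own, the proxy is anonymizing as expected; an exception means it won't work with the crawler either.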
HelloInsomnia 1103 Posted December 19, 2018 Author

Update: V 3.3
Added: HTML decoding of pages
Removed: URLs to lowercase
Removed: Decode internal URLs

The reason I removed those two bits of URL normalization is because most sites do not throw in any kind of random capitalization, and in some edge cases capitalization actually matters. So making the URLs all lowercase was making it harder to scrape (some) sites. Decoding URLs also doesn't offer enough benefit, I feel, and it can cause some errors.

Login here to download: https://elitebotters.com/my-account/
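To illustrate why blanket lowercasing can break crawling: the scheme and host of a URL are case-insensitive, but the path often is not (many servers treat /Recipes and /recipes as different pages). A hypothetical normalize function in Python that lowercases only the safe parts (illustrative only, not the crawler's actual code):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase only the scheme and host; leave path/query untouched,
    since paths can be case-sensitive on the server."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

normalize("HTTP://Example.com/Recipes/Apple-Pie")
# -> "http://example.com/Recipes/Apple-Pie"
```

Lowercasing the whole URL would have rewritten that path to /recipes/apple-pie, which on a case-sensitive site is a 404 – the edge case described above.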
afkratien 12 Posted December 20, 2018

What is "HTML decoding of pages"? Thanks.
HelloInsomnia 1103 Posted December 20, 2018 Author

Quoting afkratien: "What is HTML decoding of pages?"

It just takes the page and turns HTML-encoded characters into normal-looking ones. An example would be turning &amp;amp; into just & or turning &amp;nbsp; into a space.
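A quick illustration of the same idea using Python's standard library (the crawler itself does this in Ubot code; this just shows what "HTML decoding" means in practice):

```python
import html

# HTML entities in scraped text become their literal characters:
print(html.unescape("Fish &amp; Chips"))  # -> Fish & Chips
print(html.unescape("4.5&nbsp;stars"))    # &nbsp; becomes a non-breaking space
```

Without this step, scraped titles and innertext end up littered with raw entities, which also breaks the whitelist/blacklist "must contain" filters.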