Website Crawler 3 - An Efficient Web Crawler For Ubot


4 replies to this topic

#1 HelloInsomnia

HelloInsomnia

    Advanced Member

  • Moderators
  • 2977 posts
  • OS:Windows 10
  • Total Memory:More Than 9Gb
  • Framework:v4.5+, unsure
  • License:Developer Edition

Posted 04 December 2018 - 11:31 PM

Website Crawler 3 is a powerful application capable of crawling hundreds of thousands of pages looking for information.

 

 


 

V1 users – how to upgrade.

 

Case Study #1 Powering A Search Engine:

 

 


 

Case Study #2 Domain Hunting For SEO

 

 

Besides the search engine idea, you could rank the recipes and create a site of only the best ones (say, 4.5 stars or higher). You could create your own rating system based on the star rating, how many people made the recipe, how many reviews it has, and so on. You could also rate the difficulty based on the number of ingredients and how long the recipe takes to make. This would allow you to build a site with only the easiest recipes (or maybe the highest rated easy recipes).

This is source code. Once you purchase Website Crawler 3 it’s yours to modify, skin, sell – do whatever you want with it.

You should have the Pro or Dev version of Ubot Studio. You should also have the Heopas Plugin and File Management Plugin. Both are free. Links are included in the “Read Me First” file.

  • Capable of scraping 100k+ pages
  • Multithreaded
  • Extremely memory efficient
  • Powerful internal URL detection
  • Easy to modify to scrape whatever you want!
  • Ability to save huge amounts of results without Ubot memory spiking
  • Ability to crawl multiple sites back to back
  • Add in proxies (supports username:password proxies too!)
  • Whitelist filter (must contain) for URL, Title, or Innertext
  • Blacklist filter (must not contain) for URL, Title, or Innertext
  • Ability to use both filters at once!
  • Scrape by regex; you don’t even have to modify the code to use this
  • Built-in regular expressions that you can modify for Emails, Phone Numbers, Files, Proxies
  • Ability to remove the query string if you want (the part after the ? in http://website.com/?this_is=the_query)
  • Powerful built-in extension filter that skips pages that can cause problems; you can modify it right in the UI
  • Set a custom user agent if you want
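
To make the filtering features above concrete, here is a minimal Python sketch of how a crawler might decide whether to follow a URL. It is not the Website Crawler 3 source (which is a Ubot script and is not shown in this thread). The function names, the extension list, and the choice that a whitelist passes when any one term matches are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Hypothetical defaults; the product's real extension list is not shown in the thread.
SKIPPED_EXTENSIONS = {".jpg", ".png", ".gif", ".pdf", ".zip"}


def strip_query(url):
    """Drop the ?query and #fragment parts of a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_internal(url, site_host):
    """True if the URL is on site_host or on one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == site_host or host.endswith("." + site_host)


def has_skipped_extension(url):
    """True if the URL path ends in an extension we do not want to fetch."""
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    return ext in SKIPPED_EXTENSIONS


def passes_filters(text, whitelist=(), blacklist=()):
    """Whitelist: text must contain at least one term (assumed semantics).
    Blacklist: text must contain none of the terms."""
    text = text.lower()
    if whitelist and not any(term.lower() in text for term in whitelist):
        return False
    return not any(term.lower() in text for term in blacklist)


def should_crawl(url, site_host, whitelist=(), blacklist=()):
    """Combine the internal-URL check, extension filter, and both text filters."""
    return (
        is_internal(url, site_host)
        and not has_skipped_extension(url)
        and passes_filters(url, whitelist, blacklist)
    )
```

For example, strip_query("http://website.com/?this_is=the_query") returns "http://website.com/". The same passes_filters function could be applied to a page's title or inner text instead of its URL.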




#2 drewness

drewness

    Advanced Member

  • Members
  • 112 posts
  • OS:Windows 10
  • Total Memory:More Than 9Gb
  • Framework:v4.5+, unsure
  • License:Developer Edition

Posted 05 December 2018 - 05:42 AM

Excellent work as always. As a user of v2, I've tested v3 and it's a massive upgrade. It's smoother and more efficient, and I love the switch to a DB vs files for storing data.

If you don't have it already, it's definitely worth picking up. The potential you have with it is unlimited, plus you can learn a lot just from looking at the source code.

Highly recommended! Thanks for putting out such a stellar engine, it's a huge time saver for new projects.



A Feature Request: would it be possible to add stop/pause/resume so you can temporarily pause it and resume, or shut it down and open it back up later and pick up where it left off. I'm sure that would be tricky, but would be a powerful feature to add. If not no problem, thanks either way!


#3 HelloInsomnia


Posted 05 December 2018 - 12:28 PM

A Feature Request: would it be possible to add stop/pause/resume so you can temporarily pause it and resume, or shut it down and open it back up later and pick up where it left off. I'm sure that would be tricky, but would be a powerful feature to add. If not no problem, thanks either way!

 

Yes, it's possible. I knew this was going to be a feature request, so I tried to make sure it would be something that could be done. Right now it's not set up for it, but I will try to get this into one of the first couple of updates.
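
As a rough illustration of why storing crawl state in a database makes pause/resume feasible, here is a minimal Python sketch using SQLite. It is only an assumption about how such a feature could work, not the author's design: the thread does not say which database Website Crawler 3 uses, and the table layout and function names below are hypothetical.

```python
import sqlite3


def open_queue(path):
    """Open (or create) a crawl queue that survives program restarts."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def enqueue(db, url):
    """Add a URL once; URLs already in the queue are ignored."""
    db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    db.commit()


def next_pending(db):
    """Return the oldest URL not yet crawled, or None when finished."""
    row = db.execute(
        "SELECT url FROM queue WHERE done = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_done(db, url):
    """Record that a URL has been crawled so a resumed run skips it."""
    db.execute("UPDATE queue SET done = 1 WHERE url = ?", (url,))
    db.commit()
```

Because every state change is committed to disk, a crawl that is stopped can be resumed by calling open_queue on the same file and continuing from next_pending.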



#4 HelloInsomnia


Posted 07 December 2018 - 11:56 PM

Update:

 

V 3.1
  • Added: "Exclude Subdomains" option allows you to ignore subdomains while crawling
  • Fixed: Bug where a redirect to a foreign domain could cause crawling of that domain
  • Fixed: Bug where commas would create extra columns in output CSV file
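
For readers wondering how commas end up creating extra CSV columns, the usual remedy is to quote any field that contains a comma. The following Python sketch shows that general technique with the standard csv module. It is an illustration only; the thread does not say how the fix was made in the Ubot source, and the rows_to_csv name is hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting fields that contain commas or quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))
```

A page title such as "Pies, Cakes and More" is written inside double quotes, so reading the file back yields two columns (URL and title) rather than three.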

Login here to download: https://elitebotters.com/my-account/



#5 HelloInsomnia


Posted 14 December 2018 - 02:58 PM

Update:

 

V 3.2
  • Added: Proxy checker (separate program)
  • Added: Images regex for popular image formats
  • Updated: Files regex to include many file formats
  • Fixed: Email regex would sometimes pick up images
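
One common reason an email regex picks up images is that file names such as logo@2x.png have the same shape as an address. Below is a minimal Python sketch of one way to guard against that. The pattern, the extension list, and the extract_emails name are assumptions for illustration; the actual regex shipped with Website Crawler 3 is not shown in this thread.

```python
import re

# A simple, deliberately loose email pattern (an assumption, not the product's regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical list of image extensions to reject.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def extract_emails(html):
    """Return unique email-like matches, skipping ones that are image file names."""
    found = []
    for match in EMAIL_RE.findall(html):
        if match.lower().endswith(IMAGE_EXTENSIONS):
            continue  # e.g. "logo@2x.png" looks like an address but is a file
        if match not in found:
            found.append(match)
    return found
```

Filtering after the match keeps the regex itself easy to edit, which fits a tool where users are expected to modify the built-in patterns.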

Login here to download: https://elitebotters.com/my-account/

 

The proxy checker is a small script that tells you what your public IP address is and whether the proxy returns an error, so you can see if your proxies will work with the program. It comes both as source and compiled, since it uses the Datagrid plugin.





