DjProg (April 9, 2016):

Hello guys, I'm seeing some odd behavior when running my multithreaded script here:

```
reset browser
clear cookies
ui drop down("Max threads","1,2,3,4,5,6,7,8,9,10",#max_threads)
ui block text("URLs to crawl",#ui_URLs)
clear list(%urls)
add list to list(%urls,$list from text(#ui_URLs,$new line),"Delete","Global")
set(#url_crawling_position,"-1","Global")
set(#used_threads,0,"Global")
loop($list total(%urls)) {
    loop while($comparison(#used_threads,">= Greater than or equal to",#max_threads)) {
        wait(1)
    }
    loop_process()
}
define loop_process {
    increment(#used_threads)
    increment(#url_crawling_position)
    scraping_procedure()
}
define scraping_procedure {
    thread {
        in new browser {
            set(#navigate_url,$list item(%urls,#url_crawling_position),"Local")
            navigate(#navigate_url,"Wait")
            wait(5)
            decrement(#used_threads)
            log("End crawling>>> {#navigate_url}")
        }
    }
}
```

If I set Max threads to 3 and run with this set of test URLs:

http://google.com
http://amazon.com
http://yahoo.com
http://bing.com
http://ebay.com
http://www.booking.com/
https://www.airbnb.com

it "skips" google.com and amazon.com and instead crawls yahoo.com three times, as you can see in my log:

2016-04-09 13:57:32 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:34 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:38 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:42 [LOG] End crawling>>> http://bing.com
2016-04-09 13:57:54 [LOG] End crawling>>> http://ebay.com
2016-04-09 13:57:54 [LOG] End crawling>>> http://www.booking.com/
2016-04-09 13:58:03 [LOG] End crawling>>> https://www.airbnb.com

Any idea where I went wrong? I really don't see how this can happen, since I'm incrementing #url_crawling_position BEFORE the thread and the navigate, so it shouldn't have the same value three times.

Thanks a lot,
Cheers
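[Editor's note] The symptoms above look like a classic launch-time race: the thread body reads the Global #url_crawling_position only once the thread actually starts running, and by then the main loop may already have incremented it several more times, so several threads see the same (latest) value. A minimal Python sketch that reproduces the same pattern (this is illustrative Python, not UBot; the URL list, sleep timings, and variable names are made up for the demonstration):

```python
import threading
import time

urls = ["http://google.com", "http://amazon.com", "http://yahoo.com",
        "http://bing.com", "http://ebay.com"]

position = -1          # shared counter, like the Global #url_crawling_position
results = []
lock = threading.Lock()

def crawl():
    # The shared index is read only AFTER the thread starts running;
    # by then the launcher loop may have advanced it several times.
    url = urls[position]
    time.sleep(0.01)   # stand-in for navigate + wait
    with lock:
        results.append(url)

threads = []
for _ in urls:
    position += 1                      # increment happens before start(), as in the script
    t = threading.Thread(target=crawl)
    t.start()                          # no pause: thread may read `position` too late
    threads.append(t)

for t in threads:
    t.join()

print(results)   # frequently contains duplicates with some URLs skipped
```

Incrementing before `start()` does not help, because the increment and the thread's read of the shared variable are still unsynchronized: the launcher can loop around and increment again before the new thread gets scheduled.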
DjProg (April 9, 2016, author):

Tested again with 3 threads but a bigger list of test URLs:

http://google.com
http://amazon.com
http://yahoo.com
http://bing.com
http://ebay.com
http://www.booking.com/
https://www.airbnb.com
http://www.alexa.com/
https://login.live.com/
https://uk.linkedin.com/
https://www.mozilla.org/
http://www.apple.com/
http://www.linux.org/

And it's complete havoc:

2016-04-09 14:16:46 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:47 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:49 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:55 [LOG] End crawling>>> http://bing.com
2016-04-09 14:17:04 [LOG] End crawling>>> http://www.booking.com/
2016-04-09 14:17:04 [LOG] End crawling>>> http://ebay.com
2016-04-09 14:17:16 [LOG] End crawling>>> https://www.airbnb.com
2016-04-09 14:17:16 [LOG] End crawling>>> https://login.live.com/
2016-04-09 14:17:17 [LOG] End crawling>>> https://login.live.com/

Tons of URLs are missed and duplicated instead, and it's not just at the "start" of the multithreading. I'd bet I'm not the first one to run into this kind of behavior; any tips?

Thanks
Bot-Factory (April 9, 2016):

After your call to scraping_procedure(), put a wait of 0.5. You need a short pause between launching threads; otherwise UBot can get confused.

Dan
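[Editor's note] The pause works because it gives each new thread time to copy the shared index before the launcher advances it, but it only narrows the race window rather than closing it. A more robust pattern, sketched here in Python rather than UBot, is to snapshot the value at launch time and hand it to the thread as an argument, so the thread never reads the shared counter at all:

```python
import threading
import time

urls = ["http://google.com", "http://amazon.com", "http://yahoo.com"]
results = []
lock = threading.Lock()

def crawl(url):            # the URL is bound at launch time, not read later
    time.sleep(0.01)       # stand-in for navigate + wait
    with lock:
        results.append(url)

threads = []
for url in urls:
    # Passing the value via args snapshots it per thread; there is no
    # shared index to race on, and no sleep between launches is needed.
    t = threading.Thread(target=crawl, args=(url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

print(sorted(results) == sorted(urls))   # every URL crawled exactly once
```

In UBot terms, the analogous idea is to resolve the list item into a value before the thread block starts, so the thread carries its own copy; the details of how to do that depend on UBot's scoping rules, which I have not verified here.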
DjProg (April 9, 2016, author):

Thanks Dan. I had to use 1 second instead of 0.5, but it's working now. (Well, only the multithreading issue is fixed; now the browser hangs and stops responding after visiting a few dozen sites, with the white "forever loading" wheel of death. I've opened a ticket for that, as it doesn't seem normal at all.)

Cheers