UBot Underground

Multithread Funky Behavior When Starting...



Hello guys,

 

I'm seeing some funky behavior when running my multithreaded script:

reset browser
clear cookies
ui drop down("Max threads","1,2,3,4,5,6,7,8,9,10",#max_threads)
ui block text("URLs to crawl",#ui_URLs)
clear list(%urls)
add list to list(%urls,$list from text(#ui_URLs,$new line),"Delete","Global")
set(#url_crawling_position,"-1","Global")
set(#used_threads,0,"Global")
loop($list total(%urls)) {
    loop while($comparison(#used_threads,">= Greater than or equal to",#max_threads)) {
        wait(1)
    }
    loop_process()
}
define loop_process {
    increment(#used_threads)
    increment(#url_crawling_position)
    scraping_procedure()
}
define scraping_procedure {
    thread {
        in new browser {
            set(#navigate_url,$list item(%urls,#url_crawling_position),"Local")
            navigate(#navigate_url,"Wait")
            wait(5)
            decrement(#used_threads)
            log("End crawling>>> {#navigate_url}")
        }
    }
}

If I set Max threads to 3 and run with this set of test URLs:

 

 
It "skips" google.com and amazon.com and crawls yahoo.com 3 times instead, as you can see in my log:
 
2016-04-09 13:57:32 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:34 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:38 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:42 [LOG] End crawling>>> http://bing.com
2016-04-09 13:57:54 [LOG] End crawling>>> http://ebay.com
2016-04-09 13:57:54 [LOG] End crawling>>> http://www.booking.com/
2016-04-09 13:58:03 [LOG] End crawling>>> https://www.airbnb.com
 
Any idea where I screwed up? I really don't see how this can happen, as I'm incrementing #url_crawling_position BEFORE the thread and the navigate, so it shouldn't have the same value 3 times.
 
Thanks a lot,
 
Cheers,
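
The log is consistent with a race on the shared #url_crawling_position: the main loop increments it before each thread starts, but each thread only reads it once its browser is up, by which point the loop has already launched the other threads. The following is a minimal Python sketch of that timing, not UBot code. The function name, the `browser_ready` event (standing in for slow browser startup), and the three-URL list are assumptions made for illustration.

```python
import threading

def crawl_with_shared_index(urls):
    """Launch one thread per URL; each thread reads a shared index late."""
    state = {"position": -1}           # plays the role of #url_crawling_position
    browser_ready = threading.Event()  # assumption: stands in for slow browser startup
    crawled = []
    lock = threading.Lock()

    def worker():
        browser_ready.wait()           # the launcher keeps looping while we "start up"
        with lock:
            # The shared index is read only now, after the loop has moved on.
            crawled.append(urls[state["position"]])

    threads = []
    for _ in urls:
        state["position"] += 1         # incremented before the thread starts...
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)

    browser_ready.set()                # ...but read after all increments are done
    for t in threads:
        t.join()
    return crawled

print(crawl_with_shared_index(
    ["http://google.com", "http://amazon.com", "http://yahoo.com"]))
# prints ['http://yahoo.com', 'http://yahoo.com', 'http://yahoo.com']
```

With Max threads at 3, the first three threads are launched back to back, so all three see the index of the third URL, which matches the three yahoo.com entries in the log.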

Tested again with 3 threads but a bigger list of test URLs:

 

 
And it's complete havoc:
 
2016-04-09 14:16:46 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:47 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:49 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:55 [LOG] End crawling>>> http://bing.com
2016-04-09 14:17:04 [LOG] End crawling>>> http://www.booking.com/
2016-04-09 14:17:04 [LOG] End crawling>>> http://ebay.com
2016-04-09 14:17:16 [LOG] End crawling>>> https://www.airbnb.com
2016-04-09 14:17:16 [LOG] End crawling>>> https://login.live.com/
2016-04-09 14:17:17 [LOG] End crawling>>> https://login.live.com/
 
Tons of missed URLs, with duplicates in their place, and not only at the "start" of the multithreading.
 
I'd bet I'm not the first one to run into this kind of behavior. Any tips?
 
Thanks,

Thanks Dan.

 

I had to use 1 second instead of 0.5, but it's working now. Well, only the multithread issue is fixed: now the browser hangs / stops responding after visiting a few dozen sites ("the white browser forever-loading wheel of death"). I've opened a ticket for this, as it doesn't seem normal at all.

 

Cheers,
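
A wait presumably helps because it gives each new thread time to read the shared #url_crawling_position before the main loop increments it again, which would also explain why 0.5 seconds was not always enough. A less timing-dependent option is to give each thread its own copy of the URL at launch time. The following is a minimal Python sketch of that idea, not UBot code. The function name and the `browser_ready` event (standing in for slow browser startup) are assumptions made for illustration. In UBot the rough equivalent would be passing the URL into the define as a parameter, which is untested here.

```python
import threading

def crawl_with_private_copy(urls):
    """Launch one thread per URL; each thread gets its URL as an argument."""
    browser_ready = threading.Event()  # assumption: stands in for slow browser startup
    crawled = []
    lock = threading.Lock()

    def worker(my_url):
        browser_ready.wait()           # startup delay no longer matters
        with lock:
            crawled.append(my_url)     # private copy, fixed at launch time

    threads = []
    for position in range(len(urls)):
        # The URL is looked up here, in the launcher, not later inside the thread.
        t = threading.Thread(target=worker, args=(urls[position],))
        t.start()
        threads.append(t)

    browser_ready.set()
    for t in threads:
        t.join()
    return sorted(crawled)             # sorted because finish order is not guaranteed

print(crawl_with_private_copy(
    ["http://google.com", "http://amazon.com", "http://yahoo.com"]))
# prints ['http://amazon.com', 'http://google.com', 'http://yahoo.com']
```

Because each thread carries its own value, every URL is crawled exactly once regardless of how long the browser takes to start.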
