DjProg (April 9, 2016):

Hello guys, I'm seeing some odd behavior when running my multithreaded script here:

```
reset browser
clear cookies
ui drop down("Max threads","1,2,3,4,5,6,7,8,9,10",#max_threads)
ui block text("URLs to crawl",#ui_URLs)
clear list(%urls)
add list to list(%urls,$list from text(#ui_URLs,$new line),"Delete","Global")
set(#url_crawling_position,"-1","Global")
set(#used_threads,0,"Global")
loop($list total(%urls)) {
    loop while($comparison(#used_threads,">= Greater than or equal to",#max_threads)) {
        wait(1)
    }
    loop_process()
}
define loop_process {
    increment(#used_threads)
    increment(#url_crawling_position)
    scraping_procedure()
}
define scraping_procedure {
    thread {
        in new browser {
            set(#navigate_url,$list item(%urls,#url_crawling_position),"Local")
            navigate(#navigate_url,"Wait")
            wait(5)
            decrement(#used_threads)
            log("End crawling>>> {#navigate_url}")
        }
    }
}
```

If I set Max threads to 3 and run with this set of test URLs:

http://google.com
http://amazon.com
http://yahoo.com
http://bing.com
http://ebay.com
http://www.booking.com/
https://www.airbnb.com

it "skips" google.com and amazon.com and instead crawls yahoo.com three times, as you can see in my log:

2016-04-09 13:57:32 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:34 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:38 [LOG] End crawling>>> http://yahoo.com
2016-04-09 13:57:42 [LOG] End crawling>>> http://bing.com
2016-04-09 13:57:54 [LOG] End crawling>>> http://ebay.com
2016-04-09 13:57:54 [LOG] End crawling>>> http://www.booking.com/
2016-04-09 13:58:03 [LOG] End crawling>>> https://www.airbnb.com

Any idea where I went wrong? I really don't see how this can happen, since I'm incrementing #url_crawling_position BEFORE the thread and the navigate, so it shouldn't have the same value three times.

Thanks a lot,
Cheers
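[Editor's note] The symptoms above look like a classic launch-time race: the thread body reads the Global #url_crawling_position only once the thread actually starts running, and by then the main loop may already have incremented it several more times, so several threads see the same (latest) value. A minimal Python sketch that reproduces the same pattern (this is illustrative Python, not UBot; the URL list, sleep timings, and variable names are made up for the demonstration):

```python
import threading
import time

urls = ["http://google.com", "http://amazon.com", "http://yahoo.com",
        "http://bing.com", "http://ebay.com"]

position = -1          # shared counter, like the Global #url_crawling_position
results = []
lock = threading.Lock()

def crawl():
    # The shared index is read only AFTER the thread starts running;
    # by then the launcher loop may have advanced it several times.
    url = urls[position]
    time.sleep(0.01)   # stand-in for navigate + wait
    with lock:
        results.append(url)

threads = []
for _ in urls:
    position += 1                      # increment happens before start(), as in the script
    t = threading.Thread(target=crawl)
    t.start()                          # no pause: thread may read `position` too late
    threads.append(t)

for t in threads:
    t.join()

print(results)   # frequently contains duplicates with some URLs skipped
```

Incrementing before `start()` does not help, because the increment and the thread's read of the shared variable are still unsynchronized: the launcher can loop around and increment again before the new thread gets scheduled.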
DjProg (April 9, 2016, author):

Tested again with 3 threads but a bigger list of test URLs:

http://google.com
http://amazon.com
http://yahoo.com
http://bing.com
http://ebay.com
http://www.booking.com/
https://www.airbnb.com
http://www.alexa.com/
https://login.live.com/
https://uk.linkedin.com/
https://www.mozilla.org/
http://www.apple.com/
http://www.linux.org/

And it's complete havoc:

2016-04-09 14:16:46 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:47 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:49 [LOG] End crawling>>> http://yahoo.com
2016-04-09 14:16:55 [LOG] End crawling>>> http://bing.com
2016-04-09 14:17:04 [LOG] End crawling>>> http://www.booking.com/
2016-04-09 14:17:04 [LOG] End crawling>>> http://ebay.com
2016-04-09 14:17:16 [LOG] End crawling>>> https://www.airbnb.com
2016-04-09 14:17:16 [LOG] End crawling>>> https://login.live.com/
2016-04-09 14:17:17 [LOG] End crawling>>> https://login.live.com/

Tons of URLs are missed and duplicated instead, and it's not just at the "start" of the multithreading. I'd bet I'm not the first one to run into this kind of behavior; any tips?

Thanks
Bot-Factory (April 9, 2016):

After your call to scraping_procedure(), put a wait of 0.5. You need a short pause between launching threads; otherwise UBot can get confused.

Dan
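[Editor's note] The pause works because it gives each new thread time to copy the shared index before the launcher advances it, but it only narrows the race window rather than closing it. A more robust pattern, sketched here in Python rather than UBot, is to snapshot the value at launch time and hand it to the thread as an argument, so the thread never reads the shared counter at all:

```python
import threading
import time

urls = ["http://google.com", "http://amazon.com", "http://yahoo.com"]
results = []
lock = threading.Lock()

def crawl(url):            # the URL is bound at launch time, not read later
    time.sleep(0.01)       # stand-in for navigate + wait
    with lock:
        results.append(url)

threads = []
for url in urls:
    # Passing the value via args snapshots it per thread; there is no
    # shared index to race on, and no sleep between launches is needed.
    t = threading.Thread(target=crawl, args=(url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

print(sorted(results) == sorted(urls))   # every URL crawled exactly once
```

In UBot terms, the analogous idea is to resolve the list item into a value before the thread block starts, so the thread carries its own copy; the details of how to do that depend on UBot's scoping rules, which I have not verified here.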
DjProg (April 9, 2016, author):

Thanks Dan. I had to use 1 second instead of 0.5, but it's working now. (Well, only the multithreading issue is fixed; now the browser hangs and stops responding after visiting a few dozen sites, with the white "forever loading" wheel of death. I've opened a ticket for that, as it doesn't seem normal at all.)

Cheers