
[Case Study] 988% Speed Increase - Working With Large Lists



This is just a quick write-up of how I began to solve one of the big issues with Ubot - working with large lists. This post details how I checked a list of 10,000 items against a list of 1,660 - my first program took nearly 7 minutes to do the task and my final one took only 42 seconds.

 

==================================================

 

I knew right away I was in deep trouble. I needed to check very large lists against each other in Ubot, and that meant bad news and headaches in the near future. The problem is that Ubot doesn't handle large lists all that well. It takes forever to import them - in Ubot Studio itself you can't really import large lists, or at least I can't; I only have 8 gigs of RAM.

 

Compiled bots do handle them better, though. Importing a list of 100k into Ubot Studio is next to impossible for me - I lose patience after a while and just kill the process. In a compiled bot it doesn't take quite forever, maybe 2 minutes, and if it runs smoothly afterwards then that isn't the end of the world.

 

But another massive issue is that Ubot is holding all of this information in memory, and that clearly slows things down - as you will soon see. That is the part of the problem I focused on.

 

My goal: To take a list and filter it against another list. Sort of like if you have a database of links that you have already used and want to take a scraped list and filter out the ones you have already posted to - same idea at least.

 

I started with the way I've always done it: loop for the list total of list 1, and inside that loop, loop for the list total of list 2 on every single iteration, checking one item against the entire second list.

This takes forever.
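To make the structure concrete, here is a rough Python stand-in for that first version (the original is UBot script; the sample data and names are made up for illustration):

```python
import time

# Rough Python equivalent of the first version: a loop within a loop,
# checking every item of list 1 against every item of list 2.
scraped = [f"http://example.com/page{i}" for i in range(10000)]       # "list 1"
blacklist = [f"http://example.com/page{i * 3}" for i in range(1660)]  # "list 2"

start = time.time()
kept = []

for url in scraped:            # loop for the list total of list 1
    matched = False
    for bad in blacklist:      # loop for the list total of list 2, every time
        if url == bad:
            matched = True
    if not matched:
        kept.append(url)

print(f"kept {len(kept)} of {len(scraped)} items in {time.time() - start:.1f}s")
```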

 

On small lists you don't notice it so much, but once you get into the thousands it really does take a long time. I set up a stat monitor to track the progress and also added a start and stop time so I could measure the total run time.

 

I checked a list of 10,000 against a list of 1,660. It took 6 minutes and 55 seconds.

 

At that rate I might as well give up before I begin - ideally, my end goal is to check 1 million against 10 million.

 

Enter remove from list:

 

This was the first breakthrough - realizing that I needed to cut the list size down at every possible opportunity if I was going to speed this thing up. So instead of stepping through list 1 by position, I set a variable to list item 0 and then removed list item 0 on every iteration, so the working list shrinks as it is processed.
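Continuing the Python stand-in from above, the pattern looks roughly like this (a sketch of the "set variable to list item 0, then remove from list" idea, not the actual UBot nodes):

```python
# Read item 0 into a variable and remove it, so the working copy of
# list 1 shrinks on every pass instead of being indexed by position.
while scraped:                  # keep going until list 1 is empty
    current = scraped.pop(0)    # take item 0 and remove it in one step
    # ... check `current` against the blacklist as before ...
```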

 

5:28 - this single change improved the program drastically. There were a few other tweaks, like removing unused else statements, but really nothing major at all. I had sped it up quite a bit, yet it was still way too slow.

 

Multithreading?

 

Talk about a good idea gone bad. This did nothing but eat memory and slow everything down overall. I think that because this task is lightning fast and doesn't use the browser at all, multithreading was just making the program work harder instead of letting it take the path of least resistance.

 

The heart of the trick:

 

One node I had been considering adding in was contains - I like that it checks the whole list for something all at once rather than iterating through each item. I knew that if I could work it into the program I could shave off some serious time.

And that is exactly what happened.

 

I took the second list - call it the blacklist - and looped for the list total of that, only checking whether list 1 contains each item from the blacklist. If list 1 does not contain the item, it is thrown away; if it does, the item is saved to a second blacklist.

 

I am doing this to cut down on the overall iterations for the second loop. If you have 1,660 items in your blacklist but only 100 of them actually match something, then you are doing 1,560 extra loops for each iteration - in my case 10,000 iterations, or a total of 15,600,000 extra tasks.
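As a sketch of this pruning pass, continuing the Python stand-in from above (names are placeholders):

```python
# Pass 1 of the trick: prune the blacklist down to only the entries that
# actually appear somewhere in list 1, using a whole-list membership test
# (the rough equivalent of UBot's contains) instead of a loop within a loop.
pruned_blacklist = []
while blacklist:
    entry = blacklist.pop(0)        # list item 0, then remove it
    if entry in scraped:            # "does list 1 contain this item?"
        pruned_blacklist.append(entry)
# the original blacklist is now empty, mirroring clearing it to free memory
```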

 

Once the blacklist is checked and cut down, the original blacklist is cleared to save precious resources and keep the program as light as possible.

Now the second loop begins to actually do the work, saving off only the items from list 1 that are not in the blacklist. We are no longer wasting resources, because the blacklist we check against only contains items that we know appear in both lists.

 

I know this might be a bit hard to visualize - but the core idea is this:

 

Check the blacklist (list 2) against list 1 - use contains so you only have to loop for the list total of list 2 and never do a loop within a loop. Then save off the blacklist items that we know exist in list 1, cutting the blacklist down.

 

Now use that new blacklist for the second loop. All the while, work from list item 0 and remove items from the lists as you go - everywhere except the new blacklist in the second loop, which has to stay intact so you can use it for each iteration of list 1.
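Putting the whole thing together, here is a minimal Python sketch of the two-pass approach (the original is UBot script; the function and list names are placeholders):

```python
def filter_against_blacklist(list1, list2):
    """Two-pass filter: return the items of list1 that are not in list2."""
    # Pass 1: cut list2 down to only the entries that really occur in list1,
    # consuming list2 as we go so nothing extra stays in memory.
    pruned = []
    while list2:
        entry = list2.pop(0)
        if entry in list1:              # whole-list "contains" check
            pruned.append(entry)

    # Pass 2: walk list1 by taking and removing item 0 each time, keeping
    # only items that are not in the pruned blacklist. The pruned blacklist
    # itself stays intact for every iteration.
    kept = []
    while list1:
        current = list1.pop(0)
        if current not in pruned:
            kept.append(current)
    return kept
```

(In Python specifically, turning the pruned blacklist into a set would make each membership check near-instant, but that is a Python detail rather than part of the UBot approach described here.)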

 

In the end it took 42 seconds to run the latest test - a 988% speed increase over the first test.

 

I still need a way to chunk the data, though, since importing massive lists still takes so long. Ideally lists would be imported 10k at a time, which should be much faster. I'll update this thread if I find a good way to do that.


Great write-up. I'm now using the HTTP plugin wherever possible - I find that when scraping at a high thread count I can build massive lists quickly. I'll implement this and see how it goes.


Nick,

Thanks for your write-up. You have some really great ideas there.

You did say, "my end goal is to check 1 million against 10 million; ideally."

If you want to check 1 million against 10 million, I think you may have a problem scaling the in-memory approach.

If I were going to do that, I would load the 1 million and the 10 million into MySQL database tables.

Then I would write a stored procedure to compare the 2 tables on the server and put the results into a 3rd table.

With that database setup in place, you can call the stored procedure from the UBOT program and then retrieve the final results from table 3 with the UBOT program.
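A rough sketch of what that setup might look like, shown here in Python with embedded SQL purely for illustration (table names, column names, and connection details are all made up; in practice the INSERT/SELECT below would live inside the stored procedure):

```python
import mysql.connector  # assumes the mysql-connector-python package

# Connection details are placeholders.
conn = mysql.connector.connect(host="localhost", user="bot",
                               password="secret", database="lists")
cur = conn.cursor()

# Two source tables plus a results table; the indexes are what lets the
# server do the compare quickly at this scale.
cur.execute("CREATE TABLE IF NOT EXISTS scraped   (url VARCHAR(255), INDEX (url))")
cur.execute("CREATE TABLE IF NOT EXISTS used_urls (url VARCHAR(255), INDEX (url))")
cur.execute("CREATE TABLE IF NOT EXISTS results   (url VARCHAR(255))")

# The heart of the stored procedure would be one set-based statement that
# runs entirely on the server: keep the scraped rows with no match in used_urls.
cur.execute("""
    INSERT INTO results (url)
    SELECT s.url
    FROM scraped s
    LEFT JOIN used_urls u ON u.url = s.url
    WHERE u.url IS NULL
""")
conn.commit()

# Pull the filtered list back into the client (the UBot program, in this thread).
cur.execute("SELECT url FROM results")
filtered = [row[0] for row in cur.fetchall()]
```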

     I have had good luck with this approach on one of my projects.

     Good luck on your quest.

 

Andy (Arunner26)


I wouldn't say that the coding with tables in MySQL is easier but with MySQL tables you can use INDEXING to make access for the compare faster.

In addition, if you call a stored procedure to do the work it will use the power of the server to do the compare.

10 million rows is a big table and represents a serious work load.

At my regular job we regularly process tables in the 40 million row range and have history tables that go over 1 billion rows.

So given the right servers and techniques some amazing processing is possible these days.

