Scraping A Crazy Amount Of Data


6 replies to this topic

#1 Biks

    Advanced Member

  • Fellow UBotter
  • 215 posts
  • OS: Windows 8
  • Total Memory: More than 9 GB
  • Framework: v4.0
  • License: Professional Edition

Posted 05 February 2018 - 09:53 AM

I have never gotten UBot to scrape beyond a certain point. It seems that once I hit around 42,000 entries, the whole thing collapses; I just had this happen twice on the same site. I'm guessing I'm running out of memory. At this point I'm using 16 GB; will doubling my memory help?

 

I've recently been grabbing followers on a few websites that require you to keep loading a new batch of users as you scroll down the page (using the JavaScript load command). There's no way of stopping, saving, and continuing beyond a certain point; it just offers an endless list.

 

As an example: the Spotify Twitter account has 2.5 million followers. How the hell would I scrape 2.5 million entries with UBot? Are there any other places/services that could do this?



#2 giganut

    softwareautomation.org

  • Fellow UBotter
  • 535 posts
  • Location: Lost In Space!
  • OS: Windows 10
  • Total Memory: 4 GB
  • Framework: v4.5+, unsure
  • License: Developer Edition

Posted 05 February 2018 - 10:22 AM

Have you tried using this plugin? http://network.ubots...gin-large-data/

You can also take a look at this plugin: http://network.ubots...-ubot-discount/

 

You can also scrape the data to a file, appending a number to the filename each time, and clear the list from within UBot. Then load the files back in as needed.
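
The same idea sketched in Python rather than UBot script (a rough sketch only; the chunk size and the followers_N.txt naming pattern are made up for illustration):

CHUNK_SIZE = 5000  # arbitrary; whatever your machine handles comfortably

buffer = []
chunk_index = 0

def flush(batch, index):
    # Write one numbered chunk file: followers_0.txt, followers_1.txt, ...
    with open(f"followers_{index}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(batch))

def add_batch(new_urls):
    # Called once per scrape pass; memory use stays flat no matter
    # how large the total gets, because full chunks go straight to disk.
    global chunk_index
    buffer.extend(new_urls)
    if len(buffer) >= CHUNK_SIZE:
        flush(buffer, chunk_index)
        buffer.clear()
        chunk_index += 1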




#3 Biks

    Advanced Member

  • Fellow UBotter
  • 215 posts
  • OS: Windows 8
  • Total Memory: More than 9 GB
  • Framework: v4.0
  • License: Professional Edition

Posted 05 February 2018 - 10:33 AM

Have you tried using this plugin? http://network.ubots...gin-large-data/

You can also take a look at this plugin: http://network.ubots...-ubot-discount/

From what I can see, these deal with the data once you've acquired it. The problem is that I need to hold all 2.5 million entries in memory during the scrape, before I can do anything with them; I can't split the input into manageable smaller sections.

 

Giganut, how many Twitter followers can you scrape at one time?



#4 giganut

    softwareautomation.org

  • Fellow UBotter
  • 535 posts
  • Location: Lost In Space!
  • OS: Windows 10
  • Total Memory: 4 GB
  • Framework: v4.5+, unsure
  • License: Developer Edition

Posted 05 February 2018 - 11:26 AM

When using the large data plugin I have scraped over 1,000,000 entries. Give it a try; that's all I can say.




#5 HelloInsomnia

    Advanced Member

  • Moderators
  • 2859 posts
  • OS: Windows 10
  • Total Memory: More than 9 GB
  • Framework: v4.5+, unsure
  • License: Developer Edition

Posted 05 February 2018 - 01:11 PM

At this point I'm using 16 GB; will doubling my memory help?

 

No, UBot is 32-bit, so there is a limit to how much memory it can use; I believe it's 2 GB max. (A 32-bit Windows process gets at most 2 GB of address space by default, no matter how much RAM is installed, so more RAM won't help.)

 

But try the large data plugin like giganut suggested, or try saving the data as you go into a database or something.
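
For the database route, a minimal sketch using Python's built-in sqlite3 (illustrative only; the followers.db file and the table/column names are invented):

import sqlite3

conn = sqlite3.connect("followers.db")
conn.execute("CREATE TABLE IF NOT EXISTS followers (url TEXT PRIMARY KEY)")

def save_batch(urls):
    # INSERT OR IGNORE dedupes re-scraped profiles for free, and each
    # commit puts the batch safely on disk, so a crash loses almost nothing.
    conn.executemany("INSERT OR IGNORE INTO followers (url) VALUES (?)",
                     [(u,) for u in urls])
    conn.commit()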



#6 Biks

    Advanced Member

  • Fellow UBotter
  • 215 posts
  • OS: Windows 8
  • Total Memory: More than 9 GB
  • Framework: v4.0
  • License: Professional Edition

Posted 05 February 2018 - 06:24 PM

clear list(%followers)
navigate("https://soundcloud.com/random-house-audio/followers","Wait")
wait for browser event("Everything Loaded","")
loop(9999) {
    comment("Grab every follower link currently in the DOM; the Delete option drops duplicates from earlier passes")
    add list to list(%followers,$scrape attribute(<class="userBadgeListItem__heading sc-type-small sc-link-dark sc-truncate">,"href"),"Delete","Global")
    comment("Scroll to the bottom to trigger the next batch of 25 profiles")
    run javascript("window.setTimeout(function() \{
window.scrollTo(0, document.body.scrollHeight)
 \}, 500)")
    wait(3)
}
save to file("C:\\Users\\Public\\Ubot\\Soundcloud\\SCRAPED USERS.txt",%followers)

I'm basically doing this. Each JavaScript page load gives me 25 new profiles at the end of the column. Once I hit 42,000+ (approximately loop #1680), it's game over; the system locks up.

 

What I would like to do is save, then somehow delete everything from memory up to 42,000, then continue. But I can't; it's one single long page of results. From what I remember, Twitter does the same thing. What I've done is move the save to file command inside the loop, so I catch everything before the crash, but I'm still stuck at 42,000.

 

Elsewhere, I HAVE scraped long sequences where a site loads page 1, 2, 3, etc.: I save a bunch, create a new browser to reset, then continue. Not here.
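
For comparison, here is roughly what the save-as-you-go version looks like outside UBot, using Python with Selenium. This is a sketch under assumptions: the CSS class is taken from the UBot script above and may have changed on SoundCloud, and the output path is just the same file name reused.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://soundcloud.com/random-house-audio/followers")

seen = set()
with open("SCRAPED USERS.txt", "a", encoding="utf-8") as out:
    while True:
        links = driver.find_elements(By.CSS_SELECTOR,
                                     "a.userBadgeListItem__heading")
        new = [a.get_attribute("href") for a in links
               if a.get_attribute("href") not in seen]
        if not new:
            break  # nothing new loaded: end of the list (or we got blocked)
        for href in new:
            seen.add(href)
            out.write(href + "\n")
        out.flush()  # everything scraped so far is already safe on disk
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(3)

Note that the in-memory list is probably not the only limit on an endless-scroll page: the browser's DOM keeps growing with every batch of 25, so the lockup around 42,000 may be the embedded browser rather than the list itself. One common trick (untested here) is to delete already-scraped rows from the DOM with a bit of JavaScript after each pass so the page stays small.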



#7 Biks

    Advanced Member

  • Fellow UBotter
  • 215 posts
  • OS: Windows 8
  • Total Memory: More than 9 GB
  • Framework: v4.0
  • License: Professional Edition

Posted 06 February 2018 - 09:11 AM

So what other software can I learn/use to do this (that won't crash)?

 

or

 

I would really love to have this scraped: https://soundcloud.c...io_us/followers

And maybe this too: https://soundcloud.c...dible/followers

 

Anyone willing to run my code on their machine? Does anyone know of someone who could/would do this? How much would you/they charge?





