UBot Underground

How to Scrape 260 000 Pages of data with captcha every 5 pages



Hi, I need to scrape some info from pages and save it in a CSV file. So I will build a bot like this:

Loop 260 000 times, using a count variable and the increment command to go from zero to 260 000.

 

I will scrape to variables and then add a line in CSV format for each page. So if I do it all at once I will have 260 000 lines, and I'm sure the bot will not be able to hold that much data. Does anyone know what the optimum is?
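
One way around holding every line in memory is to append each record to the file as soon as it is scraped. Here is a minimal sketch in Python rather than UBot script (an assumption, since the thread contains no code); `append_row`, `count_rows`, the file name and the demo records are all hypothetical.

```python
import csv
import os
import tempfile

def append_row(csv_path, row):
    """Append one scraped record to the CSV file, then close the file again."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(row)

def count_rows(csv_path):
    """Count rows by streaming the file instead of loading it whole."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.reader(handle))

# Hypothetical demo records standing in for scraped pages.
demo_path = os.path.join(tempfile.mkdtemp(), "companies.csv")
for page_number in range(1, 4):
    append_row(demo_path, [page_number, f"Company {page_number}", "value, with comma"])

print(count_rows(demo_path))
```

Because each row is written straight to disk, memory use stays flat whether the job is 3 pages or 260 000, and a crash only loses the page in progress.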

 

Threading: I would like to use this because I need to do this job fast. To do it in 1 day I need something like 40 threads. Is this possible?

Solving a captcha every 5 pages is a pain in the ass. Will this work? And if I use threads, my captcha account will need to solve maybe 10 captchas at one time. Is this possible?

Will the server IP-ban me if I navigate this fast?

Please give me any info that I'm not aware of.


Use proxies.

Split your job not only across threads but also across more instances of your bot.

Split the list too; as far as I know, 50,000 per list is a passable value.

Last but not least:

How fast it goes depends on how much data you want to scrape from each page.

To give you more info, we would need to see the site and what you want to scrape from it.

 

greetz


Do I really need proxies?

I need a lot of data, but the site's pages and server are fast and can handle a lot of traffic. I did the math: with 40 bots it will take me 10 hours.
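
The time estimates in this thread can be sanity-checked with a few lines of arithmetic. This is only a sketch in Python; `seconds_per_page` is a hypothetical helper, and it assumes the pages are split evenly across workers and ignores captcha time.

```python
def seconds_per_page(total_pages, workers, hours):
    """Time budget per page for each worker, assuming an even split."""
    pages_per_worker = total_pages / workers
    return hours * 3600 / pages_per_worker

# 40 bots finishing in 10 hours: each handles 6 500 pages.
print(round(seconds_per_page(260_000, 40, 10), 2))  # 5.54
# 40 threads finishing in one day.
print(round(seconds_per_page(260_000, 40, 24), 2))  # 13.29
```

So the 10-hour figure only holds if every page, including loading, scraping and its share of captcha solving, averages about 5.5 seconds.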

 

But I will burn 60 000 captchas. Is this normal? Can deathbycaptcha provide this many captchas in such a short time?


If you hammer a server with requests, normally the server bans you, so you need proxies or you need to slow down your requests.
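
Slowing down requests can be as simple as enforcing a minimum gap between them. Below is a minimal Python sketch; the `Throttle` class is hypothetical, and the right interval for any given site is something you would have to find out yourself.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Demo with a tiny interval; a real job would use seconds, not milliseconds.
throttle = Throttle(0.01)
for page_number in range(3):
    throttle.wait()
    # ... the page fetch would go here ...
```

Each thread or bot instance would hold its own `Throttle`, so the total request rate is the number of workers divided by the interval.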

A captcha takes at least 10 seconds, and not every captcha is solved on the first try, so figure on 20 seconds per captcha.
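
To see what 20 seconds per captcha means for the whole job, here is a rough calculation in Python. `captcha_hours` is a hypothetical helper; it assumes exactly one captcha per 5 pages (which gives 52 000 for 260 000 pages, a bit under the 60 000 quoted above) and that workers solve captchas fully in parallel.

```python
def captcha_hours(total_pages, pages_per_captcha, seconds_per_captcha, workers):
    """Return (captcha count, wall-clock hours spent waiting on captchas)."""
    captchas = total_pages // pages_per_captcha
    total_seconds = captchas * seconds_per_captcha
    return captchas, total_seconds / 3600 / workers

count, hours_serial = captcha_hours(260_000, 5, 20, 1)
_, hours_parallel = captcha_hours(260_000, 5, 20, 40)
print(count)                     # 52000
print(round(hours_serial, 1))    # 288.9
print(round(hours_parallel, 2))  # 7.22
```

With a single worker the captchas alone would take about 12 days; with 40 workers they still add roughly 7 hours on top of the page loads.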

 

The next question is where you want to run 40 instances of the bot. One machine is not enough, because a bot with threads needs nearly a gig of RAM most of the time.

I think, to be realistic, 60,000 captchas and 260,000 pages from a site that knows people steal its data will take a couple of days.


Can you give me an example of using proxies? Do I need 26 proxies if I run 26 hammering bots? But each of those 26 will hit the site 10 000 times, so maybe even with proxies I will be banned?
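
As a rough illustration of pairing workers with proxies, here is a Python sketch. `assign_proxies` is a hypothetical helper and the proxy addresses are placeholders from a documentation-only IP range, not real proxies; whether one proxy per worker is enough to avoid a ban depends on the site.

```python
from itertools import cycle

def assign_proxies(worker_ids, proxies):
    """Pair each worker with a proxy, reusing proxies round-robin if needed."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    # zip stops when the finite list of workers runs out.
    return dict(zip(worker_ids, cycle(proxies)))

# Placeholder addresses only.
proxies = [f"http://192.0.2.{n}:8080" for n in range(1, 14)]
assignment = assign_proxies(list(range(26)), proxies)

print(assignment[0])
print(assignment[13])  # same proxy as worker 0, since 13 proxies serve 26 workers
```

With 26 proxies every worker gets its own address; with fewer, several workers share one and that address sees proportionally more requests.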

 

Can someone give me some numbers for what my machine can run?

I have 8 GB RAM and a Phenom II X4 at 3.40 GHz.

And what about threads: how many per bot, and how many bots can I run? I think I can easily run 6-7 bots, but then I will need 7 threads per bot, or at least 6. Is this possible?

And do I need to open new threads in a new window, or can I just run threads without a new browser?

And is it maybe better to run fewer bots but with more threads?

 

I have decided to chop the work into 26 pieces, each piece covering 10 000 pages, which equals 260 000. So what is the best way to do it? 1 bot with 26 threads, 2 bots with 13, 4 with 7, or 5 with 5?
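
Chopping the page count into equal pieces is straightforward to express in code. The Python sketch below uses a hypothetical `split_ranges` helper and assumes pages are numbered from zero, as in the counting loop described at the start of the thread.

```python
def split_ranges(total_pages, pieces):
    """Return (start, end) page ranges, end exclusive, covering 0..total_pages."""
    base, extra = divmod(total_pages, pieces)
    ranges = []
    start = 0
    for index in range(pieces):
        # Spread any remainder over the first few pieces.
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

ranges = split_ranges(260_000, 26)
print(len(ranges))   # 26
print(ranges[0])     # (0, 10000)
print(ranges[-1])    # (250000, 260000)
```

Each range can then be handed to one thread or one bot instance, and each piece writes its own output file so a failed piece can be rerun on its own.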

 

I asked about captcha solving. I know it takes time and there will be wrong captchas, but what I'm asking is this:

If I need to solve a captcha every 5 pages per thread and there are 26 threads, that means I will be sending captcha requests every second, with maybe more than 15 captchas pending. Will this work?
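
The load on the captcha account can be estimated with a simple model. This Python sketch assumes each thread waits for its captcha before continuing, 5 seconds per page and 20 seconds per captcha; all three are assumptions for illustration, and `captcha_load` is a hypothetical helper.

```python
def captcha_load(threads, pages_per_captcha, seconds_per_page, seconds_per_captcha):
    """Return (average captchas pending, captcha requests per second)."""
    # One cycle per thread: scrape the pages, then wait for one captcha.
    cycle = pages_per_captcha * seconds_per_page + seconds_per_captcha
    pending = threads * seconds_per_captcha / cycle
    per_second = threads / cycle
    return pending, per_second

pending, per_second = captcha_load(26, 5, 5.0, 20.0)
print(round(pending, 1))     # 11.6
print(round(per_second, 2))  # 0.58
```

Under these assumptions about 12 captchas are pending on average, and there can never be more than 26 pending because each thread blocks on its own captcha.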


mamica, sorry, but outsource this job!

Your questions tell me that you don't have enough experience with UBot to do a job like this.

Learn UBot and most of your questions will be answered.


I have a lot of experience creating bots; I have created over a thousand of them. The reason I opened this topic is to ask what walls I could hit doing this huge job.

Creating the bot is a piece of cake; I already have it in my head. It will be very short, a few lines.

I just want to know how captcha solving will go, since I have never used so many captchas in so short a time. 60 000 captchas burned in 1 day... is this even possible?

 

I will use proxies and I already have some ideas. But the biggest question is how I can test whether the server will ban me if I surf too fast.

Maybe I should try without proxies and see how far I get?
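
One way to probe for a ban without guessing is to watch each response for signs of blocking and back off when they appear. The Python sketch below is only an illustration: the status codes and text markers are assumed signals that would need checking against the actual site, and `looks_blocked` and `next_delay` are hypothetical helpers.

```python
# Assumed signals only: what a given site returns when it blocks a client
# has to be verified against that site.
BLOCK_STATUSES = {403, 429, 503}
BLOCK_MARKERS = ("access denied", "too many requests")

def looks_blocked(status_code, body):
    """Guess whether a response is a block page rather than real content."""
    if status_code in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)

def next_delay(current_delay, blocked, floor=1.0, ceiling=300.0):
    """Double the delay after a block, otherwise ease back toward the floor."""
    if blocked:
        return min(current_delay * 2, ceiling)
    return max(current_delay * 0.9, floor)

# Simulated responses standing in for real page fetches.
delay = 2.0
for status, body in [(200, "<html>ok</html>"), (429, ""), (200, "Access Denied")]:
    delay = next_delay(delay, looks_blocked(status, body))
print(delay)
```

Running a small test range this way and logging when the first block appears gives a rough idea of how fast the site tolerates requests from one IP.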

 

 

I will need to start creating this bot soon, and before I do I need as much info as I can get.


Okay, understood.

But you have to know that if you use free proxies you will have a lot of trouble with the captcha if it comes from Google.

I have proxy gobliner, and this program finds many proxies, we are talking 1000 or more, but in the end, if I test them all against the Google captcha, about 5 are left, sometimes a few more... I have been doing this proxy business for months and sometimes it drives me nuts! So that's one of the hardest parts, I think.

...and I don't have much experience with private proxies, so I can't tell you about those.


First, if you have any inclination to do this at home, don't. You are doing a crapload of scraping and sooner or later your ISP is going to ban you. Then you are royally screwed. Do this right - on leased Windows servers. Try VPS at the start.

 

Split the workload up to start. Host the pieces with different vendors and spec the servers as need be (quad core, 2GB+ memory, 100Mbps line, etc). If throughput is slow, you can always tweak the hardware by scaling up and out (i.e. add more memory, add another core, add another server).

 

Don't use public proxies. That will be a huge bottleneck. Always go with high speed elite.

 

Also, another suggestion. Create a disk structure that can be easily aggregated back to your local system for analysis/manipulation/etc.

 

Server 1 on Host A

High speed elite proxies

Windows 7 64 bit/Windows server

Disk structure
/root
 /category1
   /subcategory1
   /subcategory2
 /category2
   /subcategory1

 

Server 2 on Host B

High speed elite proxies

Windows 7 64 bit/Windows server

Disk structure
/root
 /category1
   /subcategory1
   /subcategory2
 /category2
   /subcategory1

 

Store content in a useful information architecture. Then FTP it back down to your local computer and aggregate it in a tree folder model. You can then write a directory walker to pour it into a MySQL/SQL Server database to add even more utility. You could also use off-the-shelf tools to index the files, or even Ubot (!!!) to do the work locally.
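
A directory walker of the kind described can be quite short. The Python sketch below uses SQLite from the standard library as a stand-in for MySQL/SQL Server (an assumption made so the example is self-contained); the `files` table, the `index_tree` helper and the demo file names are hypothetical.

```python
import os
import sqlite3
import tempfile

def index_tree(root, connection):
    """Walk root and record every file's folder, name and size in a table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS files (folder TEXT, name TEXT, size INTEGER)"
    )
    rows = []
    for folder, _subfolders, filenames in os.walk(root):
        relative = os.path.relpath(folder, root)
        for filename in filenames:
            size = os.path.getsize(os.path.join(folder, filename))
            rows.append((relative, filename, size))
    connection.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)

# Build a small demo tree shaped like the category/subcategory layout above.
demo_root = tempfile.mkdtemp()
for parts in (("category1", "subcategory1"),
              ("category1", "subcategory2"),
              ("category2", "subcategory1")):
    os.makedirs(os.path.join(demo_root, *parts))
    with open(os.path.join(demo_root, *parts, "pages.csv"), "w", encoding="utf-8") as handle:
        handle.write("id,name\n")

demo_db = sqlite3.connect(":memory:")
indexed = index_tree(demo_root, demo_db)
print(indexed)
```

Keeping the same folder layout on every server means the walker can run unchanged once the files have been pulled back to one machine.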

 

I recommend buying dual-monitor video cards and having an equal number of LCD screens. Each screen will be dedicated to a server. This will allow you to watch each one execute on a separate screen (and avoid flipping back and forth) and take action quickly if something goes awry.

As for captcha: Death By Captcha, or sit there and do it yourself. Really no other choices.


Thanks for this answer.

First, I have an external ESX server, also running Windows.

But sure, I can also do that at home if I use proxies.

I live in Germany, and here the ISPs are not as stupid as in the USA.

If I do spam things over a proxy they do nothing, so I'm relaxed about this.

And yes, I also use a ready-made structure, but a bit more tricky:

I use TrueCrypt containers, which gives me the possibility to clone a container really fast.

It's also a bit safer.


Thanks for all the info. Yeah, I was thinking that this job would not be easy.

But ISP banning, isn't that a bit much?

And about using proxies, hmm, so you are saying that if I use proxies, some of them might not be able to read the captcha, and then I'm doomed?


More and more I'm thinking of outsourcing this job, because I think the payment for it may be too low: $600. Is that too low for this job?


Why not purchase more captcha accounts? Get 20 captcha accounts backed by 20 private proxies running off a half dozen beefy Windows VPSes. Big jobs equal big overhead. Problem solved, if you want it done super fast. If you have the time, break those resources down and spread them out to save money.


The info I was supposed to scrape is public: some info about companies that is published on one website, where you can even download it in PDF format for free. I have created the scraper, and it is scraping and creating a CSV file.

General Lee, yeah, you are right, I could create more bots, each with a unique captcha account and a unique IP (proxy).

Well, I made a good plan and I think it would work 100%. But the problem is that I recently figured out that there are 700 000 pages and not 260 000. This more than doubles the captchas, so I will spend $200 on captchas alone, and the job will take a long time, so $600 is IMO not enough.
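
The budget above works out as follows. This is a Python sketch using only the figures quoted in the thread (700 000 pages, a captcha every 5 pages, $200 for captchas, $600 payment); `captcha_budget` is a hypothetical helper, and the per-thousand price is just what those figures imply, not a quoted rate.

```python
def captcha_budget(total_pages, pages_per_captcha, captcha_cost, payment):
    """Return (captcha count, implied price per 1000 captchas, money left over)."""
    captchas = total_pages // pages_per_captcha
    price_per_thousand = captcha_cost / captchas * 1000
    return captchas, price_per_thousand, payment - captcha_cost

captchas, per_thousand, remaining = captcha_budget(700_000, 5, 200, 600)
print(captchas)                # 140000
print(round(per_thousand, 2))  # 1.43
print(remaining)               # 400
print(round(700_000 / 260_000, 2))  # 2.69
```

So captchas alone take a third of the payment, before proxies, servers and the time spent running the job.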

 

Blumi40, I would gladly share the job with you, but you see that the payment is very low... not worth it IMO.

