Scraper Program

Jill · March 7, 2019

Hi, I bought Ubot Standard Edition back in 2014, I think. Even though Ubot is supposed to be for non-programmers, it always seemed too complicated to me, so I have never really used it. But I would like to try again, because I need to replace a scraper program. I still have that program's setup files and license key, but since the dev has gone out of business, it can't phone home to verify the license, and the computers I had it installed on have died.

I have been watching some of the how-to videos on YouTube, but I still don't have a clue what to do. I'm hoping that someone can give me an outline of which functions (is that the right word?) I need to work with. Here is what it needs to do:

First, I have a list of URLs in a text file, one per line. In this case they are redirects, so I need to save the original URL, get the URL that it redirects to, and save that as well.

Next, I need to get things off the page source. The old scraper program (Happy Harvester) would get the text between x and y. Each thing you wanted to save was added as a "rule". For example, the text between <title> and </title>. Or the text between "<a href=" and ">Contact</a>" (which would give you the url to their contact page - if it existed).

The program would save all this info in a csv file.

I've seen the Ubot page scraping functions, but they seem to work on the live side and not the source side.

I'm not asking for a total detailed how-to, but hoping someone can tell me to use "this" for my list of urls, and "this" to save the 2 url infos, and "this" to get to the page source, and "this" to save the various texts between x and y. Just sort of an outline. And then I can hopefully watch the videos and read the tutorials to figure out the rest.

Really appreciate any help!

fastlinks · March 9, 2019



clear list(%urls)
add list to list(%urls,$list from text("http://www.yahoo.com
http://www.bing.com",$new line),"Delete","Global")
clear list(%title)
set list position(%urls,0)
loop($list total(%urls)) {
    set(#curr,$list item(%urls,$list position(%urls)),"Global")
    navigate(#curr,"Wait")
    wait for browser event("Page Loaded","")
    divider
    comment("scrape text in between")
    set(#pagehtml,$document text,"Global")
    set(#title,$plugin function("File Management.dll", "$Find Regex First", #pagehtml, "(?<=<title>).*(?=</title>)"),"Global")
    set(#keyword,$plugin function("File Management.dll", "$Find Regex First", #pagehtml, "(?<=\\<meta name\\=\\\"keywords\\\" content\\=\\\").*?(?=\\\"\\>)"),"Global")
    alert("title: {#title}
head: {#keyword}")
    add item to list(%title,"{#title},{#keyword}","Don\'t Delete","Global")
    divider
    set(#next,$next list item(%urls),"Global")
}
save to file("{$special folder("Application")}\\test.txt",%title)
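The script above saves the title and keywords, but it does not record where each URL redirects to, and it joins fields with a plain comma rather than writing a real CSV. For comparison, here is a rough sketch of those two steps outside UBot, in Python using only the standard library. The function names, the column names and the file names are made up for illustration, and this is only one possible way to do it, not a UBot feature.

```python
import csv
import urllib.request


def read_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def resolve_redirect(url, timeout=10):
    """Return the final URL after urllib has followed any HTTP redirects."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def write_rows(path, header, rows):
    """Write a header and rows as CSV; the csv module quotes embedded commas."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def run(url_file, csv_file):
    """Save each original URL next to the URL it ends up at."""
    rows = []
    for original in read_urls(url_file):
        try:
            final = resolve_redirect(original)
        except (OSError, ValueError) as error:
            # Network failures and malformed URLs are recorded, not fatal.
            final = f"ERROR: {error}"
        rows.append([original, final])
    write_rows(csv_file, ["original_url", "final_url"], rows)
```

Calling run("urls.txt", "results.csv") would read the list and write one row per URL. Note that this only follows HTTP-level redirects; a redirect done by JavaScript or a meta refresh tag would need a real browser, which is what UBot gives you.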

fastlinks · March 9, 2019

To get the page source:

set(#pagehtml,$document text,"Global")

or

set(#pagehtml,$page scrape("<html","</html>"),"Global")

To find the text between x and y, use a regex with a lookbehind for x and a lookahead for y. For example, given

<license>abc</license>

the pattern

(?<=<license>).*(?=</license>)

returns: abc
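To show how the "text between x and y" idea maps onto the rules from the old scraper, here is a small sketch in Python rather than UBot. The text_between helper, the rules dictionary and the sample HTML are all invented for the example; the approach (escape both markers, capture lazily between them) is just one reasonable way to do it.

```python
import re


def text_between(source, start, end):
    """Return the first text found between start and end, or "" if absent."""
    # re.escape lets the markers contain characters such as < > . or ?
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, source, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else ""


# One entry per "rule": a name plus the text before and after the value.
rules = {
    "license": ("<license>", "</license>"),
    "title": ("<title>", "</title>"),
    "contact_url": ("<a href=", ">Contact</a>"),
}

sample = '<title>Demo</title><license>abc</license><a href="/contact.html">Contact</a>'
results = {name: text_between(sample, start, end) for name, (start, end) in rules.items()}
```

Here results["license"] is abc and results["title"] is Demo. The contact rule keeps the surrounding quotes, and on a page with several links it would start at the first <a href= it finds, so a marker as short as that usually needs tightening for real pages.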

Some other useful regexes.

To match an email address:

([a-z0-9][-a-z0-9_\+\.]*[a-z0-9])@([a-z0-9][-a-z0-9\.]*[a-z0-9]\.(arpa|root|aero|biz|me|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)|((?:[0-9]{1,3}\.){3}[0-9]{1,3}))

To match a URL (this pattern is tied to the ibilik domain, so swap in your own site name):

(?:https?:\/\/)?(ibilik\.)([a-zA-Z\.]{2,6})([\/\w\.-]*)*\/?

To match a phone number (three variants):

(\S*\d+\S*){8,16}
(\S*[\d ]\S*){8,16}
$?\d+$?[-.\s]?\d+[-.\s]?\d+

Example input:

Cynthia Ol2.345.6789 ( contact/whatsApp )
washing machine

Result: Ol2.345.6789
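If you want to try a pattern before putting it into a bot, a quick check outside UBot can help. This is a small Python sketch that runs the first phone pattern against the sample line; the variable names are invented for the example. The sample number appears to use the letters O and l in place of 0 and 1, which is why the pattern looks for a run of non-space characters containing digits rather than digits only.

```python
import re

# First phone pattern from the post: one run of non-space characters
# that contains at least eight digits.
PHONE_PATTERN = re.compile(r"(\S*\d+\S*){8,16}")

line = "Cynthia Ol2.345.6789 ( contact/whatsApp )"
match = PHONE_PATTERN.search(line)
found = match.group(0) if match else ""
```

With this sample, found ends up as Ol2.345.6789, the same result as shown above. A line without enough digits in a single run leaves found empty.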

fastlinks · March 16, 2019

.

LuckyUboter · September 5, 2019

Hey, thanks fastlinks... uhm, in which element exactly would we insert this code which you so generously shared?
