Jill 1 Posted March 7, 2019

Hi, I bought Ubot standard edition, back in 2014 I think. Even though Ubot is supposed to be for the non-programmer, it always seemed too complicated to me, so I have never really used it. But I would like to try again.

I need to replace a scraper program. I still have the setup files and the license key, but since the dev has gone out of business, it can't phone home to verify the license. And the computers I have it on have died.

I have been watching some of the how-to videos on YouTube, but I still don't have a clue what to do. I'm hoping that someone can give me an outline of which functions (is that the right word?) I need to work with.

Here is what it needs to do:

First, I have a list of URLs in a text file, one per line. In this case they are redirects, so I need to save the original URL, get the URL that it redirects to, and save that too.

Next, I need to get things off the page source. The old scraper program (Happy Harvester) would get the text between x and y. Each thing you wanted to save was added as a "rule". For example, the text between <title> and </title>. Or the text between "<a href=" and ">Contact</a>" (which would give you the URL of their contact page, if it existed). The program would save all this info in a CSV file.

I've seen the Ubot page scraping functions, but they seem to work on the live side and not the source side.

I'm not asking for a total detailed how-to, but hoping someone can tell me to use "this" for my list of URLs, "this" to save the 2 URL infos, "this" to get to the page source, and "this" to save the various texts between x and y. Just an outline. Then I can hopefully watch the videos and read the tutorials to figure out the rest.

Really appreciate any help!
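For readers who want to see the whole job laid out in one place, here is a minimal sketch in Python (not Ubot) of the four steps described above: follow a redirect, fetch the page source, apply "text between x and y" rules, and write a CSV row. The function names, the `RULES` table, and the sample page are all hypothetical, and the "first start marker, then next end marker" behaviour is an assumption about how a rule should work, not a description of Happy Harvester.

```python
import csv
import io
import urllib.request


def resolve_redirect(url, timeout=10):
    """Return (final_url, page_source) after following any redirects."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        final_url = response.geturl()
        source = response.read().decode("utf-8", errors="replace")
    return final_url, source


def text_between(source, start, end):
    """Return the text between the first `start` and the next `end`, or ""."""
    begin = source.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    finish = source.find(end, begin)
    if finish == -1:
        return ""
    return source[begin:finish]


# Hypothetical rules, modelled on the two examples in the post.
RULES = {
    "title": ("<title>", "</title>"),
    "contact": ("<a href=", ">Contact</a>"),
}


def scrape_row(original_url, final_url, source):
    """Build one CSV row: both URLs, then one column per rule."""
    row = [original_url, final_url]
    for start, end in RULES.values():
        row.append(text_between(source, start, end))
    return row


# Offline demonstration with a made-up page, so no network is needed.
sample = (
    "<html><head><title>Acme</title></head>"
    '<body><a href="/contact.html">Contact</a></body></html>'
)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["original_url", "final_url", *RULES])
writer.writerow(scrape_row("http://example.com/r/1", "http://example.com/", sample))
print(buffer.getvalue())
```

In real use, each line of the URL file would go through `resolve_redirect` and the result would be passed to `scrape_row`. Note that the contact rule starts at the first `<a href=` on the page, so on a page with several links it can capture more than intended.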
fastlinks 16 Posted March 9, 2019

clear list(%urls)
add list to list(%urls,$list from text("http://www.yahoo.com
http://www.bing.com",$new line),"Delete","Global")
clear list(%title)
set list position(%urls,0)
loop($list total(%urls)) {
    set(#curr,$list item(%urls,$list position(%urls)),"Global")
    navigate(#curr,"Wait")
    wait for browser event("Page Loaded","")
    divider
    comment("scrape text in between")
    set(#pagehtml,$document text,"Global")
    set(#title,$plugin function("File Management.dll", "$Find Regex First", #pagehtml, "(?<=<title>).*(?=</title>)"),"Global")
    set(#keyword,$plugin function("File Management.dll", "$Find Regex First", #pagehtml, "(?<=\\<meta name\\=\\\"keywords\\\" content\\=\\\").*?(?=\\\"\\>)"),"Global")
    alert("title: {#title} head: {#keyword}")
    add item to list(%title,"{#title},{#keyword}","Don\'t Delete","Global")
    divider
    set(#next,$next list item(%urls),"Global")
}
save to file("{$special folder("Application")}\\test.txt",%title)
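To make the two lookaround patterns in the script easier to follow, here is a small Python sketch (not Ubot) that applies them to a made-up page source. The `find_first` helper and the sample HTML are hypothetical. One deliberate change from the script: the title pattern uses the lazy `.*?` instead of `.*`, so the match stops at the first `</title>`.

```python
import re

# (?<=X) means "preceded by X" and (?=Y) means "followed by Y";
# neither marker is included in the match itself.
TITLE_RE = re.compile(r"(?<=<title>).*?(?=</title>)", re.DOTALL | re.IGNORECASE)
KEYWORDS_RE = re.compile(
    r'(?<=<meta name="keywords" content=").*?(?=">)', re.IGNORECASE
)


def find_first(pattern, text):
    """Return the first match of `pattern` in `text`, or "" if none."""
    match = pattern.search(text)
    return match.group(0) if match else ""


# Made-up page source for illustration only.
sample = (
    "<html><head><title>Acme Widgets</title>"
    '<meta name="keywords" content="widgets, gadgets"></head></html>'
)

print(find_first(TITLE_RE, sample))     # Acme Widgets
print(find_first(KEYWORDS_RE, sample))  # widgets, gadgets
```

The keywords pattern only matches when the attributes appear in exactly that order and spacing, so pages that write the tag differently will return an empty result.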
fastlinks 16 Posted March 9, 2019

to get page source:
set(#pagehtml,$document text,"Global")
or
set(#pagehtml,$page scrape("<html","</html>"),"Global")

to find text value in between x & y:
<license>abc</license>
(?<=<license>).*(?=</license>)
answer: abc

some other useful regex

match email:
([a-z0-9][-a-z0-9_\+\.]*[a-z0-9])@([a-z0-9][-a-z0-9\.]*[a-z0-9]\.(arpa|root|aero|biz|me|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)|([0-9]{1,3}\.{3}[0-9]{1,3}))

to match url:
(?:https?:\/\/)?(ibilik\.)([a-zA-Z\.]{2,6})([\/\w\.-]*)*\/?

to match phone:
(\S*\d+\S*){8,16}
(\S*[\d ]\S*){8,16}
\(?\d+\)?[-.\s]?\d+[-.\s]?\d+

Cynthia Ol2.345.6789 ( contact/whatsApp ) washing machine
Result: Ol2.345.6789
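The license and phone examples above can be checked outside Ubot. The following is a minimal Python sketch, assuming the first phone pattern in the list is the one that produced the quoted result; the `first_match` helper and the "(012) 345-6789" sample are made up for illustration.

```python
import re


def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, or "" if none."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


# Text between <license> and </license>, as in the post.
license_pattern = r"(?<=<license>).*(?=</license>)"
print(first_match(license_pattern, "<license>abc</license>"))

# First phone pattern: one run of non-space characters holding 8 to 16 digits.
loose_phone = r"(\S*\d+\S*){8,16}"
ad_text = "Cynthia Ol2.345.6789 ( contact/whatsApp ) washing machine"
print(first_match(loose_phone, ad_text))

# Third phone pattern: digit groups with optional brackets and separators.
strict_phone = r"\(?\d+\)?[-.\s]?\d+[-.\s]?\d+"
print(first_match(strict_phone, "call (012) 345-6789 today"))
```

The loose pattern tolerates letters standing in for digits (the "O" and "l" in "Ol2"), which is why it picks up the obfuscated number. Its nested repetition can be slow on long runs of text, so it is best applied to short strings.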
fastlinks 16 Posted March 16, 2019

.
LuckyUboter 0 Posted September 5, 2019

Hey, thanks fastlinks... uhm, in which element exactly would we insert this code which you so generously shared?