UBot Underground

Page Scrape for collecting data



Hello,

I watched the tutorials but couldn't find anything about scraping data from a page.

I did find a tutorial that scrapes data from a table, but I don't have the "choose ancestor" feature.

I tried using a regex with page scrape, but it didn't work.

Can the Standard version do data mining by scraping a page, or do I have to use the Professional version?

So far I've found a faster solution using cURL and preg_match in PHP. I really hope UBot can make it faster.

Any ideas or tutorials are appreciated.

Thanks


Also, update your profile to show us which version of UBot you have, as well as your computing environment. Often the right person who sees your setup can identify the problem, if not provide the correct answer.


So did you watch all of these videos?

 

http://ubotstudio.com/tutorials.aspx

 

Yup, I watched them all.

This is the page that I'm trying to scrape (I got the info from ubottutorials.com):

http://www.ip-adress.com/proxy_list/?k=time&d=desc

Any idea how to scrape the data, at least the proxy, type, and country?

In PHP I scraped it with something like this:

\<TR class\=\".*\"\>\<TD\>(.*)\<\/TD\>\<TD\>(.*)\<\/TD\>\<TD\>(.*)\<\/TD\>

and I could get the proxy, type, and country.
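For readers who want to try the same idea outside PHP, here is a minimal Python sketch of a three-cell row scrape. The sample HTML, the `ROW_PATTERN` name, and the `scrape_rows` helper are assumptions made up for illustration; the real page markup may differ. It uses non-greedy groups so each capture stays inside its own cell.

```python
import re

# Hypothetical sample rows, shaped like the table described in the post.
SAMPLE_HTML = (
    '<TR class="odd"><TD>10.0.0.1:8080</TD><TD>Anonymous</TD><TD>Germany</TD></TR>\n'
    '<TR class="even"><TD>10.0.0.2:3128</TD><TD>Transparent</TD><TD>Brazil</TD></TR>\n'
)

# Non-greedy groups keep each capture inside its own cell.
ROW_PATTERN = re.compile(
    r'<TR class="[^"]*"><TD>(.*?)</TD><TD>(.*?)</TD><TD>(.*?)</TD>',
    re.IGNORECASE,
)


def scrape_rows(html):
    """Return a list of (proxy, type, country) tuples, one per table row."""
    return ROW_PATTERN.findall(html)


rows = scrape_rows(SAMPLE_HTML)
for proxy, proxy_type, country in rows:
    print(proxy, proxy_type, country)
```

Because the class is matched with `[^"]*`, odd and even rows are both picked up in a single pass.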

 

Also, the page has class=odd and class=even rows, which is why I have to use $page_scrape twice, once for the odd list and once for the even list. I would like to know if I could do these steps:

1. Choose the innerhtml attribute of the table.

2. Set a variable to hold the HTML.

3. Use $replace to change class=odd and class=even to class=data.

4. Set the innerhtml back on the page.

The idea is to remove class=odd/even and have a single class=data for easy scraping.

Is it possible?
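The replace step in that plan can be sketched in Python as follows. This only illustrates the string rewrite, not how UBot reads or writes innerhtml; `table_html` is a hypothetical stand-in for the scraped markup and `normalize_row_classes` is a made-up helper name.

```python
import re

# Hypothetical stand-in for the table's innerHTML.
table_html = (
    '<tr class="odd"><td>10.0.0.1:8080</td></tr>'
    '<tr class="even"><td>10.0.0.2:3128</td></tr>'
)


def normalize_row_classes(html):
    """Rewrite class="odd" and class="even" to class="data"."""
    return re.sub(r'class="(?:odd|even)"', 'class="data"', html)


normalized = normalize_row_classes(table_html)

# With a single class, one pattern is enough to scrape every row.
data_rows = re.findall(r'<tr class="data"><td>(.*?)</td>', normalized)
print(data_rows)  # ['10.0.0.1:8080', '10.0.0.2:3128']
```

In a purely regex-based scrape, the alternation `(?:odd|even)` could also go straight into the row pattern, which would make the rewrite step unnecessary.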


Wow, nice approach. Time to fire up RegexBuddy and try to decipher your regex.

So, if I would like to have the type and country as well, I should use three lists, am I right? Or can I build a table?

Which regex flavor does UBot implement: PCRE, POSIX, Java, Perl, or something else?


I would rather see you focus on your UBot skills than on regex. Regex is good for difficult things, but it is overkill for the plain scraping that you will likely be doing.

 

I hardly ever use regex because the native UBot nodes are great at what they do.


@BotBuddy, thanks for your suggestion, but I think having regex skills will complement the way we build bots.

Unfortunately, there is no single regex standard.

For example:

I built a page to check my IP. Let's say it returns 118.138.50.126 (no HTML, just plain text).

Then I choose the attribute->outertext->SearchString:

([0-9]{2,3}\.){2}

I was hoping to get 118.138. (I tested the pattern in RegexBuddy),

but unfortunately it returns 118.138.50.126.

I also tested with the pattern that you suggested in the previous bot:

^\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){2}\b

 

 

It worked in RegexBuddy but failed in UBot.

So I wonder if it has something to do with the regex flavor that UBot supports.


That pattern works fine for me in UBot. The only issue might be that UBot uses {1}, {2}, etc. to reference things that you want to put into a string, so it may be getting confused by that.

 

http://screencast.com/t/9LQkp04q8

 

As far as what we support, we are compatible with PCRE. Let me know if that helps.


It looks like you're misunderstanding the choosing system. When you choose something by an attribute, and then scrape that same attribute, it is going to be the entire attribute's content, not just the part you matched. The choosing system will find the right element, and to modify the attribute value you will have to scrape that attribute and then change it. Hope that makes more sense. Here's the modified version that seems to work for me.

 

http://screencast.com/t/Bi0ZiytsMk
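The distinction between choosing by a pattern and extracting the matched part can be illustrated with a small Python sketch. This is an analogy only, not UBot's implementation; the `choose` and `extract` helpers and the sample `elements` list are hypothetical.

```python
import re

PATTERN = re.compile(r'([0-9]{2,3}\.){2}')

# Hypothetical element texts standing in for elements on a page.
elements = ["118.138.50.126", "no address here"]


def choose(texts):
    """Mimic choosing: return the whole text of every element the pattern matches."""
    return [t for t in texts if PATTERN.search(t)]


def extract(text):
    """Return only the matched portion, or None when nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


chosen = choose(elements)
print(chosen)              # ['118.138.50.126']
print(extract(chosen[0]))  # 118.138.
```

Choosing uses the pattern only as a filter, so the full value comes back. Getting `118.138.` takes a second step that works on the scraped value.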


@Eddie Waller, I tried to find your constant "Find Regular Expression". Is it available in the Standard Edition? If not, can I accomplish the same thing with the Standard version?


Ooh sorry, I haven't looked much at what's available in each version of UBot haha. Here's a version using $replace regular expression, where I replace the end of the IP address with an empty parameter.

 

http://screencast.com/t/vZ7QpPUFmH7F
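The replace-based approach can be sketched in Python with `re.sub`. The exact pattern used in the screencast is not shown in the thread, so the one below is an assumption, and `first_two_octets` is a made-up helper name.

```python
import re


def first_two_octets(ip):
    """Drop the last two octets, keeping the leading part with its trailing dot."""
    # Assumed pattern: two dot-separated number groups anchored at the end.
    return re.sub(r'[0-9]{1,3}\.[0-9]{1,3}$', '', ip)


print(first_two_octets("118.138.50.126"))  # 118.138.
```

Replacing the unwanted tail with an empty string leaves the part that the earlier posts tried to capture directly.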
