Understanding "scrape Attribute" Vs "find Regular Expression" And Properly Using The "element Editor"

steelersfan · August 25, 2017

So, to my current understanding the function "scrape attribute" is used to get/find code that is "inside" of an attribute, but not to actually pull anything specific out of it?

And "find regular expression" is used to actually pull the desired text from some scraped code/text? (when you already know the regular expression needed)

Using the element editor, it seems that the regular expression selector is used to target matches, but only under specific conditions? This is the part where I am getting lost. The same regex that works for me using "find regular expression", will not work in the element editor. I am struggling to understand what I am doing wrong here...

I am reminded of a previous issue that I had with using the advanced element editor, where helloinsomnia helped me to understand that I wouldn't need to use regex in the advanced element selector, as it was enough to use wildcards, but I can't help but wonder what good is using regex in the advanced element selector? (I mean it was included for a reason, so what is that reason?)

I want to try and understand how to best use each of these scraping tools (particularly the proper use of the advanced element editor), but each new time I utilize regex, it gets more confusing than the last time. Does anyone understand my confusion? Or perhaps have any insight as to what is making this so confusing in the first place?

Code Docta (Nick C.) · August 25, 2017

The Element Editor and Regex are like a Swiss Army knife. You can use Regex on the document text, just as you can with XPath, and they can all be used to get the same results. I find XPath the most versatile, yet sometimes I still need Regex to clean up the result a little bit. The same goes for $scrape attribute. I have never used Regex in the Element Editor.

Regex in UBot is really meant for a single line or string of text (it is not good for multi-line text in some cases). $find regex in particular does not always work as I expect on some HTML pages, but it does fine on a UBot list. Again, I use another method first, and then bring in Regex if I need to remove or extract something specific.
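One common reason a regex behaves differently on multi-line HTML than on a single string is that `.` does not match a newline by default in most regex flavors. The sketch below shows this in Python; whether it is the cause in UBot is an assumption, and the sample HTML is made up for illustration.

```python
import re

# Hypothetical scraped HTML that spans several lines.
html = '<div class="comment">\n  Great video!\n</div>'

# By default "." stops at a newline, so this pattern finds nothing.
single_line = re.search(r'<div class="comment">(.*)</div>', html)

# With DOTALL, "." also matches newlines and the capture succeeds.
multi_line = re.search(r'<div class="comment">(.*?)</div>', html, re.DOTALL)

print(single_line)                  # None
print(multi_line.group(1).strip())  # Great video!
```

If a pattern works on a list item but not on page HTML, checking for line breaks in the scraped text is a cheap first step.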

$scrape attribute allows you to get a list of elements as well as a single element without having to know Regex or XPath. XPath is easy to learn, and Dan has a free plugin for it and a great tutorial. Generally, if I am using XPath and can't dial in what I am trying to scrape, I get the outer/inner HTML and then use Regex, but that is rare. You can achieve the same with the Element Editor, but I feel XPath is much easier to grasp and more flexible. Since learning XPath I rarely use anything else.
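For readers who have not seen XPath, here is a minimal sketch of the idea using Python's standard library, which supports only a subset of XPath. It is not the UBot plugin; the page markup and the "Download Now" button's attributes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment with two similar buttons.
page = """<html><body>
<input type="submit" value="Download Now" name="download" />
<input type="submit" value="Earn Commission" name="subscribe" />
</body></html>"""

root = ET.fromstring(page)

# Select the input whose name attribute is "subscribe", anywhere in the tree.
button = root.find(".//input[@name='subscribe']")
value = button.get("value")

print(value)  # Earn Commission
```

The expression names the tag and the attribute that makes it unique, which is the same targeting decision the Element Editor asks for.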

Hope that helps you.

Regards,

CD

HelloInsomnia · August 25, 2017

The scrape attribute function has an input field that I call the selector input field, because it takes the selector language as input and will not accept just any input. Ubot makes this easy via the element selector (the button with the green brackets). When you use the element selector you highlight different HTML elements on the page, and when you click on one, Ubot works out which element it is by looking at the element's attributes and trying to figure out what makes that element unique.

To use it properly you need to understand what's going on. HTML is a tagging language, and every tag is essentially an element Ubot can target. But there are only so many tags, and pages typically use the same tag over and over again. So if you have a button and you want to scrape its value (like the earn commission button on the Ubot resources page), you need to tell Ubot somehow to scrape that particular button and not, say, the download now button instead (they are actually different kinds of elements, but pretend for the example that they are the same).

When you open the advanced element editor you can be more precise and tell Ubot exactly how to target that specific element. This is where you manually look at the attributes which make up the element.

They look like so:

<tagname attribute="value" attribute2="value2">possible innerhtml</tagname>

So if you look at the earn commission button again:

<input id="upgrade_notif_link_button" style="background: #d76161; color: white; margin-bottom: 20px; margin-top: 20px" type="submit" value="Earn Commission" name="subscribe">

Input is the element; it is the HTML tag. The attributes which make it up are: id, style, type, value and name.

When you select an attribute you may choose "Exact Match" as the match type. This is on by default. In this case the button's name ("subscribe") and its id probably won't change, so either is fine to target by.

In some cases you may have something like a captcha with a URL whose last part always changes, so you may want to use a wildcard and put a * on the part that changes.

The regular expression can be used, but usually you can just use a wildcard. But because it's the point of the question, we can go over a scenario where you may want to use it.
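To make the tag/attribute breakdown concrete, here is a small Python sketch that parses the button markup shown above and lists its attributes. The `AttributeLister` class is a hypothetical helper written for this example, not part of UBot.

```python
from html.parser import HTMLParser


class AttributeLister(HTMLParser):
    """Records each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))


button_html = (
    '<input id="upgrade_notif_link_button" '
    'style="background: #d76161; color: white; margin-bottom: 20px; margin-top: 20px" '
    'type="submit" value="Earn Commission" name="subscribe">'
)

parser = AttributeLister()
parser.feed(button_html)
tag, attributes = parser.elements[0]

print(tag)                  # input
print(sorted(attributes))   # ['id', 'name', 'style', 'type', 'value']
print(attributes["value"])  # Earn Commission
```

Any of these attributes can serve as the thing you match on; the stable ones (id, name) are usually the safest choice.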
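As a rough illustration of wildcard matching, the Python sketch below uses shell-style patterns, where `*` stands for any run of characters. It assumes Ubot's wildcard behaves similarly; the URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# The fixed part of the captcha URL, with * for the part that changes.
wildcard = "http://domain.com/captcha/*"

# Hypothetical src values found on a page.
srcs = [
    "http://domain.com/captcha/a81f3c",
    "http://domain.com/captcha/77b2e0",
    "http://domain.com/logo.png",
]

matches = [src for src in srcs if fnmatchcase(src, wildcard)]

print(matches)  # the two captcha URLs, not the logo
```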

Say you have 2 captchas on the page each with the same beginning and middle of the url like so:

http://domain.com/captchas?id=

And the first always had an id with a word and the second always had an id with a number.

If you wanted to target the second one you could use the regular expression to target the src attribute of the image (which is basically the url of the image) and then use a regex to specify you want a number and not just a wildcard.
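The two-captcha scenario can be sketched in Python as follows. The id values are made up, and the regex syntax is Python's, so treat it as an illustration of the idea rather than the exact expression to paste into Ubot.

```python
import re

# Hypothetical src attributes of the two captcha images.
srcs = [
    "http://domain.com/captchas?id=apple",
    "http://domain.com/captchas?id=48213",
]

# A wildcard-style pattern accepts anything after "id=", so it matches both.
anything = re.compile(r"http://domain\.com/captchas\?id=.*")

# Requiring digits only narrows the match to the second captcha.
numeric_id = re.compile(r"http://domain\.com/captchas\?id=\d+")

wildcard_hits = [s for s in srcs if anything.fullmatch(s)]
numeric_hits = [s for s in srcs if numeric_id.fullmatch(s)]

print(len(wildcard_hits))  # 2
print(numeric_hits)        # ['http://domain.com/captchas?id=48213']
```

The regex is doing selection here, not extraction: it decides which element is the target.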

In short, when you use the advanced element editor the match type helps you target the attribute you are looking for so regex is only applied to that attribute and only applied to tell Ubot what you want to target.

So if you wanted to target the innertext and apply a regular expression to scrape something out of it, you would first scrape the innertext using scrape attribute, and in most cases you wouldn't use a regular expression to do that. Once you have the innertext, you can then apply a find regular expression to it.
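The two-step flow (scrape first, then extract) might look like this in Python. `InnerTextScraper` is a hypothetical stand-in for scrape attribute, and the page markup is invented for the example; it is a simplified sketch that does not handle void tags such as `<br>` inside the target.

```python
import re
from html.parser import HTMLParser


class InnerTextScraper(HTMLParser):
    """Collects the text inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


page = '<div id="price"><span>Total:</span> $19.99 USD</div><div id="other">$5.00</div>'

# Step 1: target the element by its id and scrape its inner text.
scraper = InnerTextScraper("price")
scraper.feed(page)
inner_text = "".join(scraper.chunks)

# Step 2: apply a regex to the scraped text to pull out the part you want.
amount = re.search(r"\d+\.\d{2}", inner_text).group(0)

print(inner_text)  # Total: $19.99 USD
print(amount)      # 19.99
```

Targeting and extraction stay separate, so the regex only ever sees the small piece of text you already isolated.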

steelersfan · August 27, 2017

Thanks a lot Code Doc!

Indeed, I have found that xpath is a lot better and have Dan's tutorials, the plugin and xpath builder! They have been a great help for me in this regard thus far!

The main thing I get from your words, which is very helpful indeed, is that I should grab the text content that I need from the website/s first, and then process it later. I have learned that the hard way with my last project involving YouTube comment scraping as well!

The second most important lesson your words taught me (and correct me if I'm wrong) is that the element editor should not really be relied upon to do the bulk of the work in refining a scrape. It is better used to get closer to the target, as opposed to scraping the whole page. If I approach its use like that, my frustrations will not be as large as they have been! This, again, is something I have suspected from my own experience and from talking to others each time I run into these issues. Hopefully, I am correct in this assumption?

Anyway, thanks for helping me to better understand how to more effectively utilize the tools available. I have found it terribly important to fully understand the proper use of the element editor, as it has caused much confusion and frustration for me in the past!

Also, I will indeed try to keep to mostly using xpath over regex, whenever possible!

steelersfan · August 27, 2017

Hmm, so I think I understand now! Using the element editor to find only specific things is the best and proper use of it, and using regex to that end is the proper way. An improper or counterproductive way to use regex, is to use it to find multiple sources of the same type on the page, which is better served by using wildcards, or just scraping a specific area and then using further methods on that area you isolated. Is that a correct assessment of what I should be using the advanced element editor for?

HelloInsomnia · August 28, 2017

I'll try and make a video tomorrow to explain, it will be easier if you see it. I think you mostly got it but seeing an example would help a lot.

Code Docta (Nick C.) · August 29, 2017

You are very welcome!

It really depends on the situation and on how well you know your tools. For example, the XPath in Aymen's HTTP post plugin gets the XPath expression better than Dan's generic XPath plugin.

I use the element editor to get as much as I can, same with XPath. Then if I need to process further, I use regex to pull what I want out of the result of the previous step.

I also nest my functions so that it ends up doing all the work in the least amount of command nodes.

What I mean by nesting is putting a $scrape attribute inside a $find regex/$replace with regex in the "text" field. I am also known to nest many functions.

And helloinsomnia's explanation above is kinda the same thing using regex in the element editor, if I understand correctly. I didn't know you had to scrape first to get it to work (it does not work on its own). I didn't know it worked that way till now.

Regards,
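In Python terms, the nesting described here is just passing one function's result straight into another. The two functions below are hypothetical stand-ins named after the UBot functions, not their real implementations, and the element data is made up.

```python
import re


def scrape_attribute(element, attribute):
    """Stand-in for $scrape attribute: return one attribute of an element."""
    return element[attribute]


def find_regex(text, pattern):
    """Stand-in for $find regex: return every match of pattern in text."""
    return re.findall(pattern, text)


# Hypothetical element, represented as a plain dictionary.
element = {"innertext": "Order #1042 shipped to 3 addresses"}

# The scrape sits directly in the "text" argument of the regex call,
# so no intermediate variable (or extra command node) is needed.
order_numbers = find_regex(scrape_attribute(element, "innertext"), r"#(\d+)")

print(order_numbers)  # ['1042']
```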

CD

HelloInsomnia · August 29, 2017

Here is that video I promised:

Code Docta (Nick C.) · August 31, 2017

Awesomeness!!! Finally a good explanation of the #ubot Advanced Element Editor. Great Job Nick!!

Cheers,
CD

steelersfan · September 2, 2017

Absolutely, it cleared up my questions perfectly! Thanks Helloinsomnia!
