UBot Underground

How can I scrape if one object in a list is different?



I'm following a tutorial online but came across a problem.

I'm currently trying to scrape the different movie titles and thumbnail URLs here: http://www.allposter...ers_c18781_.htm

The tutorial tells me to scrape the thumbnail URL by using class="thmbd" as the element to scrape.

All of the thumbnails are enclosed in this class except for one, which uses class="thmbd_nonShadow".

This causes a problem: I scrape 36 movie titles but only 35 thumbnail URLs, which throws my table off by one.

I'm trying to figure this out but can't seem to get it! Any ideas?

Here is the code that I currently have:

navigate("http://www.allposters.com/-st/Money-Posters_c18781_.htm", "Wait")
clear list(%moviethumbnail)
ui stat monitor("Movie Thumb Nails", $list total(%moviethumbnail))
add list to list(%moviethumbnail, $scrape attribute(<class="thmbd">, "src"), "Delete", "Global")

 

 

It scrapes all of the thumbnail URLs EXCEPT for the picture called "Clock and Dollar Bills".


Hi,

 

I changed this line of code. It now uses a wildcard.

 

Changed Code:

add list to list(%moviethumbnail, $scrape attribute(<class=w"thmbd*">, "src"), "Delete", "Global")
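To see why the wildcard catches the odd thumbnail, here is a minimal sketch in Python rather than UBot. It assumes UBot's w"..." wildcard behaves like ordinary shell-style globbing, which is an assumption on my part; details such as case sensitivity may differ. The third class name in the list is only there to show a non-match.

```python
from fnmatch import fnmatchcase

# Class names mentioned in the thread, plus one that should not match.
class_names = ["thmbd", "thmbd_nonShadow", "productTitle"]

# Mirrors the w"thmbd*" selector: "thmbd" followed by anything (or nothing).
pattern = "thmbd*"

matches = [name for name in class_names if fnmatchcase(name, pattern)]
print(matches)  # ['thmbd', 'thmbd_nonShadow']
```

Both thumbnail classes match in a single pass, so the results stay in page order.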

 

Kevin


Son of a. Thanks!! :D

 

Now let's say that instead the classes were <class="thmbd"> and <class="nonShadow"> (instead of <class="thmbd_nonShadow">).

 

In that case the wildcard solution wouldn't work. How would I do it in that case?

 

You would need to use some of these functions right?

 

$element parent

$element child

$element sibling

 

I've been trying to figure out how to use them with no luck!


Hi,

 

Good question.

 

Off the top of my head, I would add this right after the other add list to list:

add list to list(%moviethumbnail, $scrape attribute(<class="nonshadow">, "src"), "Delete", "Global")

Then I would look at the page HTML to see what the elements I want to scrape have in common, and use that as the selector.
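The difference between two separate scrapes and one scrape on a shared trait can be sketched in Python (not UBot). The markup, the class names, and the `scrape` helper below are hypothetical stand-ins for the real page, and only illustrate the ordering effect.

```python
from html.parser import HTMLParser

# Hypothetical markup: three thumbnails, the middle one using the odd class.
SAMPLE_HTML = """
<img class="thmbd" src="a.jpg">
<img class="nonShadow" src="b.jpg">
<img class="thmbd" src="c.jpg">
"""


class ThumbCollector(HTMLParser):
    """Collects src values of <img> tags whose class is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("class") in self.wanted:
            self.sources.append(attr_map.get("src"))


def scrape(html, wanted):
    """Return matching src values in the order they appear in the page."""
    collector = ThumbCollector(wanted)
    collector.feed(html)
    return collector.sources


# Two separate passes: the odd thumbnail lands at the end of the list.
two_passes = scrape(SAMPLE_HTML, ["thmbd"]) + scrape(SAMPLE_HTML, ["nonShadow"])

# One pass matching either class: page order is preserved.
one_pass = scrape(SAMPLE_HTML, ["thmbd", "nonShadow"])

print(two_passes)  # ['a.jpg', 'c.jpg', 'b.jpg']
print(one_pass)    # ['a.jpg', 'b.jpg', 'c.jpg']
```

A single selector that covers every thumbnail keeps the list in page order, which matters once the list is lined up against the titles.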

 

Kevin


If I did that wouldn't the <class="nonshadow"> element be placed at the bottom of the list? I'm scraping two sets of data into two different columns.

 

Say I scrape all of the movie titles into column 1, then I scrape the moviethumbnails into column 2. This would throw my data set off then right?

 

Is there an OR operator in UBot?


You could match on the image URL instead:

clear list(%Thumb)
add list to list(%Thumb, $scrape attribute(<src=w"*cache2.allpostersimages.com/*.jpg">, "src"), "Delete", "Global")


Hi,

 

"If I did that wouldn't the <class="nonshadow"> element be placed at the bottom of the list?"

Yes, it would.

"I'm scraping two sets of data into two different columns."

For the two columns, use different lists.

"Say I scrape all of the movie titles into column 1, then I scrape the moviethumbnails into column 2. This would throw my data set off then right?"

Yes. Without separate lists, the data would all end up in the same list.

 

Sample code that combines the lists:

navigate("http://www.allposters.com/-st/Money-Posters_c18781_.htm", "Wait")
wait for browser event("Everything Loaded", "")
ui stat monitor("Movie Thumb Nails", $list total(%moviethumbnail))
clear list(%moviethumbnail)
add list to list(%moviethumbnail, $scrape attribute(<class=w"thmbd*">, "src"), "Don\'t Delete", "Global")
clear list(%moviedescription)
add list to list(%moviedescription, $scrape attribute(<class="productTitle">, "innertext"), "Don\'t Delete", "Global")
clear list(%movietype)
add list to list(%movietype, $scrape attribute(<class="pttl1 productType">, "innertext"), "Don\'t Delete", "Global")
clear list(%moviedata)
loop($list total(%moviedescription)) {
    if($comparison($list position(%moviedescription), "<", $list total(%moviedescription))) {
        then {
            set(#moviedataitem, "{$next list item(%moviedescription)},{$next list item(%movietype)},{$next list item(%moviethumbnail)}", "Global")
            add item to list(%moviedata, #moviedataitem, "Delete", "Global")
        }
        else {
        }
    }
}
save to file("c:\\downloads\\sample-movie-data.csv", %moviedata)

 

Sample code that loads the lists into a table:

navigate("http://www.allposters.com/-st/Money-Posters_c18781_.htm", "Wait")
wait for browser event("Everything Loaded", "")
ui stat monitor("Movie Thumb Nails", $list total(%moviethumbnail))
clear list(%moviethumbnail)
add list to list(%moviethumbnail, $scrape attribute(<class=w"thmbd*">, "src"), "Don\'t Delete", "Global")
clear list(%moviedescription)
add list to list(%moviedescription, $scrape attribute(<class="productTitle">, "innertext"), "Don\'t Delete", "Global")
clear list(%movietype)
add list to list(%movietype, $scrape attribute(<class="pttl1 productType">, "innertext"), "Don\'t Delete", "Global")
clear table(&allpostdata)
add list to table as column(&allpostdata, 0, 0, %moviedescription)
add list to table as column(&allpostdata, 0, 1, %movietype)
add list to table as column(&allpostdata, 0, 2, %moviethumbnail)
save to file("c:\\downloads\\sample-movie-table.csv", &allpostdata)
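Loading lists side by side only works when every list has the same number of items, which is exactly what went wrong with 36 titles and 35 thumbnails. As a rough illustration in Python (not UBot), the sketch below checks the lengths before writing the columns. The `lists_to_csv` helper and the sample values are made up for the example.

```python
import csv
import io


def lists_to_csv(*columns):
    """Write equal-length lists side by side as CSV columns."""
    lengths = [len(column) for column in columns]
    if len(set(lengths)) > 1:
        # A short list would silently shift every later row, so stop here.
        raise ValueError(f"column lengths differ: {lengths}")
    buffer = io.StringIO()
    csv.writer(buffer).writerows(zip(*columns))
    return buffer.getvalue()


titles = ["Poster A", "Poster B", "Poster C"]
types = ["Art Print", "Poster", "Poster"]
thumbs = ["a.jpg", "b.jpg", "c.jpg"]

print(lists_to_csv(titles, types, thumbs))

try:
    # Simulate the missing thumbnail from the original question.
    lists_to_csv(titles, types, thumbs[:2])
except ValueError as error:
    print(error)  # column lengths differ: [3, 3, 2]
```

Comparing the list totals before building the table is a cheap way to catch an off-by-one like the one in the first post.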

 

Kevin


Hi,

 

Maybe something like this.

Code:

navigate("http://www.allposters.com/-st/Money-Posters_c18781_.htm", "Wait")
wait for browser event("Everything Loaded", "")
ui stat monitor("Movie Thumb Nails:", $list total(%moviethumbnaildata))
clear list(%moviethumbnaildata)
add list to list(%moviethumbnaildata, $scrape attribute(<class=w"thmbbx* thmbbxhtsml*">, "outerhtml"), "Delete", "Global")
set(#delim, ",", "Global")
clear list(%moviedata)
loop($list total(%moviethumbnaildata)) {
if($comparison($list position(%moviethumbnaildata), "<", $list total(%moviethumbnaildata))) {
 then {
	 set(#moviedataitem, $trim($replace($next list item(%moviethumbnaildata), $new line, $nothing)), "Global")
	 set(#moviedetailpageurl, $replace regular expression($replace regular expression(#moviedataitem, "\"><img class=\".*", $nothing), ".*--><a href=\"", $nothing), "Global")
	 set(#baseurl, $replace regular expression(#moviedetailpageurl, "/-.*", ""), "Global")
	 set(#movietitle, $replace regular expression($replace regular expression(#moviedataitem, "<\\/p><a class=\".*", $nothing), ".*class=\"productTitle\".*?\">", $nothing), "Global")
	 set(#movietype, $replace regular expression($replace regular expression(#moviedataitem, "<\\/div><span.*", $nothing), ".*productType\">", $nothing), "Global")
	 set(#movieartist, "NA", "Global")
	 set(#movieartistworkurl, "NA", "Global")
	 set(#movieprice, "NA", "Global")
	 if($contains(#moviedataitem, "class=\"artistName\" style=\"display: none; \">")) {
		 then {
			 set(#movieartist, $replace regular expression($replace regular expression(#moviedataitem, "<\\/p><a title.*", $nothing), ".*class=\"artistName\" style=\"display: none; \">", $nothing), "Global")
			 set(#movieartistworkurl, "{#baseurl}{$replace regular expression($replace regular expression(#moviedataitem, "\" class=\"catlnkAP.*", $nothing), ".*<\\/p><a title=\".*href=\"", $nothing)}", "Global")
		 }
		 else {
		 }
	 }
	 if($contains(#moviedataitem, "<span class=\"price\">")) {
		 then {
			 set(#movieprice, $trim($replace regular expression($replace regular expression(#moviedataitem, "<\\/span> <div class=.*", $nothing), ".*<span class=\"price\">", $nothing)), "Global")
		 }
		 else {
		 }
	 }
	 if($contains(#moviedataitem, "<span class=\"galleryPriceStrike\">")) {
		 then {
			 set(#movieprice, $trim($replace($replace regular expression($replace regular expression(#moviedataitem, "</span><img alt=\"\".*", $nothing), ".*<span class=\"galleryPriceStrike\">", $nothing), "</span> <span class=\"pprdc\">", " Sale Price: ")), "Global")
		 }
		 else {
		 }
	 }
	 set(#moviepicturesizes, "NA", "Global")
	 if($contains(#moviedataitem, "title=\"\" style=\"\">")) {
		 then {
			 set(#moviepicturesizes, $replace regular expression($replace regular expression(#moviedataitem, "<\\/span><\\/div> <\\/div>.*", $nothing), ".*title=\"\" style=\"\">", $nothing), "Global")
		 }
		 else {
		 }
	 }
	 set(#moviethumbnailimgsrc, $replace regular expression($replace regular expression(#moviedataitem, "\".alt=\".*", $nothing), ".*<img class=\".*\" src=\"", $nothing), "Global")
	 add item to list(%moviedata, "{#movietitle}{#delim}{#movietype}{#delim}{#movieartist}{#delim}{#movieartistworkurl}{#delim}{#movieprice}{#delim}{#moviepicturesizes}{#delim}{#moviethumbnailimgsrc}", "Delete", "Global")
 }
 else {
 }
}
}
save to file("c:\\downloads\\sample-movie-data.csv", %moviedata)
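The idea behind the script above is to scrape one container per product and pull every field out of that container, so a missing field can never shift the other columns. Here is a small Python sketch of the same approach (not UBot). The markup, field names, and patterns are hypothetical and much simpler than the real page, and regex-based HTML parsing like this is fragile whenever the markup changes.

```python
import re

# Hypothetical container blocks, one per product, as an outer-HTML scrape
# might return them. The second product has no price.
BLOCKS = [
    '<div class="thmbbx"><img class="thmbd" src="a.jpg">'
    '<p class="productTitle">Poster A</p><span class="price">$9.99</span></div>',
    '<div class="thmbbx"><img class="thmbd_nonShadow" src="b.jpg">'
    '<p class="productTitle">Poster B</p></div>',
]


def first_match(pattern, text, default="NA"):
    """Return the first capture group, or a placeholder when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


def parse_block(block):
    """Extract all fields for one product from its own container."""
    return {
        "title": first_match(r'class="productTitle">([^<]*)<', block),
        "thumb": first_match(r'<img class="thmbd[^"]*" src="([^"]*)"', block),
        "price": first_match(r'<span class="price">([^<]*)<', block),
    }


rows = [parse_block(block) for block in BLOCKS]
for row in rows:
    print(row["title"], row["thumb"], row["price"], sep=",")
```

This prints `Poster A,a.jpg,$9.99` and `Poster B,b.jpg,NA`: each row is built from a single container, and the placeholder is reset for every item rather than carried over from the previous one.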

 

Kevin


You can just scrape the first column of data into one list and the second column of data into a second list.

Then just add both lists as columns to a table:

add list to table as column(&BothColumnsTable, 0, 0, %FirstColumnList)
add list to table as column(&BothColumnsTable, 0, 1, %SecondColumnList)

HTH
