UBot Underground

Cafepress Scraper



So I decided to fire up UBot Studio and try to learn how to create a scraper, as I need one for Cafepress.

 

I'm able to scrape the results I wanted, but I'm having some issues with the saved CSV file. For some reason it's putting the titles and URLs on separate lines.

 

I'm also wondering how to handle scraping multiple pages (over 100 just for this product type). 

ui text box("URL To Scrape:", #url)
ui stat monitor("Product Titles:", $list total(%titles))
ui stat monitor("Product Thumbnails:", $list total(%thumbnails))
ui stat monitor("Product Images:", $list total(%images))
ui stat monitor("Product Description:", $list total(%desc))
ui stat monitor("Price:", $list total(%price))
ui stat monitor("Product URL:", $list total(%produrl))
define Scrape Data {
    clear table(&products)
    clear list(%titles)
    clear list(%thumbnails)
    clear list(%images)
    clear list(%desc)
    clear list(%price)
    clear list(%produrl)
    navigate(#url, "Wait")
    wait(5)
    add list to list(%titles, $scrape attribute(<class="grid-title">, "outertext"), "Don\'t Delete", "Global")
    add list to table as column(&products, 0, 0, %titles)
    add list to list(%thumbnails, $scrape attribute(<class="productImage">, "src"), "Don\'t Delete", "Global")
    add list to table as column(&products, 0, 1, %thumbnails)
    add list to list(%price, $scrape attribute(<class="qv-price">, "outertext"), "Don\'t Delete", "Global")
    add list to table as column(&products, 0, 2, %price)
    add list to list(%produrl, $scrape attribute(<itemprop="name">, "href"), "Don\'t Delete", "Global")
    add list to table as column(&products, 0, 3, %produrl)
    loop($list total(%produrl)) {
        navigate($next list item(%produrl), "Wait")
        add item to list(%images, $scrape attribute(<itemprop="image">, "fullsrc"), "Don\'t Delete", "Global")
        add item to list(%desc, $scrape attribute(<itemprop="description">, "outertext"), "Don\'t Delete", "Global")
    }
    add list to table as column(&products, 0, 4, %images)
    save to file("Desktop\\cafepressresults.csv", &products)
}
Scrape Data()
alert("I\'m done! What Else You Got For Me")

Okay, so I made a change to how I scrape the title, which has resolved my issue with the trailing space and carriage return. Now I'm adding the product description to the results, but I'm running into the same issue: it outputs to a separate line because of the <li> tags.

 

ui text box("URL To Scrape:", #url)
ui text box("Pages To Scrape:", #pages)
ui stat monitor("Product Titles:", $list total(%titles))
ui stat monitor("Product Thumbnails:", $list total(%thumbnails))
ui stat monitor("Product Images:", $list total(%images))
ui stat monitor("Product Description:", $list total(%desc))
ui stat monitor("Price:", $list total(%price))
ui stat monitor("Product URL:", $list total(%produrl))
define Scrape Data {
    clear table(&products)
    clear list(%titles)
    clear list(%thumbnails)
    clear list(%images)
    clear list(%desc)
    clear list(%price)
    clear list(%produrl)
    set user agent("Firefox 6")
    navigate(#url, "Wait")
    wait(5)
    loop(#pages) {
        add list to list(%titles, $scrape attribute(<itemprop="name">, "title"), "Don\'t Delete", "Global")
        add list to table as column(&products, 0, 0, %titles)
        add list to list(%thumbnails, $scrape attribute(<class="productImage">, "src"), "Don\'t Delete", "Global")
        add list to table as column(&products, 0, 1, %thumbnails)
        add list to list(%price, $scrape attribute(<class="qv-price">, "outertext"), "Don\'t Delete", "Global")
        add list to table as column(&products, 0, 2, %price)
        add list to list(%produrl, $scrape attribute(<itemprop="name">, "href"), "Don\'t Delete", "Global")
        add list to table as column(&products, 0, 3, %produrl)
        click(<innertext="NEXT »">, "Left Click", "No")
        wait(6)
    }
    loop($list total(%produrl)) {
        navigate($next list item(%produrl), "Wait")
        add item to list(%images, $scrape attribute(<itemprop="image">, "fullsrc"), "Don\'t Delete", "Global")
        add item to list(%desc, $scrape attribute(<((tagname="div" AND itemprop="description") AND class="box pdp-unit pdp-producttypedesc")>, "innerhtml"), "Don\'t Delete", "Global")
    }
    add list to table as column(&products, 0, 4, %images)
    add list to table as column(&products, 0, 5, %desc)
}
Scrape Data()
plugin command("DatabaseCommands.dll", "connect to database", "server=11.22.33.44;uid=user1; pwd=password; database=userdb; port=3306; pooling=false") {
    plugin command("DatabaseCommands.dll", "query", "insert INTO cafepressdata (title, thumbnail, price, produrl, image, description) VALUES ('{$plugin function("TableCommands.dll", "$list from table", &products, "Column", 0)}');")
}
save to file("C:\\Users\\botmaster\\Desktop\\zzzscraper.csv", &products)
alert("I\'m done! What Else You Got For Me")

 

zzzscraper.csv


If you don't want the HTML tags in there, you can drop in a $replace and then drop the $scrape attribute into it when you are adding the items to your list.
Just replace the tags that you don't want with a space, i.e. literally press the space bar instead of typing a character.
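
As a rough sketch of that idea (not tested against the live Cafepress page), the description line in the second loop could be split into a few steps like this. It assumes the scraped HTML only contains <li> and </li> tags plus line breaks, and #rawdesc is just a made-up variable name for illustration:

```
comment("Sketch only: strip the <li> tags and line breaks before storing the description")
set(#rawdesc, $scrape attribute(<((tagname="div" AND itemprop="description") AND class="box pdp-unit pdp-producttypedesc")>, "innerhtml"), "Global")
set(#rawdesc, $replace(#rawdesc, "<li>", " "), "Global")
set(#rawdesc, $replace(#rawdesc, "</li>", " "), "Global")
set(#rawdesc, $replace(#rawdesc, $new line, " "), "Global")
add item to list(%desc, $trim(#rawdesc), "Don\'t Delete", "Global")
```

If other tags turn up in the description, a single $replace regular expression with a pattern along the lines of <[^>]+> might be a tidier way to strip them all at once, though that pattern is only a suggestion and would need checking against the actual page.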
