UBot Underground

CSV based Ecommerce data extraction bot (product description, product picture scraping)


Recommended Posts

Hey guys, 

 

First of all: great to be here as part of a wonderful, growing community of web automation specialists. I hope to learn from the best and, in turn, do my part (say thanks) by helping others.

 

I tried searching for a similar thread, but did not find what I was looking for, so here's my agenda for this project:

I wish to import product descriptions and product images from different websites around the web into a Magento site.

 

The main idea behind the bot is importing different CSV stock files from different suppliers.

Each individual list contains at least 11k products, arranged in columns as follows:

  • product name,
  • product price,
  • distributor,
  • stock value,
  • producer,
  • packaging,
  • SKU.

My main goal is to scrape the web and find data in order to populate two additional columns, namely:

  • Product description,
  • Product images.

I wish to add these two columns to the initial CSV file and import them into the Magento database, in such a way that each cell value from the CSV populates the corresponding product attribute in the Magento DB.
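
For illustration, the finished file would look something like this (a hypothetical header row and sample record; the actual column names depend on the Magento attribute codes):

product name,price,distributor,stock,producer,packaging,sku,description,images
"Sample product",9.99,"Some Distributor",150,"Some Producer","buc","ABC-123","Full product description goes here...","http://example.com/images/abc-123.jpg"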

 

So basically, the bot will have to extract data based on keywords (the product name).

Data extraction can be done directly from Google (but that will require a lot of user interaction in order to map all the required fields) or from a list of links (similar eCommerce websites, manufacturers' websites, etc., where the initial field mapping has already been done).

 

As I am quite new to UBot Studio, I would appreciate help.

In return, as stated earlier, I wish to make the code public (on this forum) so that others can benefit from this work.

 

So, anyone interested in helping out?


You'll probably get more "help" from the Skype group. Find TJ (a moderator), AKA "Botguru", and ask him to add you to the group.

 

If you have a specific question, start a post and you will get answers.

 

You can also search; this is often the best approach:

 

site:ubotstudio.com your question or topic

 

http://wiki.ubotstudio.com/wiki/Main_Page

 

This pops up when you open UBS:

 

http://www.ubotstudio.com/resources

 

The first link is Videos, with very useful tutorials.

 

It's possible, though unlikely, that anyone will have time to build a bot for you.

 

Collectively, we will help you along the way as you build it.

 

Otherwise, you can "hire" someone to build it for you.

 

Hope that helps.

 

TC


Hey TC, 

 

I tried contacting TJ, but no answer so far.

Too busy, I guess. But I'm still keeping my hopes up.

 

To answer your items:

 

  1. I went through the tutorials, but unfortunately the only closely related item I found was scraping URLs, which is not of much use to me at this point. I'll keep working through the tutorials, nevertheless :)
  2. I'm requesting help, as this is the first time I've worked with UBot Studio.
  3. Why would I go and hire someone to code the bot for me? That would take all the fun out of learning, as I am quite a hands-on guy... :))
  4. Here's the code I have so far (as promised, I will keep it public, so that others can benefit from it and learn from my mistakes):
clear table(&uploadedcsvtable)
set(#row counter, 1, "Global")
clear list(%productname)
ui open file("Upload CSV file:", #uploadedcsvfile)
ui save file("File save path:", #filesavepath)
create table from file(#uploadedcsvfile, &uploadedcsvtable)
loop($table total rows(&uploadedcsvtable)) {
    navigate("http://www.google.com/ncr", "Wait")
    type text(<name="q">, "allintext: {$table cell(&uploadedcsvtable, #row counter, 0)}", "Standard")
    click(<class="gbqfb">, "Left Click", "No")
    wait($rand(2, 5))
    increment(#row counter)
}
save to file(#filesavepath, &uploadedcsvtable)

Now the questions:

  1. How do I scrape attributes from Google results? As you probably know, the displayed sites differ from one another. I'm thinking about using the "exist" function: looking for a description class/tag and scraping the innertext from it (see the sketch below).
  2. How do I avoid being blocked by Google? As the imported list contains over 10k products, I can imagine that querying this much data will trigger an IP ban from Google. What do you think: proxies?
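
Roughly what I have in mind for both points, as an untested sketch (the <class="desc"> selector and the %proxies list are placeholders I would still have to fill in):

comment("Only scrape the description if the element actually exists on the result page.")
if($exists(<class="desc">)) {
    then {
        add item to list(%descriptions, $scrape attribute(<class="desc">, "innertext"), "Don\'t Delete", "Global")
    }
    else {
        comment("Add a placeholder so the description list stays aligned with the product list.")
        add item to list(%descriptions, "NOT FOUND", "Don\'t Delete", "Global")
    }
}
comment("Rotate to a random proxy from a %proxies list loaded earlier (one ip:port per line).")
change proxy($random list item(%proxies))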

Thanks mate, added you on Skype.

 

As for the coding part, here's what I have up to this point (compacting the initial CSV file):

 

clear list(%Denumire)
clear list(%Pret)
clear list(%Distribuitor)
clear list(%Stoc)
clear list(%Producator)
clear list(%Ambalaj)
clear list(%Cod)
clear list(%SKU)
clear list(%Descriere)
clear table(&uploaded csv file)
clear table(&exported csv file)
ui open file("Upload stock CSV file:", #uploaded stoc file)
ui save file("Where to save the CSV file:", #exportedcsvfilepath)
ui text box("Distributor name:", #nume distribuitor)
ui drop down("Packaging:", " ,cut,buc,ml", #ambalaj)
create table from file(#uploaded stoc file, &uploaded csv file)
plugin command("TableCommands.dll", "delete from table", &uploaded csv file, "Row", 0)
plugin command("TableCommands.dll", "delete from table", &uploaded csv file, "Row", 0)
plugin command("TableCommands.dll", "delete from table", &uploaded csv file, "Column", 4)
add list to list(%Denumire, $plugin function("TableCommands.dll", "$list from table", &uploaded csv file, "Column", 1), "Don\'t Delete", "Global")
add list to table as column(&exported csv file, 0, 0, %Denumire)
alert("List added: Denumire (product name)")
wait(5)
add list to list(%Pret, $plugin function("TableCommands.dll", "$list from table", &uploaded csv file, "Column", 3), "Don\'t Delete", "Global")
add list to table as column(&exported csv file, 0, 1, %Pret)
alert("List added: Pret (price)")
wait(5)
set(#row counter_nume distribuitor, 0, "Global")
loop($list total(%Denumire)) {
    increment(#row counter_nume distribuitor)
    add item to list(%Distribuitor, #nume distribuitor, "Don\'t Delete", "Global")
}
add list to table as column(&exported csv file, 0, 2, %Distribuitor)
alert("List added: Distribuitor (distributor)")
wait(5)
add list to list(%Stoc, $plugin function("TableCommands.dll", "$list from table", &uploaded csv file, "Column", 4), "Don\'t Delete", "Global")
add list to table as column(&exported csv file, 0, 3, %Stoc)
alert("List added: Stoc (stock)")
wait(5)
add list to list(%Producator, $plugin function("TableCommands.dll", "$list from table", &uploaded csv file, "Column", 2), "Don\'t Delete", "Global")
add list to table as column(&exported csv file, 0, 4, %Producator)
alert("List added: Producator (manufacturer)")
wait(5)
set(#row counter_ambalaj, 0, "Global")
loop($list total(%Denumire)) {
    increment(#row counter_ambalaj)
    add item to list(%Ambalaj, #ambalaj, "Don\'t Delete", "Global")
}
add list to table as column(&exported csv file, 0, 5, %Ambalaj)
alert("List added: Ambalaj (packaging)")
wait(5)
add list to list(%SKU, $plugin function("TableCommands.dll", "$list from table", &uploaded csv file, "Column", 0), "Don\'t Delete", "Global")
add list to table as column(&exported csv file, 0, 7, %SKU)
alert("List added: SKU")
wait(5)
comment("%Cod and %Descriere are still empty at this point; they are added as placeholder columns.")
add list to table as column(&exported csv file, 0, 6, %Cod)
add list to table as column(&exported csv file, 0, 8, %Descriere)
save to file(#exportedcsvfilepath, &exported csv file)
alert("Everything done!
What else do you need?")

 

I tested it over and over again.

It seems to work well with lists of under 5,000 items, but when I upload a CSV file with 11k products, it breaks down.

If I run it in UBot Studio, it crashes the system; if I run the compiled version, I get the infamous "Not responding" and, again, a crash.

 

Does it max out my RAM or processor?

Why does it refuse to work?

 

Any ideas, anyone?


Hey guys, 

 

Considering the above to be ongoing work, I started on the scraping tool.

I settled on only one directory, as scraping directly from Google turned out to be a big pain in the "where the sun don't shine": pages differed hugely, so attributes were harder to identify.

 

Well, to cut it a little short, here's what I have up to this point:

 

clear list(%denumire produse)
clear list(%descriere produse)
clear table(&produse cu descriere)
ui open file("Add product list:", #lista produse)
ui save file("Where to save the CSV:", #filesavepath)
add list to list(%denumire produse, $list from file(#lista produse), "Don\'t Delete", "Global")
add list to table as column(&produse cu descriere, 0, 0, %denumire produse)
wait(5)
set(#row counter denumire produs, 0, "Global")
loop($list total(%denumire produse)) {
    clear cookies
    navigate("http://www.directorproduse.ro/", "Wait")
    type text(<class="searchinput">, $list item(%denumire produse, #row counter denumire produs), "Standard")
    click(<class="btn_submit">, "Left Click", "No")
    wait for browser event("Page Loaded", 1)
    click(<class="title">, "Left Click", "Yes")
    wait for browser event("Page Loaded", 1)
    add item to list(%descriere produse, $scrape attribute(<class="descriere marginbottom">, "innerhtml"), "Don\'t Delete", "Global")
    add list to table as column(&produse cu descriere, 0, 1, %descriere produse)
    save to file(#filesavepath, &produse cu descriere)
    increment(#row counter denumire produs)
}

 

Issues:

 

It navigates the entire 11k item list and adds each product name to the specific list, but when trying to scrape the product description based on the product name:

- some of the descriptions get trimmed after the first word. I tried saving the product description list separately, into a .txt file, and actually got better results: the product descriptions were almost complete (HTML tags and all).

- some of the descriptions are skipped (this, I think, is due to the initial product link actually being an external link that redirects to a different website); see the guard sketched below.

- after a while, the bot simply crashes. That is the main reason why I added the save to file command inside the loop: so that I could actually see what the program had done up to that point.
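
For the skipped descriptions, I'm thinking of guarding the scrape with a URL check, something like this untested snippet (it assumes $url still reports the current page after the click):

if($contains($url, "directorproduse.ro")) {
    then {
        add item to list(%descriere produse, $scrape attribute(<class="descriere marginbottom">, "innerhtml"), "Don\'t Delete", "Global")
    }
    else {
        comment("External redirect: record a placeholder so the rows stay aligned.")
        add item to list(%descriere produse, "EXTERNAL LINK", "Don\'t Delete", "Global")
    }
}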

 

Once again:

- Any ideas why the bot keeps not responding/crashing?

- Any ideas on how to save the entire scraped product description to a CSV file, not just the trimmed first word?


If you are using UBot Studio V5, then that's probably the reason for your issues.

Test it with V4 and see if that works better.

 

There are also a couple of logical "issues" in the code, in my opinion.

 

1. You could navigate directly to the search result pages:

navigate("http://www.directorproduse.ro/cauta/{$list item(%denumire produse, #row counter denumire produs)}", "Wait")

 

That would save you a couple of steps.

 

Then you have:

click(<class="title">, "Left Click", "Yes")

 

This could cause issues if your search result has more than one item, because all of the search result items will have the class "title", so the bot will try to click on 40 items at the same time. That's not going to work very well!

 

So normally this is a three-step process:

1. You run your search.

2. You extract the search result URLs.

3. You visit each of the search result URLs and extract the data.

 

You are also not navigating to the other pages of the search results. There are normally multiple pages of results. 

 

Here's a quick example (not fully tested!), just to give you an idea of what I mean:

clear list(%denumire produse)
clear list(%descriere produse)
clear table(&produse cu descriere)
ui open file("Add product list:", #lista produse)
ui save file("Where to save the CSV:", #filesavepath)
add list to list(%denumire produse, $list from file(#lista produse), "Don\'t Delete", "Global")
add list to table as column(&produse cu descriere, 0, 0, %denumire produse)
wait(5)
set(#counter, 0, "Global")
clear list(%resulturls)
loop($list total(%denumire produse)) {
    clear cookies
    navigate("http://www.directorproduse.ro/cauta/{$list item(%denumire produse, #counter)}", "Wait")
    wait for browser event("Page Loaded", 1)
    add list to list(%resulturls, $list from text($scrape attribute(<class="title">, "href"), $new line), "Delete", "Global")
    comment("Now you need a loop to navigate to all the other search result pages if you want to do that.")
    increment(#counter)
}
set(#counter, 0, "Global")
loop($list total(%resulturls)) {
    clear cookies
    navigate("{$list item(%resulturls, #counter)}", "Wait")
    wait for browser event("Page Loaded", 30)
    add item to list(%descriere produse, $scrape attribute(<class="descriere marginbottom">, "innerhtml"), "Don\'t Delete", "Global")
    comment("Why do you add the list to a table now? That's not necessary if you only have 1 column of information!")
    add list to table as column(&produse cu descriere, 0, 1, %descriere produse)
    increment(#counter)
}
save to file(#filesavepath, &produse cu descriere)
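
And for the pagination comment above, the rough shape would be something like this (also untested; the <class="next"> selector is purely a guess, so check the site's actual markup):

comment("Keep collecting result URLs while a next-page link exists, with a sanity cap of 10 pages.")
set(#page counter, 0, "Global")
while($both($exists(<class="next">), $comparison(#page counter, "<", 10))) {
    click(<class="next">, "Left Click", "No")
    wait for browser event("Page Loaded", 10)
    add list to list(%resulturls, $list from text($scrape attribute(<class="title">, "href"), $new line), "Delete", "Global")
    increment(#page counter)
}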

 

Cheers

Dan


Thanks for the tip, Dan,

 

I rewrote it according to your recommendations (the three-step process).

It seems to work fine with small lists, but I shall have to beta test it with my 11k item list.

 

Three issues I could use your help with:

  1. How do I scrape only one URL? (As it is, I get at least 5 for the same product.)
  2. How do I get the scraped attributes (the product description, in my case) fully displayed inside a CSV cell? Right now I use "add list to table as column", starting at row 0, column 1, and it seems to randomly trim the data inside. For some products I only get one word of description, for others five at most, and for others the complete text... Totally strange. (See the cleanup sketch below.)
  3. You are right, I recently bought ver. 5. How do I get my hands on ver. 4?
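
For issue 2, I wonder whether stray commas and line breaks inside the scraped HTML are breaking the CSV cells; maybe cleaning the text before adding it to the table would help. An untested sketch:

comment("Flatten line breaks and commas so the description fits into a single CSV cell.")
set(#clean descriere, $replace regular expression($scrape attribute(<class="descriere marginbottom">, "innerhtml"), "[\r\n,]+", " "), "Global")
add item to list(%descriere produse, #clean descriere, "Don\'t Delete", "Global")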

 

Cheers!


You can send me your updated code (PM) and the keyword where you run into the issue, and I can take a look if you like. It's hard to say without seeing exactly what's happening.

 

Dan
