UBot Underground

Scraping BIG stuff


Recommended Posts

I've been a scraping fool this week, but one thing that has stopped me cold is trying to scrape all of the text off one of my old HTML pages and store it in a CSV file, as content to be made into a blog post later.

 

I think I'm hitting some kind of limit... I've tried just scraping the text (innertext) into its own list, but even with a smallish page of 300 words, the saved file turns into gibberish, just a bunch of Y's! (In Notepad... no spreadsheet program can open it.) :(

 

What I'd prefer is a text file where each row is an entire page of HTML-included text that I choose between '$scrape page' placemarker tags. I'm guessing this isn't working because of the natural line breaks it must be scraping too, and probably because HTML tags themselves can't be scraped? Am I right?

 

So does anyone have any idea how to scrape a large block of text in an html page?

 

Cheers,

Luke


Maybe scrape the text into a list using a new line as the delimiter?
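Frank's newline-delimiter idea could be sketched in Python (a hypothetical stand-in for UBot's list commands, not actual UBot syntax):

```python
# Split a scraped block of text into list items, using newlines as the
# delimiter and dropping blank rows (roughly what "add to list" with a
# newline delimiter would do).
def text_to_list(block: str) -> list[str]:
    return [line.strip() for line in block.splitlines() if line.strip()]

rows = text_to_list("First paragraph.\n\nSecond paragraph.\n")
```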

 

Thanks for the idea, Frank. Knowing what to grab isn't really the problem here, because I'm taking my text from an old XSitePro-made HTML page, and those have handy XSP DIVs labeling every part of the page.

 

Ideally I'd like to grab everything inside a table division called "XSP_MAIN_PANEL" and make all of the resulting text and HTML tags a single item in my list. When I wasn't able to grab that alone, I came up with many other ways to identify it, but no matter what, once the item is a big chunk of text and HTML, the resulting list displays broken.

 

If this is impossible (someone please let me know if that's the case!), I have two options:

 

1. Find a way to scrape all of the text (by attribute, such as innertext) without the HTML, or

2. Scrape each paragraph on the page (inside its paragraph tags) into a separate list item.

 

I've been trying to figure out #1, but it still eludes me. And #2 downright scares me, because I have no idea how I'm going to tie all of those list items back to a single page and use them correctly later.
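For what option #1 would look like outside of UBot, here is a minimal Python sketch that pulls only the inner text out of a chunk of HTML using the standard library's html.parser (the XSP_MAIN_PANEL div is just an assumed example from the page described above):

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect just the text nodes of an HTML fragment, skipping all
    tags -- roughly what scraping the 'innertext' attribute returns."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def inner_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    # Collapse line breaks and runs of spaces so the result is one
    # clean block of text.
    return " ".join("".join(parser.parts).split())

inner_text('<div id="XSP_MAIN_PANEL"><p>Hello</p> <p>world</p></div>')
```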

 

Anyone been through this before? Surely I'm not the first to want to scrape all the words from a single page?


Do you have a link to what you are trying to scrape?

 

Hi Aaron,

 

I just PM'd you a URL.

 

It sounds like the bot you're speaking of would do the trick, but I still can't see how it's grabbing it all, so I'm really looking forward to your evaluation.

 

Cheers,

Luke


I've scraped articles before, saved them in a list, then posted them in WP. I'm sure this is probably the same thing.

 

Aaron,

 

I've found your ezine article scraper bot (very nice, BTW) and compared its guts to mine.

 

I was scraping much the same way, adding the scraped contents of a division to a list. You did separate the title from the body from the resource box, but basically your scraping included HTML tags and line breaks, which I was worried wouldn't be allowed.

 

The part that is confusing to me is that for the %Title list you did a $Page Scrape and gathered everything between a set of DIV tags. But for the %Article Body list, you chose instead to use $Scrape Chosen Attribute with 'innerhtml'.

 

The only difference between these two scraping commands in the Commando Guide is how you choose the thing to scrape. I frankly can't see anything to be gained from using $Scrape Chosen Attribute, because it seems much more limited than the full-featured $Page Scrape.

 

So why did you use the $Scrape Chosen Attribute command on the body content, and is it necessary for grabbing content from the page I PM'd you about?

 

Thanks again,

Luke


So why did you use the $Scrape Chosen Attribute command on the body content, and is it necessary for grabbing content from the page I PM'd you about?

 

Thanks again,

Luke

 

I only use Scrape Chosen Attribute when I don't want HTML or anything else in my results. If you use Scrape Chosen Attribute and select innertext, then you get just the text.
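The innerhtml/innertext distinction can be illustrated with plain Python (a hypothetical demo using the standard library, not UBot's own commands):

```python
import xml.etree.ElementTree as ET

fragment = "<div><p>First para.</p><p>Second para.</p></div>"
root = ET.fromstring(fragment)

# "innerhtml": the children serialized with their tags intact
inner_html = "".join(ET.tostring(c, encoding="unicode") for c in root)

# "innertext": only the text, no markup at all
text_only = "".join(root.itertext())
```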

 

I didn't understand what you wanted from the PM. That is a simple website with like 10 pages. You could have scraped all that manually and been done with it already. A fuller explanation of what you are trying to do would help.


I only use Scrape Chosen Attribute when I don't want HTML or anything else in my results. If you use Scrape Chosen Attribute and select innertext, then you get just the text.

 

Ahh... so a simple $Scrape Page command won't remove HTML formatting and line breaks, but $Scrape Chosen Attribute, at least when using innertext, will. Is this right?

 

That being the case, it sounds like I'd want to keep my HTML formatting for the sake of spacing out my paragraphs in the final blog post. Or is there a reason not to do this?

 

 

I didn't understand what you wanted from the PM. That is a simple website with like 10 pages. You could have scraped all that manually and been done with it already. A fuller explanation of what you are trying to do would help.

 

Well, the idea is for this to be a simple HTML-Site-To-Wordpress-Blog bot.

 

The overall bot is scraping up little sites like that one and making each of their pages a blog post, much like your Ezine Articles 2 WP bot. I've already got it making a list of all URLs from the XML feed, and then it loops through each URL to add the contents of this scrape to the second column of a CSV. (The first column is the aforementioned links scraped from the XML feed.)

 

I am assuming that since there are commas in the blocks of text, I'll need to save this as a CSV with pipe (|) delimiting rather than actual commas. Is this right?
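A quick note on the comma worry: a CSV writer that quotes its fields keeps embedded commas safe, so pipe delimiting is a choice rather than a necessity (whether UBot's own save command quotes fields is a separate question). A Python sketch of both, using the url/content column layout described above (the URL is a made-up example):

```python
import csv
import io

# Column 1: the old page URL; column 2: the scraped page content.
rows = [("http://example.com/page1.html",
         "Some text, with commas, inside.")]

# Standard CSV: the writer quotes the second field, so its commas
# survive a round trip.
comma_buf = io.StringIO()
csv.writer(comma_buf).writerows(rows)

# Pipe-delimited: safe as long as the content never contains "|".
pipe_buf = io.StringIO()
csv.writer(pipe_buf, delimiter="|").writerows(rows)
```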

 

I can then make it post the contents of the CSV, just like yours does, into posts with a one-day increment between them. Finally, I'd like it to use the Redirection plugin to redirect the old URLs from column 1 of the CSV to the new blog posts... haven't figured that part out yet, actually.

 

Anyway, it really is a lot like your bot, except that instead of searching for articles, it just scrapes the URLs of a single domain and then scrapes the content from those URLs. Now that I think of it, this might make a great tutorial bot for your site. :D

 

Thanks again,

Luke


This is going from bad to worse... :(

 

Attached is a CSV file I made by simply scraping (by attribute) the inner text between some <h2>title</h2> tags into a %list, and then saving that list as a .csv or .txt file. (Both show the same contents, and I can't get either to open in a spreadsheet anyway!)

test.txt

 

What the heck am I looking at? It really should just be text in that file, right?
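One guess about those Y's: a file full of "ÿ"-like characters often means the text was written as UTF-16 and then opened as ANSI, since the UTF-16 byte-order mark is the bytes 0xFF 0xFE. Whether UBot actually writes UTF-16 is an assumption here, but it is easy to check; a Python sketch that sniffs for the BOM and re-decodes:

```python
# If the first two bytes are a UTF-16 byte-order mark, decode the file
# as UTF-16; otherwise fall back to UTF-8. (Hypothetical diagnostic for
# the "bunch of Y's" symptom, not part of any bot.)
def read_maybe_utf16(raw: bytes) -> str:
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return raw.decode("utf-16")  # the BOM fixes the byte order
    return raw.decode("utf-8", errors="replace")

sample = "Hello".encode("utf-16")  # simulates a UTF-16 file with a BOM
text = read_maybe_utf16(sample)
```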


Why are you saving it to txt or CSV instead of just keeping it in the list and then posting it where you need?

 

I suppose I could, if I make this bot really, really long... I was planning on running this bot as a stand-alone to grab the content, running another bot to fully arrange and configure the blog, and then running a third bot to post this content in afterwards.

 

It would also give me the chance to manually clean the data if it needs it.

 

If it makes a big difference, though, I'd be OK with consolidating, perhaps using more scripts inside a single bot, or rearranging the order.

 

How big is a bot's usable threshold, anyway? I'm up to 500 KB on my .ubot file for the configuration part, and growing.


Oh Yeah, and let's not forget having a backup, too...

 

I can't imagine doing this without saving a backup of the old site this will be destroying and replacing. I guess I wouldn't be OK with consolidating after all. :unsure:

 

Do you have any ideas what is going on with that test.txt file above? Why can't I simply scrape a page and save it to a file in a usable format?

