zonfar Posted May 21, 2011 When I scrape, I sometimes get random pieces of unwanted HTML mixed into the results. I'm just looking for a way to strip all of that out and keep plain, readable text. Usually I use $replace for small fixes, but it doesn't seem to support * wildcards.
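Since $replace only handles literal text, the usual workaround for "wildcard" cleanup is a regular expression that matches any tag. A minimal Python sketch of that idea (the function name and sample string are just for illustration, not anything from UBot):

```python
import re

def strip_tags(scraped):
    """Remove HTML tags from scraped text, then tidy leftover whitespace."""
    text = re.sub(r"<[^>]+>", "", scraped)    # drop anything between < and >
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(strip_tags('Price: <span class="big">$9.99</span><br>'))
# prints: Price: $9.99
```

The pattern `<[^>]+>` plays the role of a wildcard: it matches a `<`, any run of characters that isn't `>`, then the closing `>`, so every tag is removed in one pass.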
UBotBuddy Posted May 21, 2011 Can you post an example bot? Many times it is how you are scraping that determines what is being grabbed.
zonfar Posted May 21, 2011 (Author) Thanks! That's all you had to say for me to realize that it's better to scrape by attribute rather than doing a regular page scrape!
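The difference the author landed on, scraping by attribute instead of pattern-matching the whole page, can be sketched in Python with the standard library's HTML parser (the class name, attribute values, and sample markup here are hypothetical, chosen only to show the technique):

```python
from html.parser import HTMLParser

class TitleScraper(HTMLParser):
    """Collect text only from <a class="title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; match on the class attribute
        if tag == "a" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

page = '<div><a class="title" href="/1">Blue Widget</a><a href="/2">skip me</a></div>'
s = TitleScraper()
s.feed(page)
print(s.titles)  # prints: ['Blue Widget']
```

Because the scraper keys on the element's attribute, the second link with no `class="title"` is ignored, which is exactly the stray-HTML problem the original poster was hitting with whole-page scrapes.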
Seth Turin Posted May 22, 2011
Super Dave Posted May 23, 2011 Also keep in mind that some elements are difficult to pull off a page all at once. In those cases it can be useful to scrape the surrounding HTML (the easier stuff) first, then use the write-to-browser feature to spit it back out one block at a time. This lets you scrape from within generic elements that would otherwise give you tons of false matches.
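The two-pass approach described above can be sketched in Python: match the distinctive outer blocks first, then extract the generic inner element from each block in isolation (the class names and sample data are made up for the illustration):

```python
import re

page = ('<li class="result"><b>Item A</b> <span>$5</span></li>'
        '<li class="result"><b>Item B</b> <span>$8</span></li>'
        '<span>unrelated span elsewhere on the page</span>')

# First pass: grab each easy-to-match outer block by its distinctive class.
blocks = re.findall(r'<li class="result">(.*?)</li>', page)

# Second pass: pull the generic <span> out of one block at a time, so
# identical <span> tags elsewhere on the page cannot cause false matches.
prices = [re.search(r"<span>(.*?)</span>", b).group(1) for b in blocks]
print(prices)  # prints: ['$5', '$8']
```

Note that scanning the whole page for `<span>` directly would also pick up the unrelated span; narrowing the context first is what removes the false match.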