UBotBuddy 331 Posted June 10, 2010

I am compiling a report here at work from our main website, and UBot typically does really well, but here I am having a bit of a problem. The content was apparently created in Word, and it has those stupid characters embedded throughout: you know, the reverse quotes (Word's curly "smart" quotes), those kinds of characters. Anyway, when I am building my list, the $search function shows the text in the box OK, but when I apply it to Add List, the characters appear. I have tried using Choose Attribute as well, with innertext, value, or even outertext, but the effect is still the same. I have also tried using javascript to do a search and replace, but for some reason it is not working. Not sure if that is due to the Beta version. I have done search and replace before, so I am pretty sure it is these high characters that keep it from working. Very strange indeed.
Pete 121 Posted June 10, 2010

Really depends on how many different characters you need to remove before you page scrape.

Choose attribute:
  Attribute: Innertext
  Search String: [the bad characters]
  Method: exact match

Change choose attribute:
  Attribute: Innertext
  New value: $nothing
UBotBuddy 331 Posted June 10, 2010 Author

Yep, that's what I did as well but it no workie.
MiriamMB 63 Posted June 10, 2010

One thing you can do, since the characters are strange and kind of hard to pin down, is to use wildcards in the choose by attribute command. Wildcards are asterisks, which represent any number of characters, and question marks, which represent just one character. For example, if you are trying to scrape this sentence:

the dog bit the "sausage"

you could find it by using this search text:

the dog bit the ?sausage?

which would then scrape the sentence containing the word "sausage", with the two question marks taking the place of the quotation marks. It is best to use question marks here because they are cleaner than an asterisk.
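[Editorial aside: the wildcard semantics described above can be sketched in javascript for anyone who wants to see them spelled out. This is only an illustration of the matching rules, not UBot's actual implementation; the function name is made up for the example.]

```javascript
// Illustration only: convert a wildcard pattern to a RegExp,
// where * matches any run of characters and ? matches exactly one.
function wildcardToRegExp(pattern) {
  // Escape regex metacharacters (but not * or ?), then translate the wildcards.
  var escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  var translated = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp("^" + translated + "$");
}

// The ? wildcards stand in for whatever quote characters Word inserted.
var re = wildcardToRegExp("the dog bit the ?sausage?");
console.log(re.test('the dog bit the "sausage"'));          // true
console.log(re.test("the dog bit the \u201Csausage\u201D")); // true (curly quotes)
```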
UBotBuddy 331 Posted June 10, 2010 Author

Wildcards are not the problem; it is the result of the scrape. I see quotes in the content, but after the scrape I see odd characters (which come from MS Word). I cannot predict where those characters will appear, so I cannot even use the wildcards, since it depends on the content. I wish it would work.

Buddy
Guest Jim Posted June 10, 2010

Attached is a little example I put together that uses a javascript regular expression to strip all characters that aren't on a 'whitelist'. Here's the javascript for easy copy-pasting (it goes in an $eval):

"{1}".replace(/[^a-zA-Z 0-9\-_\`\~\!\@\#\$\%\^\&\*\(\)\.\,\:\;\?\']+/g,"");

{1} being your scraped data variable.

strip_crazy_characters.ubot
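[Editorial aside: an alternative to stripping, if you want to keep the punctuation readable, is to translate the common Word "smart" characters to their plain-ASCII equivalents instead of deleting everything off the whitelist. This is a sketch of the idea only; it has not been tested inside UBot's $eval, and the character list just covers the usual Word suspects (curly quotes, en/em dashes, ellipsis, non-breaking space).]

```javascript
// Sketch: map common MS Word "smart" characters to plain ASCII
// instead of stripping them outright.
function deWordify(text) {
  return text
    .replace(/[\u2018\u2019\u201A]/g, "'")  // curly single quotes
    .replace(/[\u201C\u201D\u201E]/g, '"')  // curly double quotes
    .replace(/[\u2013\u2014]/g, "-")        // en dash and em dash
    .replace(/\u2026/g, "...")              // horizontal ellipsis
    .replace(/\u00A0/g, " ");               // non-breaking space
}

console.log(deWordify("\u201CHello\u201D \u2013 it\u2019s fine\u2026"));
// "Hello" - it's fine...
```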
alcr 135 Posted June 10, 2010

Are they UTF-8 characters, or are the symbols messed up in Word? Because 1. UBot does not support UTF-8, and 2. use OpenOffice instead.
Frank 177 Posted June 10, 2010

Word does a lot of very screwy things; I've had issues in many other places too. Maybe saving out of Word in 'open document format' would help. I believe that this is the same format that OpenOffice uses. BUT, that's just a big guess.

Frank
UBotBuddy 331 Posted June 10, 2010 Author

I wish I could use OpenOffice for this, but I am scraping from the site, not building it. My goal is to take this data and autoload it on another site. It's a joke site, so these are small paragraphs that I am grabbing.

http://www.ajokeaday.com/ChisteDelDia.asp

Buddy
Pete 121 Posted June 11, 2010

What are you saving it to? I am thinking Excel, as when I save to a .txt file those bad characters turn out to be , " ! and so on. So maybe save to a .txt file, then read it back into UBot to save in your format.
Net66 54 Posted June 11, 2010

Thanks for this Jim! Another gem from the Javascript expert :-)
UBotBuddy 331 Posted June 11, 2010 Author

Jim, you da' man! Thanks, that did the trick PERFECTLY!
UBotBuddy 331 Posted June 11, 2010 Author

Hmmm... it works, BUT it totally strips out the first two jokes. I mean gone. The other five are left perfectly.
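[Editorial aside: one guess at why whole jokes could vanish, which is an assumption and not verified against UBot: the snippet embeds the scraped data in a double-quoted literal as "{1}", so if a joke itself contains a double quote or a line break, the literal is malformed and the whole $eval can fail for that item, yielding nothing. A defensive sketch would escape the text before it is embedded; the function name here is made up for the example.]

```javascript
// Guessed failure mode: a double quote inside the scraped text breaks
// the "{1}" literal. Escaping the text before embedding avoids that.
function escapeForLiteral(text) {
  return text
    .replace(/\\/g, "\\\\")   // backslashes first, so we don't double-escape
    .replace(/"/g, '\\"')     // then double quotes
    .replace(/\r/g, "\\r")
    .replace(/\n/g, "\\n");   // literal line breaks
}

var joke = 'He said "ouch"\nand left.';
// Build the same kind of snippet the $eval would see, but with escaping:
var snippet = '"' + escapeForLiteral(joke) + '".toUpperCase();';
console.log(eval(snippet)); // round-trips safely, then uppercases
```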