UBot Underground

Any ideas on this



I am compiling a report here at work from our main website. UBot typically handles this really well, but I am having a bit of a problem with this one.

 

The content was apparently created in Word, and it has those stupid characters embedded throughout. You know, the reverse (curly) quotes and that kind of character.

 

Anyway, when I am building my list, the $search function shows them fine in the box, but when I apply it to Add List the odd characters appear. I have also tried Choose Attribute with innertext, value, and even outertext, but the effect is still the same.

 

I have also tried using JavaScript to do a search & replace, but for some reason it is not working. Not sure if that is due to the Beta version.

 

I have done Search & Replace before, so I am pretty sure it is these high characters that it fails on.

 

Very strange indeed.


It really depends on how many different characters you need to remove before you scrape the page.

 

Choose attribute
    Attribute: Innertext
    SearchString: Bad characters
    Method: exact match

Change choose attribute
    Attribute: Innertext
    Newvalue: $nothing


One thing you can do, since the characters are strange and kind of hard to pin down, is to use wildcards in the choose by attribute command. The wildcards are the asterisk, which represents any number of characters, and the question mark, which represents just one character.

 

For example,

if you are trying to scrape this sentence:

 

-the dog bit the "sausage"

 

You could find it by having your search text as:

 

-the dog bit the ?sausage?

 

which would then scrape the sentence with the word "sausage", with the two question marks taking the place of the quotation marks.

 

It is best to use question marks where you can, because each one stands for exactly one character, which keeps the match cleaner than an asterisk would.
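For anyone who wants to try the same wildcard idea in plain JavaScript rather than in choose by attribute, here is a minimal sketch. The helper name `wildcardToRegExp` is made up for this illustration, and it is only an approximation of the rules described above (asterisk for any run of characters, question mark for exactly one), not UBot's actual matching code.

```javascript
// Hypothetical helper (not a UBot function): converts a wildcard pattern
// into a RegExp. "*" matches any number of characters, "?" matches one.
function wildcardToRegExp(pattern) {
  // Escape regex metacharacters, leaving "*" and "?" for the next step.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  // Translate the two wildcards into their regex equivalents.
  const source = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  // Anchor the pattern so the whole string has to match.
  return new RegExp("^" + source + "$");
}

const matcher = wildcardToRegExp("-the dog bit the ?sausage?");

// Curly quotes as Word would write them (U+201C and U+201D).
console.log(matcher.test("-the dog bit the \u201Csausage\u201D")); // true
// Plain straight quotes match as well.
console.log(matcher.test('-the dog bit the "sausage"')); // true
// No character at all where the "?" sits, so this does not match.
console.log(matcher.test("-the dog bit the sausage")); // false
```

The question marks match whatever single character sits in that position, so the pattern works whether the page serves curly or straight quotes.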


Wildcards are not the problem. It is the result of the scrape. I see quotes in the content, but after the scrape I see odd characters (which come from MS Word). Since it depends on the content, I cannot predict where those characters will appear, so I cannot even use the wildcards.

 

I wish it would work.

 

Buddy


Attached is a little example I put together that uses a JavaScript regular expression to strip all characters that aren't on a 'whitelist'.

 

And here's the JavaScript for easy copy-pasting (it goes in an $eval):

"{1}".replace(/[^a-zA-Z 0-9\-_\`\~\!\@\#\$\%\^\&\*\(\)\.\,\:\;\?\']+/g,"");

{1} being your scraped data variable.

strip_crazy_characters.ubot
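For reference, here is a rough standalone sketch of the same whitelist idea, plus an optional step that swaps common Word characters for plain ASCII before stripping. The function names and the replacement table are assumptions made up for this example. They are not taken from the attached .ubot file, and the list of Word characters is only a guess at the usual suspects (curly quotes, dashes, ellipsis, non-breaking space).

```javascript
// Hypothetical table: typical Word characters and an ASCII stand-in for each.
const WORD_REPLACEMENTS = {
  "\u2018": "'",   // left single curly quote
  "\u2019": "'",   // right single curly quote / apostrophe
  "\u201C": '"',   // left double curly quote
  "\u201D": '"',   // right double curly quote
  "\u2013": "-",   // en dash
  "\u2014": "-",   // em dash
  "\u2026": "...", // ellipsis
  "\u00A0": " ",   // non-breaking space
};

// Replace each known Word character with its ASCII stand-in.
function normalizeWordCharacters(text) {
  return text.replace(
    /[\u2018\u2019\u201C\u201D\u2013\u2014\u2026\u00A0]/g,
    (ch) => WORD_REPLACEMENTS[ch]
  );
}

// Same whitelist as the $eval one-liner: drop everything that is not listed.
function stripNonWhitelisted(text) {
  return text.replace(/[^a-zA-Z 0-9\-_\`\~\!\@\#\$\%\^\&\*\(\)\.\,\:\;\?\']+/g, "");
}

const scraped = "It\u2019s the dog \u2014 he bit the \u201Csausage\u201D\u2026";

// Stripping alone removes the Word characters entirely.
console.log(stripNonWhitelisted(scraped));
// It normalizes first, so the apostrophe, dash and ellipsis survive as ASCII.
console.log(stripNonWhitelisted(normalizeWordCharacters(scraped)));
```

Note that the whitelist does not include the straight double quote, so double quotes are dropped even after normalizing. That is probably the safer choice inside an $eval, since the scraped text is pasted between double quotes there.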


Word does a lot of very screwy things. I've had issues in many other places too. Saving out of Word in 'OpenDocument format' may help. I believe that this is the same format that OpenOffice uses.

 

BUT, that's just a big guess.

 

Frank


I wish I could use OpenOffice for this, but I am scraping from the site, not building it. My goal is to take this data and autoload it on another site. It's a joke site, so these are small paragraphs that I am grabbing.

 

http://www.ajokeaday.com/ChisteDelDia.asp

 

Buddy


What are you saving it to?

I am thinking Excel, because when I save it to a .txt file those bad characters turn out to be , " ! and so on.

So maybe try saving to a .txt file and then reading it back into UBot to save in your format.



Thanks for this Jim! Another gem from the Javascript expert :-)
