UBot Underground

I am scraping data from various websites, some of which have non-English characters in names and other fields.

 

I would like to save the scraped data in a normalized format, for instance:

  • Nürnberg would become Nurnberg
  • Orléans would become Orleans
  • Matalascañas would become Matalascanas
  • etc...

 

Any easy way to do this?

Ideas?

 

Thanks in advance!


You could also try navigating to an online converter and performing the conversion there...

Thanks Duane...

I thought of something like that and I have to THANK YOU for providing those 3 useful links...

 

However, I am not able to find any character set that would work the way I want.

 

The sites you mentioned would leave:

 

Kissa käveli öisellä kadulla

 

unchanged, instead of making it look like:

 

Kissa kaveli oisella kadulla

This is not a quick way, but you could scrape the data and then use a series of if statements.

 

Each if statement should check whether the text contains a specific character and, if so, replace it with your desired character.

 

The quickest way to create all the if statements is to switch to code view, copy and paste the first one, and change the characters in each copy.
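
The chain of per-character replacements described above can also be written as one lookup table. This is only a minimal sketch in Python rather than UBot script. The name `to_ascii` and the table contents are illustrative assumptions, and the table would need extending for any other characters in your data:

```python
# Hypothetical lookup table: each accented character maps to its ASCII replacement.
REPLACEMENTS = str.maketrans({
    "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u",
    "ñ": "n", "ü": "u", "ä": "a", "ö": "o",
})


def to_ascii(text: str) -> str:
    """Replace every character listed in the table; leave all others untouched."""
    return text.translate(REPLACEMENTS)


print(to_ascii("Nürnberg"))      # Nurnberg
print(to_ascii("Matalascañas"))  # Matalascanas
```

Each entry in the table plays the role of one if statement, so adding a new character is a one-line change.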

 

Maybe there is a REGEX that would replace them all?

 

I am NOT a Regex expert, but if you have any ideas along this line, please share... Thanks!


Actually, regarding the Regex: I guess I would have to loop through each different character I want replaced, with a specific Regex for each one...
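
A loop over separate patterns may not be necessary: one character class can match all the characters, and a lookup can decide the replacement for each match. A rough sketch in Python (not UBot script), where `MAPPING` and `replace_accents` are hypothetical names and the mapping is deliberately incomplete:

```python
import re

# Illustrative mapping (an assumption); extend it with the characters you need.
MAPPING = {
    "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u",
    "ñ": "n", "ü": "u", "ä": "a", "ö": "o",
}

# Build a single character class such as "[áéíóúñüäö]" from the mapping keys.
PATTERN = re.compile("[" + re.escape("".join(MAPPING)) + "]")


def replace_accents(text: str) -> str:
    """Replace every matched character with its mapped ASCII value in one pass."""
    return PATTERN.sub(lambda match: MAPPING[match.group(0)], text)


print(replace_accents("Kissa käveli öisellä kadulla"))  # Kissa kaveli oisella kadulla
```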


Looks like I found the page that'd do it automatically:

 

Mañana -> Manana

 

http://unicode.org/c...-ASCII&b=Mañana

 

Nürnberg -> Nurnberg

 

http://unicode.org/c...SCII&b=Nürnberg

 

Now I'll write code to scrape the result from that page


Now I'll only check IF the scraped text includes any of the "áéíóúñüäö" etc. characters (I would still need a Regex for those, though),

and IF it does, navigate to the page and scrape the transformation.

 

I suspect there may be many more than only "áéíóúñüäö" variations, so this could be a rather long Regex to build, but I suppose it would be useful for many people...
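
One way to avoid listing every variation is to let Unicode decomposition do the work: split each accented letter into its base letter plus a combining mark, then drop the marks. The sketch below is Python rather than UBot script, and `strip_diacritics` and `has_non_ascii` are hypothetical helper names. It is not equivalent to the Latin-ASCII transform on unicode.org: letters with no decomposition, such as "ß" or "ø", pass through unchanged.

```python
import unicodedata


def has_non_ascii(text: str) -> bool:
    """Cheap pre-check: True if any character falls outside the ASCII range."""
    return any(ord(ch) > 127 for ch in text)


def strip_diacritics(text: str) -> str:
    """Decompose accented letters (NFD) and drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(has_non_ascii("Mañana"))       # True
print(strip_diacritics("Mañana"))    # Manana
print(strip_diacritics("Nürnberg"))  # Nurnberg
```

The pre-check plays the same role as the "only IF the text includes such characters" test, without needing an explicit character list.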


Ok, here is the code for the function that transforms non-ASCII Latin characters (or, as I said initially, the non-English characters) into normalized Latin-ASCII text:

 

define $LatinASCII(#var_TMP_ANYtxt) {
   set(#var_TMP_LATINtxt, #var_TMP_ANYtxt, "Global")
   set(#var_CheckLatin, $replace regular expression(#var_TMP_LATINtxt, "[áéíóúñüäö]", ""), "Local")
   if($comparison($text length(#var_CheckLatin), "<", $text length(#var_TMP_LATINtxt))) {
    then {
	    navigate("http://unicode.org/cldr/utility/transform.jsp?a=Latin-ASCII&b={#var_TMP_ANYtxt}", "Wait")
	    set(#var_TMP_LATINtxt, $scrape attribute(<innertext=w"*
Result
*">, "innertext"), "Global")
	    set(#var_TMP_LATINtxt, $trim($substring(#var_TMP_LATINtxt, $add($find index(#var_TMP_LATINtxt, "Result"), 6), $find index(#var_TMP_LATINtxt, "Fonts and"))), "Global")
	    set(#var_TMP_LATINtxt, $trim($substring(#var_TMP_LATINtxt, 0, $find index(#var_TMP_LATINtxt, "Fonts and"))), "Global")
    }
    else {
    }
   }
   return(#var_TMP_LATINtxt)
}

 

The function may be called within any other command, function, etc...

 

set(#TESTString, $LatinASCII("áéíóúñüäö"), "Global")

 

It navigates to the transformation page ONLY when the text contains one of those non-ASCII characters, which greatly reduces the traffic and the load on that page.


Now, REGEX experts, please... chime in:

 

How do I make the REGEX work for both lower AND UPPER case characters, without writing the UPPER case characters in it too?

 

Is there a Regex option that allows that? I read somewhere (but I didn't manage to understand) that "i" might be used for such a result?

 

If that's the case, what would the new REGEX look like using this "i" option?

Thanks!
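
For what it's worth, the "i" option is the case-insensitive flag. In many regex flavors it can be written inline as `(?i)` at the start of the pattern, so the expression would become `(?i)[áéíóúñüäö]`. Whether UBot's regex engine accepts the inline form is an assumption that has not been checked here. The sketch below only shows the behavior in Python, where `contains_accent` is a hypothetical helper:

```python
import re

ACCENTED = "[áéíóúñüäö]"

# Two equivalent ways to ask for case-insensitive matching in Python.
with_flag = re.compile(ACCENTED, re.IGNORECASE)
inline = re.compile("(?i)" + ACCENTED)


def contains_accent(text: str) -> bool:
    """True if the text holds any listed character, in lower or upper case."""
    return with_flag.search(text) is not None


print(contains_accent("ÑANDÚ"))                # True
print(inline.search("Ü") is not None)          # True
print(re.search(ACCENTED, "Ü") is not None)    # False (case-sensitive by default)
```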
