Replace Special And Non-ASCII Characters Without the Performance Hit of Regular Expressions

09 Jul

Alright, so if you're reading this you should know what regular expressions are. If you don't, go and use Google, then come back :). Now that that is out of the way: I needed to replace any special characters in a string before passing it to another method for processing; in this case it was for a search function. Here is the thing. Yes, you can declare a regex and do a replace with it, but for me that caused a severe performance hit, because the regex engine has to run its pattern matching against every character in the string. And what about the people who use non-ASCII characters (yes, some people do, for some reason)? The code below replaces non-ASCII symbols and punctuation as well, which a simple ASCII-only regex pattern would miss. One caveat: Char.IsLetterOrDigit treats Unicode letters and digits (é, for example) as valid, so those are kept; add an ASCII range check if you need them gone too. By using the .NET methods below, you cut the performance hit by more than half.
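For reference, the regex approach being replaced might look something like the sketch below. The pattern, the RegexBaseline class, and the ReplaceWithRegex name are assumptions made for illustration; the regex I originally used is not shown in this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBaseline
{
    // Hypothetical pattern: one or more characters that are neither
    // a Unicode letter (\p{L}) nor a decimal digit (\p{Nd}).
    private static readonly Regex NonAlphanumeric = new Regex(@"[^\p{L}\p{Nd}]+");

    public static string ReplaceWithRegex(string text, string replacement)
    {
        // Each run of invalid characters becomes a single replacement.
        // Note: Regex.Replace treats '$' in the replacement string specially.
        return NonAlphanumeric.Replace(text, replacement).ToLowerInvariant().Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceWithRegex("Hello,   World!!", "-")); // hello-world-
    }
}
```

This is only a baseline to compare against; how much slower it is will depend on your data and your .NET version, so measure it on your own input.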

For me this was a big deal because the data source I was working with was huge. Now here is a basic truth you need to accept, AND NOT WORK AGAINST IN MY CODE, before I give it to you: strings in .NET are immutable. If you don't know what that means, go back to the Google advice at the top of this writeup. Every change to a string creates a new string object, which is why the code builds its result in a StringBuilder instead of concatenating inside the loop. Alright… if you are going to replace the special characters with nothing, pass String.Empty as the replacement parameter rather than "". (The runtime interns the "" literal, so this is more about stating your intent than about saving allocations, but it is a good habit.) So here you go, in VB.NET and C#:

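    ' Requires: Imports System.Text
    Public Shared Function replaceSpecialCharactersWith(text As String, replacement As String) As String
        Dim sb As New StringBuilder()
        Dim lastWasInvalid = False
        For Each c As Char In text
            If Char.IsLetterOrDigit(c) Then
                ' Valid character: copy it through unchanged.
                sb.Append(c)
                lastWasInvalid = False
            Else
                ' Invalid character: emit one replacement per run of invalid characters.
                If Not lastWasInvalid Then
                    sb.Append(replacement)
                End If
                lastWasInvalid = True
            End If
        Next

        Return sb.ToString().ToLowerInvariant().Trim()

    End Function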


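// Requires: using System.Text;
public static string replaceSpecialCharactersWith(string text, string replacement) {
	StringBuilder strb = new StringBuilder();
	bool lastWasInvalid = false;
	foreach (char c in text) {
		if (char.IsLetterOrDigit(c)) {
			strb.Append(c); // valid character: copy it through unchanged
			lastWasInvalid = false;
		} else {
			// invalid character: emit one replacement per run of invalid characters
			if (!lastWasInvalid) strb.Append(replacement);
			lastWasInvalid = true;
		}
	}
	return strb.ToString().ToLowerInvariant().Trim();
}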

So, for those who don't follow it, just in case: the code walks through the string one character at a time. If the character is valid (a letter or digit), it is appended to the StringBuilder and the loop moves on to the next character. If the character is invalid, the specified replacement is appended in its place and the last character is marked as invalid. That flag is there so you don't end up with two replacements side by side: a run of invalid characters is replaced only once, and the flag resets when the next valid character is found :). Finally, the result is lowercased and trimmed of leading and trailing whitespace. Have fun, and till next time, don't let your code have a meltdown.
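To see that behaviour in action, here is a small console sketch. It repeats the function so it compiles on its own; the ReplaceDemo class and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text;

public static class ReplaceDemo
{
    // Same logic as described above, repeated so this sketch is self-contained.
    public static string replaceSpecialCharactersWith(string text, string replacement)
    {
        StringBuilder strb = new StringBuilder();
        bool lastWasInvalid = false;
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                strb.Append(c);
                lastWasInvalid = false;
            }
            else
            {
                // Only the first character of an invalid run emits the replacement.
                if (!lastWasInvalid) strb.Append(replacement);
                lastWasInvalid = true;
            }
        }
        return strb.ToString().ToLowerInvariant().Trim();
    }

    public static void Main()
    {
        // ",   " and "!!" are each one run, so each becomes a single "-".
        Console.WriteLine(replaceSpecialCharactersWith("Hello,   World!!", "-")); // hello-world-

        // String.Empty simply strips the invalid characters.
        Console.WriteLine(replaceSpecialCharactersWith("C# & .NET", String.Empty)); // cnet

        // With a space as the replacement, Trim() removes the leading and trailing ones.
        Console.WriteLine(replaceSpecialCharactersWith("  Foo_Bar  ", " ")); // foo bar

        // U+00E9 (é) is a Unicode letter and is kept; U+2615 is a symbol and is replaced.
        string accented = replaceSpecialCharactersWith("Caf\u00e9 \u2615 2012", "-");
        Console.WriteLine(accented == "caf\u00e9-2012"); // True
    }
}
```

Note that Trim() only removes whitespace, which is why the trailing "-" survives in the first example.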

Posted by on July 9, 2012 in C#, Entity Framework

