Search Extensions: Soundex performance analysis

3 min read

NinjaNye.SearchExtensions : Performance Analysis

This post is to explain, test and analyse the performance of the Soundex functionality within NinjaNye.SearchExtensions.

###Test environment All of the test results are from my development machine with the following specification:

  • Intel Core i5-3317U CPU @ 1.70GHz
  • 10Gb RAM
  • Windows 8.1 64bit operating system

###Test Setup The tests were all performed against 1 million randomly generated words ranging from 2 to 10 characters. Each test had a new set of randomly generated words with which to work on. This processing was not included in the timing results.

Building the random words was done as follows:

private List<string> words;
private void BuildWords(int wordCount)
{
    Console.WriteLine("Building {0} words...", wordCount);
    this.words = new List<string>();
    for (int i = 0; i < wordCount; i++)
    {
        string randomWord = this.BuildRandomWord();
        this.words.Add(randomWord);
    }
}

private const string letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
private string BuildRandomWord()
{
    var letterCount = RandomInt(2, 10);
    var sb = new StringBuilder(letterCount);
    for (int i = 0; i < letterCount; i++)
    {
        var letterIndex = RandomInt(0, 51);
        sb.Append(letters[letterIndex]);
    }
    return sb.ToString();
}

private static int RandomInt(int min, int max)
{
    var rng = new RNGCryptoServiceProvider();
    var buffer = new byte[4];

    rng.GetBytes(buffer);
    int result = BitConverter.ToInt32(buffer, 0);

    return new Random(result).Next(min, max);
}

The above methods meant I was simply able to have the following [SetUp] action performed on before every test.

[SetUp]
public void Setup()
{
    this.BuildWords(1000000);
}

###The tests All the output you see below is taken from a single sample test run. These figures will vary slightly on each run since the source data is never exactly the same ####Converting 1 million words to soundex

[Test]
public void ToSoundex_OneMillionRecords_UnderOneSecond()
{
    //Arrange
    Console.WriteLine("Processing {0} words", words.Count);
    var stopwatch = new Stopwatch();
    Console.WriteLine("Begin soundex...");
    stopwatch.Start();

    //Act
    var result = words.Select(x => x.ToSoundex()).ToList();
    stopwatch.Stop();
    Console.WriteLine("Time taken: {0}", stopwatch.Elapsed);
    Console.WriteLine("Results retrieved: {0}", result.Count());
    //Assert
    Assert.True(stopwatch.Elapsed.TotalMilliseconds < 1000);
}

Output

Building 1000000 words...
Processing 1000000 words
Begin soundex...
Time taken: 00:00:00.6148250
Results retrieved: 1000000

####Querying 1 million words that sound like 'test'

[Test]
public void SearchSoundex_OneMillionWordsComparedToOneWord_UnderOneSecond()
{
    //Arrange
    Console.WriteLine("Processing {0} words", words.Count);

    var stopwatch = new Stopwatch();
    Console.WriteLine("Begin soundex search...");
    stopwatch.Start();

    //Act
    var result = words.Search(x => x).Soundex("test").ToList();
    stopwatch.Stop();
    Console.WriteLine("Time taken: {0}", stopwatch.Elapsed);
    Console.WriteLine("Results retrieved: {0}", result.Count);
    //Assert
    Assert.True(stopwatch.Elapsed.Milliseconds < 1000);
}

Output

Building 1000000 words...
Processing 1000000 words
Begin soundex search...
Time taken: 00:00:00.6203847
Results retrieved: 554

####Querying 1 million words that sound like 'test' or 'bacon'

[Test]
public void SearchSoundex_OneMillionWordsComparedToTwoWords_UnderOneSecond()
{
    //Arrange
    Console.WriteLine("Processing {0} words", words.Count);

    var stopwatch = new Stopwatch();
    Console.WriteLine("Begin soundex search...");
    stopwatch.Start();

    //Act
    var result = words.Search(x => x).Soundex("test", "bacon").ToList();
    stopwatch.Stop();
    Console.WriteLine("Time taken: {0}", stopwatch.Elapsed);
    Console.WriteLine("Results retrieved: {0}", result.Count);
    //Assert
    Assert.True(stopwatch.Elapsed.Milliseconds < 1000);
}

Output

Building 1000000 words...
Processing 1000000 words
Begin soundex search...
Time taken: 00:00:00.4232135
Results retrieved: 1230

####Querying 1 million words that sound like any of ten supplied words [Test] public void SearchSoundex_OneMillionWordsComparedToTenWords_UnderOneSecond() { //Arrange Console.WriteLine("Processing {0} words", words.Count);

    var stopwatch = new Stopwatch();
    Console.WriteLine("Begin soundex search...");
    stopwatch.Start();

    //Act
    var result = words.Search(x => x).Soundex("historians", "often", "articulate", "great", "battles",
                                              "elegantly", "without", "pause", "for", "thought").ToList();
    stopwatch.Stop();
    Console.WriteLine("Time taken: {0}", stopwatch.Elapsed);
    Console.WriteLine("Results retrieved: {0}", result.Count);
    //Assert
    Assert.True(stopwatch.Elapsed.Milliseconds < 1000);
}

Output

Building 1000000 words...
Processing 1000000 words
Begin soundex search...
Time taken: 00:00:00.5468836
Results retrieved: 7149

Pedal to the metal

I am really pleased with these performance results. I have not seen how this performs against other Soundex processors out there but I hope it will be competitive. It was not always this perfomant and has only got to this stage through a lot of refactoring that was backed up by my test cases.

You can see the full source code for the SoundexProcessor the projects github repository

Performance tweaks

####Regex is cool! but slooooooooow An initial implementation of the SoundexProcessor used Regex. This was great as it enabled me to easily pass my tests, often by simply updating the regex codes. However, when it came to performance tests, processing 1 million records was taking close to 20 seconds.... With my tests as a safety net I was able to rewrite the functionality the regex pattern was providing by simply analysing each charachter myself.

char.ToUpper() vs char.IsUpper()

Click here for the code this relates section relates to

Another small improvement I made was instead of converting each character to it's upper counterpart and comparing that character to a set of upper case characters I simply check if it is an upper case character and perform the relevant checks based off of that character. This small change saved around 20% of the total running time at that point in time,


If you would like to know more about the soundex search functionality, please get in touch by adding a comment below or you can contact me on twitter (@ninjanye)