I develop several web sites with dictionary search engines where you can search for words in a lot of different ways. Because the flexible searches use regular expressions, normal indexes in database engines are almost useless.
The current solution is to load all objects into memory. The search data is fairly static and does not grow much, so this works quite well. The current solution usually finds the results within a few hundred milliseconds (0.001 to 0.5 seconds).
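To illustrate the approach, here is a minimal sketch of such an in-memory regex scan (the word list and the `Search` helper are illustrative, not my actual code):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical in-memory search: every lookup is a full scan with a
// regex, which is why ordinary database indexes don't help here.
public static class RegexScan
{
    public static List<string> Search(IReadOnlyList<string> words, string pattern)
    {
        // RegexOptions.Compiled trades startup cost for faster matching,
        // which pays off when one pattern is run over millions of words.
        var regex = new Regex(pattern, RegexOptions.Compiled);
        return words.Where(w => regex.IsMatch(w)).ToList();
    }
}
```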
Recently I have been looking at the graph database Neo4j, but because of the kind of searches that are done on the pages, the results on test data were incredibly poor. Most queries took 2 to 9 seconds, and that was on test data that was only a small part of the real data, with only 5% of the relations generated. I might give it another try with real data when Neo4j 2.0 is released as stable, but I don't think it will perform well in this special case.
So the next step is to do some in-memory indexing on my own. The first question was to find out which of the collections in .NET is the fastest and which uses the least memory.
The tests were all done on my laptop with as few programs running as possible. This should make the results fairly reliable, but I can't guarantee that they are fully correct. Still, I trust them and will base my future development on them.
The time column shows the total time of a fixed number of lookups.
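For the curious, here is a minimal sketch of how size and time can be measured (an illustration of the method, not the exact test code; the lookup count and random key pattern are assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Sketch of a measurement harness (not the original benchmark code).
public static class CollectionBenchmark
{
    const int Count = 10_000_000;
    const int Lookups = 1_000_000; // assumed fixed number of lookups per run

    public static void Run()
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);

        var dict = new Dictionary<int, string>(Count);
        for (int i = 0; i < Count; i++)
            dict[i] = "word" + i; // a unique string per key

        long after = GC.GetTotalMemory(forceFullCollection: true);
        Console.WriteLine($"Size: {(after - before) / (1024 * 1024)} MB");

        // Time a fixed number of random lookups.
        var rnd = new Random(42);
        long checksum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Lookups; i++)
            checksum += dict[rnd.Next(Count)].Length; // touch the value so the lookup can't be skipped
        sw.Stop();
        Console.WriteLine($"Time: {sw.ElapsedMilliseconds} ms (checksum {checksum})");
    }
}
```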
One string stored 10 million times with <int> Key
This test stores the same string in every record in the collection. Because of this, the difference in size should only reflect the key index.
|  | Size (MB) | Time (ms) |
| --- | --- | --- |
| Dictionary<int, string> | 267 | 235 |
| SortedDictionary<int, string> | 534 | 4484 |
| Hashtable | 547 | 1851 |
| SortedList<int, string> | 114 | 2156 |
10 million unique strings stored with <int> Key
This test stores a different string in every record in the collection, which gives a more realistic picture of memory usage.
|  | Size (MB) | Time (ms) |
| --- | --- | --- |
| Dictionary<int, string> | 706 | 368 |
| SortedDictionary<int, string> | 973 | 4590 |
| Hashtable | 985 | 2286 |
| SortedList<int, string> | 553 | 2122 |
10 million unique strings stored with <string> Key
If I could use strings as keys I would be able to use the collections in more ways than one. But as you can see, some of the lookups became more than 10 times slower than in the tests above.
|  | Size (MB) | Time (ms) |
| --- | --- | --- |
| Dictionary<string, string> | 706 | 1957 |
| SortedDictionary<string, string> | 973 | 66057 |
| Hashtable | 757 | 2950 |
| SortedList<string, string> | 591 | 74534 |
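A side note on string keys: the sorted collections compare strings with the culture-aware default comparer, which is likely part of why they fall so far behind here. Passing StringComparer.Ordinal is one thing worth trying (an idea, not something measured above):

```csharp
using System;
using System.Collections.Generic;

// The default comparer for sorted string collections is culture-aware;
// StringComparer.Ordinal compares raw character values and is cheaper.
var sorted = new SortedDictionary<string, string>(StringComparer.Ordinal);
sorted["regex"] = "a pattern-matching language";

// Dictionary<string, ...> already hashes and compares ordinally by
// default, so the explicit comparer mainly documents intent here.
var dict = new Dictionary<string, string>(StringComparer.Ordinal);
dict["regex"] = sorted["regex"];
Console.WriteLine(dict["regex"]); // one hash + one ordinal equality check
```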
1 million sub-collections with 10 strings each with <int> Key
This was another test I did to see if it would be useful to store the data in a tree, where each record in the collection holds another collection. This is a special case for my needs and I don't expect anyone else to have any use for it. But I did the test, so here are the numbers.
|  | Size (MB) | Time (ms) |
| --- | --- | --- |
| Dictionary<int, string> | 431 | 273 |
| SortedDictionary<int, string> | 656 | 801 |
| Hashtable | 733 | 1634 |
| SortedList<int, string> | 245 | 618 |
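For reference, a sketch of the kind of nested layout this test describes (the exact inner collection type is an assumption):

```csharp
using System.Collections.Generic;

// Assumed tree layout: the outer key selects a bucket of ten strings,
// the inner key selects the string itself.
var tree = new Dictionary<int, Dictionary<int, string>>(1_000_000);
for (int outer = 0; outer < 1_000_000; outer++)
{
    var bucket = new Dictionary<int, string>(10);
    for (int inner = 0; inner < 10; inner++)
        bucket[inner] = $"word{outer}_{inner}";
    tree[outer] = bucket;
}

// A lookup is two hash probes instead of one.
string hit = tree[123_456][7];
```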
Conclusion
The collection with the best performance is Dictionary<>, and the collection with the most compact footprint is SortedList<>.
The most surprising result to me was how badly SortedDictionary<> performed.
References
In an earlier post I linked to this blog, which has a great post about the performance of the different collections. It has a nice list of all the different collections, but it says nothing about memory usage.
http://geekswithblogs.net/BlackRabbitCoder/archive/2011/06/16/c.net-fundamentals-choosing-the-right-collection-class.aspx