TrafficScript rules often need to refer to tables of data - redirect mappings, user lists, IP blacklists and the like.
For small tables that are not updated frequently, you can place these inline in the TrafficScript rule:
$redirect = [ "/widgets" => "/sales/widgets", "/login" => "/cgi-bin/login.cgi" ]; $path = http.getPath(); if( $redirect[ $path ] ) http.redirect( $redirect[ $path ] );
This approach becomes difficult to manage if the table grows large, or if you want to update it without editing the TrafficScript rule. In this case, you can store the table externally (in a resource file) and reference it from the rule.
The following examples use a resource file that follows a simple space-separated 'key value' format, and we'll look at alternative TrafficScript approaches to efficiently handle the data and look up key-value pairs:
```
# cat /opt/zeus/zxtm/conf/extra/redirects.txt
/widgets   /sales/widgets
/login     /cgi-bin/login.cgi
/support   http://support.site.com
```
We'll propose several 'ResourceTable' TrafficScript library implementations, each exposing a lookup() function that can be used in the following fashion:
```
# ResourceTable provides a lookup( filename, key ) function
import ResourceTable as table;

$path = http.getPath();
$redirect = table.lookup( "redirects.txt", $path );
```
We'll then look at the performance of each to see which is the best.
For a summary of the solutions in this article, jump straight to libTable.rts: Interrogating tables of data in TrafficScript.
ResourceTable1
```
sub lookup( $filename, $key ) {
   $contents = resource.get( $filename );

   # Match the key at the start of a line within the file...
   if( string.regexmatch( $contents, '\n'.$key.'\s+([^\n]+)' ) ) return $1;
   # ...or at the very start of the file
   if( string.regexmatch( $contents, '^'.$key.'\s+([^\n]+)' ) ) return $1;

   return "";
}
```
This simple implementation searches the file on each and every lookup, using a regular expression to locate the key and capture the remainder of the line. It anchors the key to the start of a line so that it does not mistakenly match entries whose key merely ends with $key.
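For example (an illustrative case using hypothetical data, not entries from the file above):

```
# Sketch: why the line-start anchor matters (hypothetical entries)
$key = "/login";
$contents = "/user/login /cgi-bin/userlogin.cgi\n/login /cgi-bin/login.cgi\n";

# Unanchored, the search finds '/login' inside the '/user/login' entry
# and returns '/cgi-bin/userlogin.cgi' - the wrong value:
if( string.regexmatch( $contents, $key.'\s+([^\n]+)' ) ) $wrong = $1;

# Anchored with '\n' (or '^' for the first line), the match must start
# at the beginning of a line, so only the exact '/login' entry matches:
if( string.regexmatch( $contents, '\n'.$key.'\s+([^\n]+)' ) ) $right = $1;
```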
The implementation is simple and effective, but because each lookup scans the entire file, we would reasonably expect it to become less and less efficient as the resource file grows.
The following code builds a TrafficScript hash table from the contents of the resource file:
```
$contents = resource.get( $filename );
$h = [ ];

foreach( $l in string.split( $contents, "\n" ) ) {
   # Skip lines that don't look like 'key value'
   if( ! string.regexmatch( $l, '(.*?)\s+(.*)' ) ) continue;
   $key   = string.trim( $1 );
   $value = string.trim( $2 );
   $h[$key] = $value;
}
```
You can then quickly look up values in the hash table using $h[ $key ].
However, we don't want to have to create the hash table every time we call the lookup function; we would like to create it once and then cache it somewhere. We can use the global data table to store persistent data, and we can verify that the data is still current by checking that the MD5 of the resource file has not changed:
ResourceTable2a
```
sub update( $filename ) {
   # Store the md5 of the resource file we have cached.
   # No need to update if the file has not changed.
   $md5 = resource.getMD5( $filename );
   if( $md5 == data.get( "resourcetable:".$filename.":md5" ) ) return;

   # Do the update
   $contents = resource.get( $filename );
   $h = [ ];

   foreach( $l in string.split( $contents, "\n" ) ) {
      if( ! string.regexmatch( $l, "(.*?)\\s+(.*)" ) ) continue;
      $key   = string.trim( $1 );
      $value = string.trim( $2 );
      $h[$key] = $value;
   }

   data.set( "resourcetable:".$filename.":data", $h );
   data.set( "resourcetable:".$filename.":md5", $md5 );
}

sub lookup( $filename, $key ) {
   # Check to see if the file has been updated, and update our table if necessary
   update( $filename );

   $h = data.get( "resourcetable:".$filename.":data" );
   return $h[$key];
}
```
Version 2a: we store the MD5 of the file in the global key 'resourcetable:filename:md5', and the hash table in the global key 'resourcetable:filename:data'.
This implementation has one significant fault: if two TrafficScript rules run concurrently, they may both try to update the keys in the global data table, and the resulting race condition can leave the data inconsistent. This cannot happen on a single-core system with one zeus.zxtm process, because rules are run serially and are only pre-empted if they invoke a blocking operation; but it's entirely possible on a multi-core system, and TrafficScript does not provide mutexes or locks to protect against it.
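As an illustration, here is one possible interleaving on two cores (hypothetical and simplified):

```
# Hypothetical interleaving of two concurrent update() calls (version 2a):
#
#   core 1: sees a stale md5, begins rebuilding the table
#   core 2: also sees the stale md5, begins rebuilding the table
#   core 1: data.set( "...:data", $h1 )
#   core 2: data.set( "...:data", $h2 )
#   core 2: data.set( "...:md5", $md5_2 )
#   core 1: data.set( "...:md5", $md5_1 )
#
# Because ':data' and ':md5' are written as two separate operations, a
# reader can observe a ':data' value that does not correspond to the
# ':md5' value, and the cached table can remain stale until the file
# next changes.
```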
The simplest solution is to give each core its own private copy of the data. Because system memory is generally scaled with the number of cores, the additional overhead of these copies is usually acceptable:
ResourceTable2b
```
sub update( $filename ) {
   # Include the pid in the key names, so that each process keeps
   # its own private copy of the cached table
   $pid = sys.getPid();

   $md5 = resource.getMD5( $filename );
   if( $md5 == data.get( "resourcetable:".$pid.$filename.":md5" ) ) return;

   $contents = resource.get( $filename );
   $h = [ ];

   foreach( $l in string.split( $contents, "\n" ) ) {
      if( ! string.regexmatch( $l, "(.*?)\\s+(.*)" ) ) continue;
      $key   = string.trim( $1 );
      $value = string.trim( $2 );
      $h[$key] = $value;
   }

   data.set( "resourcetable:".$pid.$filename.":data", $h );
   data.set( "resourcetable:".$pid.$filename.":md5", $md5 );
}

sub lookup( $filename, $key ) {
   update( $filename );

   $pid = sys.getPid();
   $h = data.get( "resourcetable:".$pid.$filename.":data" );
   return $h[$key];
}
```
Version 2b: by including the pid in the name of the key, we avoid multi-core race conditions, at the expense of keeping multiple copies of the data.
data.set() and data.get() address a global key/value table. We could use that table directly, storing each key/value pair individually, rather than constructing a TrafficScript hash:
ResourceTable3

```
sub update( $filename ) {
   $pid = sys.getPid();

   $md5 = resource.getMD5( $filename );
   if( $md5 == data.get( "resourcetable".$pid.$filename.":md5" ) ) return;

   # Delete every key with this file's prefix, then rebuild the entries
   data.reset( "resourcetable".$pid.$filename.":" );
   data.set( "resourcetable".$pid.$filename.":md5", $md5 );

   $contents = resource.get( $filename );
   foreach( $l in string.split( $contents, "\n" ) ) {
      if( ! string.regexmatch( $l, "(.*?)\\s+(.*)" ) ) continue;
      $key   = string.trim( $1 );
      $value = string.trim( $2 );
      data.set( "resourcetable".$pid.$filename."::".$key, $value );
   }
}

sub lookup( $filename, $key ) {
   update( $filename );

   $pid = sys.getPid();
   return data.get( "resourcetable".$pid.$filename."::".$key );
}
```
Version 3: key/value pairs are stored directly in the global data table. All keys for a file begin with the common prefix "resourcetable&lt;pid&gt;&lt;filename&gt;:", so it's easy to delete all of the key/value pairs with a single data.reset() call before rebuilding the dataset.
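Usage is unchanged; for example (a sketch, assuming the library above is saved as a rule named ResourceTable3):

```
# ResourceTable3 provides the data.get()-based lookup( filename, key )
import ResourceTable3 as table;

$redirect = table.lookup( "redirects.txt", http.getPath() );
if( $redirect ) http.redirect( $redirect );
```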
We tested the number of lookups-per-second that each implementation could achieve (using a single-core virtual machine running on a laptop Core2 processor) to investigate performance for different dataset sizes:
| Resource file size (entries) | 10 | 100 | 1,000 | 10,000 |
|---|---|---|---|---|
| Implementation 1: simple search | 300,000 | 100,000 | 17,500 | 1,000 |
| Implementation 2: TrafficScript hash, cached in global data table | 27,000 | 2,000 | 250 | 10 |
| Implementation 3: key/value pairs in the global data table | 200,000 | 200,000 | 200,000 | 200,000 |
ResourceTable lookups per second (single core, lightweight processor)
The test exercised only the rate of lookups against resource files of various sizes; the cost of building the cached data structures (implementations 2 and 3) is a one-off cost and is not included in the results.
The degradation in performance of implementation 1 as the file size increases was to be expected: each lookup scans the entire file, so the cost grows roughly linearly with the number of entries.
The constant performance of implementation 3 was also expected: the global data table is a key/value store, and hash tables should generally give O(1) lookup speed, unaffected by the number of entries.
The abysmal performance of implementation 2 is surprising, until you note that on every lookup we retrieve the entire hash table from the global data table:
```
$h = data.get( "resourcetable:".$pid.$filename.":data" );
return $h[$key];
```
The global data table is a key/value store; all keys and values are serialized as strings. Each data.get() of the cached table must read that serialized version and reconstruct the entire hash table (up to 10,000 entries) before the O(1) lookup can even begin.
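You can observe this cost directly with a rough timing sketch (assuming sys.time.highres() and log.info() are available in your TrafficScript version; $pid, $filename and $key are set as in lookup() above, and the loops amortize timer granularity):

```
# Rough sketch: compare the cost of deserializing the cached table
# with the cost of the hash lookup itself
$t1 = sys.time.highres();
$i = 0;
while( $i < 1000 ) {
   # Each call deserializes and rebuilds the entire hash table
   $h = data.get( "resourcetable:".$pid.$filename.":data" );
   $i = $i + 1;
}

$t2 = sys.time.highres();
$i = 0;
while( $i < 1000 ) {
   # The lookup itself is a cheap O(1) operation
   $v = $h[$key];
   $i = $i + 1;
}

$t3 = sys.time.highres();
log.info( "1000 x data.get: ".($t2-$t1)."s; 1000 x hash lookup: ".($t3-$t2)."s" );
```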
Perhaps most surprising is the speed at which you can search and extract data from a string using regular expressions (implementation 1). For small and medium datasets (up to approximately 50 entries), it is the simplest and fastest method; only for larger datasets is the more complex data.get()-based key/value implementation worth considering.
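If a single library must serve files of any size, one option is to dispatch on the size of the resource file. The following is a sketch only: lookup1() and lookup3() stand for the implementations above (renamed hypothetically), and the 2048-byte threshold is an illustrative guess rather than a measured crossover point.

```
# Hypothetical hybrid: regex search for small files, per-key global
# data table (implementation 3) for large ones
sub lookup( $filename, $key ) {
   # string.len() of the raw file is a cheap proxy for the entry count;
   # the cutoff should be tuned against your own measurements
   if( string.len( resource.get( $filename ) ) < 2048 ) {
      return lookup1( $filename, $key );
   }
   return lookup3( $filename, $key );
}
```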