Author: |
Aaron Watters |
Organization: |
Computer and Information Sciences, New Jersey Institute of Technology, University Heights, Newark, NJ, 07102 (address obsolescent). |
Version: |
1.1.1.1 |
Abstract
This is the documentation for the kjbuckets C extension to Python (second release), which defines graph and set datatypes as well as an alternative dictionary data type. These types are tightly coupled at the level of C, allowing fast and powerful algebraic combinations of container objects.
This is modified version of kjbuckets. Modifications are:
updated for Python 2.0 (Berthold Hoellmann <bhoel@starship.python.net>)
now have Makefile.in/Setup for old-style compilation/configuration
now have setup.py for new-style compilation/configuration (Distutils) (Oleg Broytmann <phd2@earthling.net>)
kjbuckets.pyd Windows DLL for Python 2.0
sqlsem.py patch to use it with kjbuckets.pyd (Adnan Merican <admin@espventure.com>)
kjbuckets.pyd Windows DLL for Python 2.1
kjbucketsmodule.c patches for Python 2.1 (ActiveState 2.1.1, actually) (spex66 <spex66@gmx.de>)
kjbuckets.pyd Windows DLL for Python 2.2 (jfarr" <jfarr@speakeasy.org>)
Contents
The kjbuckets module defines three data types for Python: kjSet, kjGraph, and kjDict. These types come with a number of associated methods, including common set theoretical operations such as union, intersection, difference, composition, transposition, reachability sets, and transitive closure.
For suitably large compute intensive uses these types should provide up to an order of magnitude speedup versus an implementation that uses analogous operations implemented directly in Python.
The following discussion assumes the kjbuckets module has been compiled and installed in the Python executable. For information on how to perform such an installation, see the Python extensions manual that comes with the Python distribution.
Release 2.2 contains a number of goodies not documented here. If you want, you can try to figure them out from looking at the code!
Release 2.1 had a problem linking under Python 1.2. This has been fixed in 2.2.
This module defines three types
are initialized using the function kjbuckets.kjSet(). They are containers for Python hashable objects with no significance to redundancy and no order to members. For example _[#]:
>>> from kjbuckets import * >>> X = kjSet([1,2,3,3,5,4]); print X kjSet([1, 4, 3, 2, 5]) >>> Y = kjSet([5,5,3,3,2,1,4,4,4]); print Y kjSet([1, 4, 3, 5, 2]) >>> X == Y 1
are initialized using the function kjbuckets.kjGraph(). They relate Python hashable objects to other objects, with no significance to order or redundancies on the pairings. Technically, kjGraph defines a directed graph abstract data type. For example:
>>> G1 = kjGraph([(1,1),(1,2),(2,4),(9,6),(2,4)]); print G1 kjGraph([(1, 1), (1, 2), (9, 6), (2, 4)]) >>> G1.reachable(1) kjSet([1, 4, 2])
are initialized using the function kjbuckets.kjDict(). They map hashable objects to other objects, in a manner similar to the Python builtin Dictionary data type, except that the kjbucket implementation is slower. That is, it is slower if you use it just like another Python dictionary. It's a lot faster if you want to do compositions, intersections, and so forth using dictionaries.
And with the new release the speed difference is not so great anymore -- about 20% slower on comparable operations -- and kjDict's tend to use less space than Python dictionaries for the same contents.
Example:
>>> D = kjDict([(1,1),(1,2),(2,4),(9,6),(2,4)]); print D kjDict([(1, 2), (9, 6), (2, 4)]) >>> D * D kjDict([(1, 4)])
[1] |
Most of the examples given here use numeric elements for ease of presentation, which is bad because it's boring. It's also bad because it leaves the impression that only simple things can be archived -- which is wrong. Remember that keys may be any hashable type (which even includes user defined classes which have a hash method defined), and for dictionaries and graphs the left members may be any Python object whatsoever. |
Each of the initialization functions accept four possible argument sequences:
Results in the creation of a smallest empty object of the requested type. For example kjSet(), creates the smallest possible empty kjSet.
As illustrated above, the structures may be initialized with a list or tuple of contents, where the elements of the sequence are tuples of form (hashable object, object) pairs for kjDicts and kjGraphs and just hashable objects for kjSets. The examples given here use lists as the top level structure for the sequence initialization form, but you can also use tuples. For example as in:
>>> kjDict( ( (1,2), (2,3), (2,4), (3,4) ) ) kjDict([(1, 2), (2, 4), (3, 4)]) >>> kjSet( (9,2,1,9,8,7,6,4) ) kjSet([9, 6, 1, 7, 4, 2, 8])
In the case of kjDicts if there are key collisions the resulting kjDict may be dirty.
If the initializer argument is another kjTable the result will be the input table "coerced" to the other type (or if the types match you will get "first-level" copy of the table. The new object will be a distinct table which shares object references with the input table. For example:
>>> G kjGraph([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (0, 5), (1, 0), (2, 1)]) >>> kjDict(G) kjDict([(0, 5), (1, 0), (2, 1), (3, 3), (4, 4)]) >>> kjSet(G) kjSet([0, 1, 2, 3, 4]) >>> G2=kjGraph(G) >>> G2 kjGraph([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (0, 5), (1, 0), (2, 1)]) >>> G[12]=3 >>> G kjGraph([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (0, 5), (1, 0), (2, 1), (12, 3)]) >>> G2 kjGraph([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (0, 5), (1, 0), (2, 1)])
Coercing a graph to a dictionary where the graph maps the same object to several objects will produce a "dirty" dictionary with key collisions decided arbitrarily. Coercing a set to a graph or dictionary produces an "identity" containing (x,x) for each element x of the Set. Coercing a graph or dictionary to a set produces the set of keys (left members) from the graph or dictionary. To get the "set of arcs" from a graph use kjSet(G.items()) instead of kjSet(G).
This option refers to the internal implementation of the types. Internally these types are implemented using arrays. Sometimes these arrays need to be resized to a larger size before an insert can complete. By initializing using a single integer argument n, you request that the structure be large enough that no resize will be needed until after n inserts (in the absense of deletions). For example S = kjSet(1000) initializes a set that will not need to be resized until after 1000 inserts have completed.
However, since deletes sometimes trigger the array to resize to a smaller size, deleting an element from S before insert number 1000 may make resizing necessary anyway. Them's the breaks.
Using this option may save some time and prevent some unnecessary memory fragmentation, when the programmer can determine (or guess) the expected number of insertions, a priori.
There is a peculiar way to initialize a kjDict:
>>> kjUndump(("name","age"), ("aaron",12)) kjDict([('name', 'aaron'), ('age', 12)]) >>> kjUndump(("ssnum",),"123456789") kjDict([('ssnum', '123456789')])
This is a parallel operation to kjDict.dump which together are designed to make it easy to pack and unpack information from kjDicts, in particular for constructing database-style indices. There are two behaviors for this function. Called with arguments of form:
kjUndump( (key,), map )
(ie, the first argument is a tuple of length one and map is any object) the result is the same as:
kjDict( [ (key, map) ] )
Alternatively, called with two tuples of the same length with lengths larger than 1 the invocation:
kjUndump( (k1, k2, ..., kn), (m1, m2, ..., mn) )
produces the same result as:
kjDict( [ (k1,m1), (k2,m2), ..., (kn, mn) ] )
If the same key is mentioned twice in the first argument and the corresponding values in the second argument are not equal the result will be a dirty dictionary.
A table which has had a non-monotone update (ie, a deletion or a dictionary overwrite) is said to be "dirty." In particular any deletion makes a table dirty; and coercing a graph to a dictionary, or transposing a dictionary, or unioning a set or dictionary with a dictionary will produce dirty dictionaries if the computation results in any key collisions. To test whether a table is dirty use the X.Clean() method which produces X if X is clean, otherwise None. For example:
>>> G = kjGraph([(0, 0), (0, 1), (1, 4), (9, 9), (2, 5)]) >>> D = kjDict(G); print D; print D.Clean() kjDict([(0, 1), (1, 4), (9, 9), (2, 5)]) None >>> D2 = kjDict(D); print D2.Clean() kjDict([(0, 1), (1, 4), (9, 9), (2, 5)])
Here D is dirty because the coercion from a graph resulted in key collisions on 0, but the fresh copy D2 is not dirty. The result of an algebraic expression involving a dirty table will be dirty also, for example:
>>> D3 = D2 * D >>> print D3, D3.Clean() kjDict([(0, 4), (9, 9)]) None
Note that, for example kjDict([(1,2),(1,3)]) will be dirty but kjDict([(1,2),(1,2)]) is not, i.e., inserting the same pair twice is not considered a collision.
These types have a number of associated methods, operations, and accessors. For the purposes of discussion assume that S is a kjSet, D is a kjDict, and G is a kjGraph in the remainder. Furthermore assume X is an object of any of these types.
There are a number of methods associated with each member of these types.
respectively are membership tests for the types. Each returns 1 if the object or pair are members of the structure or 0 otherwise.
respectively add new members to the object. These are equivalent to G[src]=dst, D[arg]=map, S[ob]=1 but the former may be preferrable for graphs and sets since they are less misleading. This is an "in place" mutation operation -- it will raise an error if the object has been hashed.
respectively delete a pair from the structure or raise an error if the pair is not found. This is an "in place" mutation operation -- it will raise an error if the object has been hashed.
determines whether a given key value occurs in the structure. In the case of sets this is identical to the membership test. In the case of dictionaries and graphs the function tests whether key occurs as a left member of some pair in the structure and returns 1 if so, otherwise 0.
selects an arbitrary key from the structure. In the case of sets it returns an arbitrary member of the set. In the case of graphs and dictionaries it picks an arbitrary left member of a pair in the structure. This operation is useful for algorithms that begin "pick an arbitrary node of the graph..." This method is "nondeterministic" in the sense that tables with the same members may choose different keys.
determines whether X is a subset of Y. Returns 1 if so, else 0. X and Y may be of different types but may be confusing if one argument is a set and the other is not. If X is a set and Y is a graph or dictionary then subset will succeed if and only if Y contains (e,e) for each member e of X. If Y is a set and X is a graph or dictionary then subset will succeed if and only if every key of X is a member of Y.
returns a list of the objects y where (key, y) is a member of G. For example:
>>> G = kjGraph([(0, 0), (1, 1), (0, 4), (1, 5), (2, 2), (2, 6)]) >>> G.neighbors(1) [1, 5]
If the key is absent from the table the result will be the empty list. This method is also defined for dictionaries, where the only possible results are a unary list if the key is present or an empty list if the key is absent.
returns a kjSet of objects reachable on any path in the graph that begins at key. The key itself will occur in the result only if it lies on a loop of the graph. For example:
>>> G = kjGraph([(1, 0), (4, 1), (0, 2), (3, 2), (6, 3), (2, 4), (5, 0)]) >>> G.reachable(5) kjSet([0, 4, 1, 2])
Again this method is also defined for dictionaries. The method returns a kjSet rather than a list because this made sense to me at the time.
returns a list of the members of the structure. For example:
>>> X = kjSet([0, 1, 2, 0, 1]) >>> X.items() [1, 0, 2] >>> X = kjGraph([(3, 0), (2, 2), (1, 2), (2, 0), (2, 0), (3, 0)]) >>> X.items() [(1, 2), (3, 0), (2, 2), (2, 0)]
return the left members and right members of pairs in the graph G respectively. For example:
>>> G = kjGraph([(4, 8), (0, 9), (1, 10), (4, 9), (3, 7), (3, 8), (2, >>> 7)]) >>> G.keys() [4, 0, 1, 3, 2] >>> G.values() [8, 9, 10, 9, 7, 8, 7]
Note that keys eliminates redundancies, whereas values does not. These functions are also defined for dictionaries but are not defined for sets.
generates an "identity dictionary" from the set S, the graph containing exactly those members (x,x) where x is a member of S. For example, the following calculation determines the "self-loop" elements of G:
>>> G kjGraph([(0, 0), (0, 3), (0, 2), (1, 4), (9, 9), (2, 5)]) >>> I = kjSet(G).ident() >>> I & G kjGraph([(0, 0), (9, 9)])
(In the previous release ident produced a graph, but now that the algebraic operators have been generalized I opted to produce the more specific dictionary type. This operation is now redundant since it is the same as kjDict(S).)
generates the transitive closure graph derived from the graph G. For example:
>>> G = kjGraph([(1, 3), (4, 1), (3, 0), (3, 1)]) >>> G.tclosure() kjGraph([(1, 3), (4, 1), (1, 0), (1, 1), (4, 3), (3, 0), (3, 1), (3, 3), (4, 0)]
produces None if table X has experienced a non-monotone update (a deletion or a dictionary key collision) or was algebraically derived from a table that had experienced a non-monotone update, in all other cases it returns the table X itself. This is particularly useful for testing whether the unions of dictionaries or the transpose of a dictionary was unambiguous. For example:
>>> D = kjDict([('name', 'A. Watters'), ('ssn', 123)]) >>> D2 = kjDict([('ssn', 999), ('salary', 9000000)]) >>> D3 = D + D2; print D3 kjDict([('name', 'A. Watters'), ('ssn', 999), ('salary', 9000000)]) if D3.Clean() != None: ... print D3["name"], " makes ", D3["salary"] ... else: ... print "ambiguous dictionary union" ... ambiguous dictionary union
Relational natural join anyone?
force a table to appear to be clean or dirty respectively, both returning None. Included for completeness.
produces a dictionary that is the result of remapping D by X, but it produces None if the remapping causes a key collision. For example to rename keys l and f to lname and fname respectively, preserving ssn, equating ssn with enum, and disregarding all other keys for D we could write. For example:
>>> D = kjDict([("f","aaron"), ("l","watters"), ("m","robert"), ("ssn",123)] ) >>> G = kjGraph() >>> G["ssn"]="enum" >>> G = (G + ~G).tclosure() # symmetric and transitive closure >>> G["lname"] = "l"; G["fname"] = "f" >>> D.remap(G) kjDict([('enum', 123), ('ssn', 123), ('lname', 'watters'), ('fname', 'aaron')])
This may seem strange, but it can be a very useful way of transforming collections of dictionaries. This operation is exactly the same as kjDict(X*D).Clean() but faster. (I use it a lot, so I optimized it -- it can correspond to projection, equality selection, and renaming in the relational algebra).
packs right members of a dictionary into a compact form. This function has two behaviors:
>>> D = kjUndump(("name","age","ssn"), ("aaron",12,12345)) >>> D kjDict([('name', 'aaron'), ('age', 12), ('ssn', 12345)]) >>> D.dump(("ssn",)) 12345 >>> D.dump(("name","ssn")) ('aaron', 12345)
Called with an argument of form:
D.dump( (key,) )
(ie, a tuple of length one) it produces the same result as:
D[key]
Alternatively, called with an argument of form:
D.undump( (k1, k2, ..., kn) )
(ie, a tuple of length greater than one) it produces that same result as:
( D[k1], D[k2], ..., D[kn] )
This function is the parallel operation to the dictionary initializer kjUndump, which together are designed to make it easy to pack and unpack information from kjDicts. It is also defined on graphs, in which case the choice of for the resulting mapped items may be arbitrary.
return the number of entries in X (which is the number of pairs in the case of graphs or dictionaries).
deletes the key from the structure. In the case of sets, this simply removes an element. In the case of dictionaries and graphs this method removes all entries with left member key. For example:
>>> G = kjGraph([(1, 3), (4, 1), (3, 0), (3, 1)]) >>> del G[3] >>> G kjGraph([(1, 3), (4, 1)])
This is an "in place" mutation operation -- it will raise an error if the object has been hashed.
These types are hashable, that is, they may be used as keys in hash structures and you may apply the function hash(X) to them. The kjGraph and kjDict structures also allow hashing even if some of their right members are unhashable. The "down side" of this "hashing unhashables" feature is that if two structures of the same type only differ on their unhashable right members they will hash to the same value -- which can make hash table look-ups slow. A "rule of thumb" is to only use kjDicts and kjGraphs as keys of a hash table structure if the set of keys is expected to nearly always differ on hashable components.
However, once a table's hash value has been computed for any reason, that table becomes immutable -- any attempts to mutate the structure in place (using index assignment, del, X.delete_arc, or X.add) will raise a TypeError.
Objects of these types may be compared for equality where X==Y succeeds if and only if X and Y contain the same members. Mixed type equality comparisons between kj-tables are allowed, where if S==D succeeds if and only if D consists of the pairs (e,e) for each element e of S, and similarly for S==G.
Objects of these types may also be used as booleans where only an empty structure is equivalent to false.
One questionable aspect of the implementation is the use of the indexing notation. Although it may be completely avoided, both kjSets and kjGraphs allow indexing. In the case of sets S[object]=anything inserts the object as a member of the set and disregards anything, and a retrieval S[object] returns 1 if object is a member of the set or raises an key error otherwise. For example:
>>> S kjSet([1, 3, 2]) >>> S["this"] = "that" >>> S kjSet([1, 3, 2, 'this']) >>> S["this"] 1 >>> S["that"] KeyError: that
In the case of graphs G[object]=map adds (object, map) as a new arc of the graph, and G[object] retrieves an arbitrary neighbor associated with object, or raises a KeyError if there is none. For example:
>>> G kjGraph([(1, 3), (4, 1)]) >>> G[1] = 9 >>> G kjGraph([(1, 3), (4, 1), (1, 9)]) >>> G[1] 3 >>> G[6] KeyError: 6
Some may find this use of indexing notation non-intuitive, but others may find it appealing, as far as I know.
Index assignment is an "in place" mutation operation -- it will raise an error if the object has been hashed.
The implementation provides a number of common set theoretical operations over these structures. All the set algebraic operations are side effect free (and they may be applied to tables which have been hashed). These operations may be applied to tables with differing types, except where noted. Except for intersection and difference, a binary operation applied to objects of different types produces an object of the "more general" type, i.e, S+D produces a (possibly dirty) dictionary, S+G produces a graph, D+G produces a graph. Binary operations applied to objects of the same type produces an object of that type.
Generally, when a set S is used in permitted mixed-mode algebra with a graph or a dictionary it "acts like" the identity dictionary S.ident().
The built in algebraic operations are as follows.
produces the union of two structures of the same type, invoked using either the notation X+Y or X|Y. For example:
>>> kjGraph([(1,3), (4,1), (1,9)]) + kjSet([6,7,2]) kjGraph([(1, 3), (4, 1), (1, 9), (6, 6), (7, 7), (2, 2)])
If dictionary D1 contains (key, map1) and dictionary (or set) D2 contains (key, map2) then D1+D2 will be a dirty dictionary containing one of the pairs, but not the other.
produces the set difference of two structures of the same type, invoked using the notation X-Y. For example:
>>> kjSet([1,2,5,7]) - kjSet([1,2,4,8]) kjSet([7, 5])
Differences of graphs and dictionaries are allowed, where X-Y produces an object of the same type as X, but mixed differences are not allowed when one of the arguments is a set (yet).
with notation G1*G2 produces the graph containing (s1,d2) whenever there is an arc (s1,d1) in G1 and an arc (d1,d2) in G2}. For example:
>>> G1 = kjGraph([(0, 1), (1, 2), (3, 0), (3, 4), (2, 3)]) >>> G2 = kjGraph([(4, 0), (0, 1), (1, 2), (3, 1), (2, 0)]) >>> G1*G2 kjGraph([(0, 2), (1, 0), (3, 1), (3, 0), (2, 1)])
Any two tables can be composed, producing an object of the more general type. Composing two sets is a slower way to compute their intersection.
with notation ~G produces the graph containing (d, s) if and only if G contains (s, d). For example:
>>> G = kjGraph([(0, 0), (3, 2), (6, 4), (20, 1), (23, 3), (26, 5)]) >>> ~G kjGraph([(0, 0), (4, 6), (1, 20), (3, 23), (2, 3), (5, 26)])
Transposition is defined for dictionaries, but if there are key collisions the winning pair will be decided arbitrarily and the resulting table will be dirty. For example:
>>> ~kjDict([("hello","hi"), ("hola","hi"), ("beat it","bye")]) kjDict([('bye', 'beat it'), ('hi', 'hola')])
This operation is not defined for sets.
produces the set intersection of two structures invoked using the notation X&Y. For example:
>>> G = kjGraph([(0,0), (3,2), (6,4), (20,1), (23,3), (26,5), (2,23)]) >>> G & ~G.tclosure() kjGraph([(0, 0), (3, 2), (23, 3), (2, 23)])
Mixed mode intersections between graphs and dictionaries are allowed producing the less general dictionary type. Mixed mode intersections where one of the arguments is a set is not permitted.
Note: The graph and dictionary operations of composition, reachability, transitive closure, and transposition assume that "right members" (values) are hashable. If any right member is not hashable these functions may raise a TypeError, for example:
>>> X = kjGraph([ (1,{}) ]) >>> ~X TypeError: unhashable type
Here the empty Python dictionary is not a hashable type, so it could not be used in the transposed graph as a left member.
These structures use a hash table based representation which should deliver expected good performance for many applications. Nevertheless, as with all hash implementations there is a theoretical possibility of very bad worst case performance. Furthermore, inserts and deletes occasionally cause the internal structure to resize, so although the average speed for inserts and deletes is expected to be "near constant", once in a while an insert or delete may be slow.
In addition, since the kjGraph implementation hashes using the left member only from each graph arc, graphs where many nodes have a very large number of neighbors may have poor access times. In this case it may appropriate to use a "set of pairs" or a "dict of sets" representation in place of a kjGraph, if this is possible, or some alternative implementation.
The implementation of G.tclosure is "quick and dirty (keep it simple, stupid)" and leaves much room for speed improvements. It may be slow for large and complex graphs. If this is a problem I might be enticed to improve it, let me know.
Someday I'd like to make the deletion operations faster (by a constant factor), but I'm not highly motivated here since I personally tend to build up tables without ever deleting anything.
Once again I'd like to commend Guido and the other Python contributors on their work. It's a delight to know that Python is nice both at the front end and at the back end.
The package is written in C but descends from an ancestor (not suitable for public viewing) which was written exclusively in Python. I wrote this module (1) as an experimented in extending Python using C and (2) as an experiment in migrating a Python implementation to a C implementation. The result is a package which I hope may be useful to someone.
This release is about twice as fast as previous releases thanks to permiscuous use of C macros in the implementation. Additionally, mixed-type operations, coercions, and a few additional methods have been added in this release.
There is one defined constant in the C code you might want to play with: GSIZE -- the number of elements of the table heaped together in one "lump" (i.e, the size of an unordered subarray of the table). Roughly speaking if GSIZE is large the table will resize less often, and usually use space more efficiently. Generally speaking larger values will also make the accesses slower, but with a value less than around 64 this may not always be true on some machines with fancy memory caching (just guessing here, really). The default value is 6, which works pretty well on my machines. GSIZE also represents the basic size allocated for the smallest possible table, so if you expect to use lots of small sets a large GSIZE may not be advisable. Don't fiddle with the other constants unless you are willing to debug possible problems that may result.
Release 2 had a hole in the initializers that caused undefined behavior. It has been fixed in 2.1.
Release 2.1 wouldn't link under Python 1.2. This has been fixed in 2.2.
The first release would crash on certain graph operations (transpose, reachability, composition, transitive closure) applied to graphs that contained unhashable nodes. Now they raise an error instead. Previous releases also had a serious bug that sometimes corrupted the internal structure of kjSets. I don't know of any remaining "real" bugs -- the rest of this section discusses possibly confusing "features."
As mentioned above in several places, structures that have been hashed may not be subsequently modified -- attempts to modify hashed structures will raise TypeError.
Mixed mode differences and intersections are not allowed when one of the arguments is a set (as mentioned).
Some unions and transposes on dictionaries will produce a dirty dictionary if there are key collisions, and the key collisions will be decided arbitrarily. Similarly, coercing a graph to a dictionary will produce a dirty dictionary if there are key collisions. See the section on Dirtiness above.
The kjGraph implementation does not represent nodes with no edges. Programmers may work around this either by wrapping the graph in a class with a node set, or by adopting some appropriate convention that I leave to their infinitely creative imaginations.
Please let me know if you find some other bug or confusing feature. At this point I consider the package to be reasonably well tested, but I offer no warrantees.