Distributed Lock Manager
The Distributed Lock Manager is the component of the OpenVMS clustering software responsible for managing the nodes' access to shared resources. The first lock manager, which provided synchronization services for multiple processes residing on a single processor as well as deadlock detection, appeared in VAX/VMS V3.0 in 1982. The Distributed Lock Manager was designed by Steve Beckhardt and released in 1984 with VAX/VMS V4.0.
A resource is an entity the access to which is managed by the DLM: a file, a device, a volume, records within a file, cache buffers, etc. Each resource is represented by a unique abstract name that is agreed upon by all the cooperating processes. This name is entered into a distributed global namespace that is maintained by the DLM. When a process needs to access a resource, it requests a lock on that resource name from the DLM, and when that lock is granted, it accesses the resource. The lock manager does not actually allocate or control the resource, and that name does not have to represent an actual physical resource. This permits the lock manager services to be used for event notification and other communication functions, in addition to mutual exclusion functions. These names have common prefixes such as SYS$ for OpenVMS executive or F11B$ for XQP.
To permit maximum concurrency, resource names can be tree structured: for example, a device could be a root resource, consisting of files, consisting of records. Many resources such as databases have an inherent hierarchical structure that permits different parts to be accessed by different processes at the same time.
Locks have the following modes associated with the type of access the process will perform on the resource:
- Protected Read (PR, share lock): grants read access to the resource and allows its sharing with other readers. No writers are allowed access to the resource.
- Protected Write (PW, update lock): grants write access to the resource and allows its sharing with concurrent-read readers. No other writers are allowed access to the resource.
- Concurrent Read (CR): grants read access to the resource and allows its sharing with other readers and writers.
- Concurrent Write (CW): grants write access to the resource and allows its sharing with other writers.
- Exclusive (EX): grants write access to the resource and prevents its sharing with any other readers or writers.
- Null (NL): signifies future interest in the resource; a placeholder for lock conversions.
The following table shows whether a lock in a given granted mode (rows) is compatible with a lock requested in another mode (columns):

Granted \ Requested   NL    CR    CW    PR    PW    EX
NL                    yes   yes   yes   yes   yes   yes
CR                    yes   yes   yes   yes   yes   no
CW                    yes   yes   yes   no    no    no
PR                    yes   yes   no    yes   no    no
PW                    yes   yes   no    no    no    no
EX                    yes   no    no    no    no    no
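The compatibility rules can be sketched in Python; the table and the `compatible` helper below are illustrative, not part of the OpenVMS API:

```python
# Sketch of the DLM lock-mode compatibility matrix, using the OpenVMS
# abbreviations: NL (Null), CR (Concurrent Read), CW (Concurrent Write),
# PR (Protected Read), PW (Protected Write), EX (Exclusive).

COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}

def compatible(granted: str, requested: str) -> bool:
    """Return True if a lock in `requested` mode can be granted while
    another lock is held in `granted` mode."""
    return requested in COMPATIBLE[granted]
```

Note that the matrix is symmetric: whether two locks can coexist does not depend on which was granted first.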
The services provided by the lock manager are $ENQ (lock) and $DEQ (unlock). The $ENQ system service allows a process to request a lock on a resource from the lock manager. If the resource is currently locked by another process and the current lock mode is incompatible with the requested mode, the process requesting the lock may either wait in the RWSCS state or continue execution; in the latter case, the $ENQ service delivers an asynchronous system trap (AST) when the lock request is granted. The caller can also specify that the request should not be queued; in that case, if the lock cannot be granted immediately, an error status is returned and the requesting process continues.
Applications may dynamically change their locking protocol between "blocking AST" and "request-release". Blocking AST means that after acquiring a lock on a resource, the process does not release a lock even when it is done working with the resource until it gets a blocking AST that another process is waiting for it to release the lock. Request-release means that right after a process is done with a resource, it releases the lock. Blocking AST is used for periods of low contention and the request-release protocol is used during periods of high contention. Another use for blocking AST is an implementation of a "doorbell" notification where a process takes out a lock and specifies a blocking AST, and when another process wants that first process's attention, it makes an incompatible lock request so that an AST is delivered to the first process.
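The doorbell pattern can be sketched as follows; the class and callback names are purely illustrative stand-ins for the blocking-AST machinery:

```python
# Sketch of the "doorbell" pattern: a process holds a lock with a
# blocking AST; an incompatible request from another process triggers
# the AST, delivering the notification.
class DoorbellResource:
    def __init__(self):
        self.holder_blkast = None

    def hold(self, blocking_ast):
        """First process takes out a lock (e.g. EX) and registers an AST
        to be called when someone requests an incompatible mode."""
        self.holder_blkast = blocking_ast

    def ring(self):
        """Second process makes an incompatible lock request purely to
        get the holder's attention; the DLM delivers the blocking AST."""
        if self.holder_blkast is not None:
            self.holder_blkast()

rang = []
bell = DoorbellResource()
bell.hold(lambda: rang.append("wake up"))
bell.ring()   # the holder's AST routine runs
```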
A lock value block is a 16-byte piece of memory associated with each resource that cooperating processes can use to share a small amount of information about the resource. The contents of the value block are updated when appropriate and optionally returned by $ENQ and $DEQ.
The directory service is used to locate the current resource manager of a resource (which may change over time). Every node in the cluster is the directory node for a subset of the resource trees, and lock requests for those trees initially go to it. The directory node maintains a lock directory for the resource trees it is responsible for, keeping track of the current resource manager for those resources.
The directory node for a given resource tree is determined using a hashing mechanism: the resource name specified by the lock request is hashed, and the resultant value is applied to a vector containing zero or more entries for every node currently in the cluster. The selected vector entry identifies the directory node for the resource specified. This vector is maintained by the Connection Manager and is updated every time a node joins or leaves the cluster. Each node can request that it be entered zero or more times in the directory vector, depending on the extent to which the node wants to participate in the distributed directory function, by setting its LOCKDIRWT system parameter.
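The directory-vector lookup can be sketched as follows; the node names, the weight values, and the use of CRC-32 as the hash are illustrative assumptions (OpenVMS uses its own hash function):

```python
# Sketch of directory-node selection: hash the root resource name into
# a vector whose entries repeat each node according to its LOCKDIRWT,
# weighting the directory workload accordingly.
import zlib

def build_directory_vector(lockdirwt):
    """lockdirwt maps node name -> LOCKDIRWT (0 = no directory duty)."""
    vector = []
    for node, weight in sorted(lockdirwt.items()):
        vector.extend([node] * weight)
    return vector

def directory_node(resource_name, vector):
    # Any stable hash works for the sketch; OpenVMS uses its own.
    return vector[zlib.crc32(resource_name.encode()) % len(vector)]

vector = build_directory_vector({"NODE_A": 1, "NODE_B": 2, "NODE_C": 0})
# NODE_C never appears in the vector, so it serves no directory duty,
# while NODE_B receives roughly twice as many lookups as NODE_A.
```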
The node requesting the lock sends the lock request to the directory node. The directory node then has three options:
- if it is also the resource manager for the resource, it handles the lock request itself;
- if another node is the resource manager, it points the requesting node to that node;
- if the requesting node is the resource manager, or if there is no resource manager yet, it instructs the requesting node to handle the lock itself. It also creates a directory entry for the resource, so that subsequent lock requests on that resource can be directed to the new resource manager.
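The three cases can be sketched as a single decision function; the function, its return strings, and the plain-dictionary directory are illustrative:

```python
# Sketch of how a directory node answers a lock request, following the
# three cases described above. `directory` maps resource name -> the
# node currently acting as resource manager.
def handle_directory_request(directory, resource, requester, self_node):
    master = directory.get(resource)      # current resource manager, if any
    if master == self_node:
        return "handled locally"          # case 1: we are the master
    if master is not None and master != requester:
        return f"resend to {master}"      # case 2: point requester at it
    # case 3: the requester is (or now becomes) the resource manager
    directory[resource] = requester
    return "master it yourself"
```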
Once a lock on a root-level resource has been established, the identity of the resource-manager node is known. After that point no further messages are sent to the directory node by that node; all requests are sent directly to the resource manager. If the lock request is made on a node that is not the resource manager, two messages are required for every lock request after the first: a request and a response. This process is called remote locking.
A resource manager is the node that maintains lock information for a given resource tree and controls the granting of lock requests.
If the node holding the last remaining lock on a resource decides to release that lock, a message is sent to the directory node indicating that the node is no longer managing the resource. The directory node then deletes the directory entry for the resource. This deletion allows the next node requesting a lock on the resource to become the resource manager. If a process releasing a lock does not reside on the node that manages the resource, a message is sent to the resource manager instead. Again, if this is the last remaining lock on the resource, the resource manager sends a message to the directory node indicating that it is no longer the resource manager.
A lock conversion is the action of changing the mode of an existing lock. Conversion requests can be processed more efficiently than new lock requests because all the data structures are already in place and the resource manager has already been identified. If a conversion request is made on the node managing the resource, no messages need be exchanged. If the resource manager is not the node on which the request is being made, either one or two messages are required. In some cases in which the requested mode is compatible with the granted mode, the request can be unilaterally granted, and a single message is sent to notify the resource manager of the change. In other cases, the resource manager must make a decision based on the other requests that are granted; a request is then sent to the resource manager, which must respond. In all cases, no communication with the directory node is required.
Lock remastering is the act of moving lock mastership to another node, which also involves moving the lock information for the entire resource tree to that node. Remastering happens whenever a node leaves the cluster, and also dynamically, to minimize the overhead of sending off-node messages to the directory node and the lock master.
LOCKRMWT indicates the extent to which a node is willing to master lock trees: 0 means that no trees will be remastered to that node, i.e. it will master a tree only if it is the only node holding locks in it; 10 means that it will take over any tree whose current master's LOCKRMWT is smaller than 10. For all other values, the difference between the current and the prospective lock master's LOCKRMWT values is considered.
PE1 establishes a threshold on the number of locks in a tree eligible for moving to a new lock master. A negative PE1 means that the node's trees are never remastered; a PE1 of 0 means that trees of any size can be remastered; any other positive value is the maximum number of locks a tree may contain and still be moved.
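A simplified decision function combining the two parameters might look like the following; this is an assumption-laden sketch (the real algorithm also weighs lock activity, and the difference between LOCKRMWT values biases that comparison rather than being a strict inequality):

```python
# Sketch of the remastering decision based on LOCKRMWT and PE1.
# current_rmwt / current_pe1 belong to the node currently mastering the
# tree; candidate_rmwt belongs to the prospective new master.
def may_remaster(current_rmwt, candidate_rmwt, tree_size, current_pe1):
    if current_pe1 < 0:
        return False                      # this node's trees stay put
    if current_pe1 > 0 and tree_size > current_pe1:
        return False                      # tree too large to move
    if candidate_rmwt == 0:
        return False                      # candidate refuses mastership
                                          # (sole-interest case omitted)
    if candidate_rmwt == 10:
        return current_rmwt < 10          # eager node takes the tree
    return candidate_rmwt > current_rmwt  # simplified weight comparison
```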
Lock Database Rebuilding
When a node joins or leaves the cluster, the lock database must be rebuilt.
The lock database is rebuilt in the following fashion by each node. First, new lock requests are disabled. Then, the lock database is scanned and all directory information is removed, since a change in membership redistributes the directory functions. Information about locks that are either held or requested by processes on other nodes is also discarded. These actions result in a period of time during which no directory nodes and no resource managers exist. The only information retained concerns the lock requests made by processes actually residing on a node.
At this point the nodes re-acquire all the locks held before the membership changed, using the same algorithm by which the locks were initially acquired. Locks that were waiting to be granted are re-ordered by a sequence number that was assigned when they were queued so that the order in which they wait is preserved. By the process of re-acquiring locks, new directory entries are created and new resource managers chosen.
Since each node re-acquires its own locks, the locks held by nodes that are no longer members of the cluster are released. Once all locks have been re-acquired, an attempt is made to grant waiting locks since the removal of lock requests contributed by a failed node may permit waiting requests to be granted. Once these actions have been accomplished, locking is enabled and activity proceeds normally.
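The rebuild steps above can be sketched as follows; the node names, resource names, and dictionary representation are purely illustrative:

```python
# Sketch of the lock-database rebuild: each node discards directory
# entries and locks owned by other nodes, then re-acquires its own locks
# in the order given by their original sequence numbers.
database = [
    {"owner": "NODE_A", "seq": 42, "mode": "PR"},
    {"owner": "NODE_B", "seq": 17, "mode": "EX"},   # another node's lock
    {"owner": "NODE_A", "seq": 30, "mode": "CW"},
]

def rebuild(local_node, locks):
    # Keep only locks held or requested by processes on this node ...
    mine = [l for l in locks if l["owner"] == local_node]
    # ... and re-acquire them in their original queuing order, so that
    # waiting requests keep their place in line.
    return sorted(mine, key=lambda l: l["seq"])
```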
A multiple-resource deadlock is a condition where processes are waiting for resource locks to be released in a circular fashion: for example, both process A and process B are interested in resources C and D to complete their operations; A holds the lock for C and B holds the lock for D, so A is waiting for B to release D while B is waiting for A to release C.
There are also conversion deadlocks that involve multiple conversion requests on a single resource. For example, two CR locks are held on a single resource, and a conversion of the first lock to EX is attempted. The conversion must wait for the second lock to be released or converted to a compatible mode. If a conversion of the second lock to EX is then also requested, a conversion deadlock results: the first conversion cannot be granted while the second lock is held at its original mode, and the second conversion cannot be granted because it must wait for the first.
If a process has waited for a lock or a conversion longer than a configuration-specified timeout, a deadlock search is initiated (first for a conversion deadlock, and then for a multiple-resource deadlock). If a deadlock (a cycle of incompatible lock requests) is detected, a victim process is selected, and its lock request is completed with an error status indicating that a deadlock was found.
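The multiple-resource search can be sketched as a cycle hunt in the wait-for graph; the single-successor graph representation is a simplifying assumption (a process may in general wait on several others):

```python
# Sketch of multiple-resource deadlock detection as a cycle search in
# the wait-for graph (process -> process it is waiting on).
def find_deadlock(waits_for, start):
    """Follow the chain of waiters from `start`; returning to `start`
    means the requests form a cycle, i.e. a deadlock."""
    seen, node = set(), start
    while node in waits_for and node not in seen:
        seen.add(node)
        node = waits_for[node]
        if node == start:
            return True        # cycle back to the initiator
    return False

# The A/B example above: A waits for B (to release D), B waits for A.
assert find_deadlock({"A": "B", "B": "A"}, "A")
```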
Locks can be viewed with the SHOW LOCKS, SHOW RESOURCES, and SHOW PROCESS/LOCKS commands of the System Dump Analyzer.