System and method for efficient lock recovery

Abstract

A system and method are disclosed for efficient lock recovery. In one embodiment of the present invention, a multiple-node networked system shares a resource such as shared storage. Among various other aspects of the present invention, when a change occurs, such as a server failure, that prompts a recovery, the recovery can be performed in parallel among the nodes. A further aspect of an embodiment of the present invention includes recovering a lock with a higher level of exclusion prior to another lock with a lower level of exclusion.

Claims

What is claimed is:

1. A method for recovering locks comprising: determining a lock recovery should be performed; recovering a first lock; recovering a second lock; wherein recovering the first lock occurs approximately in parallel to the recovering of the second lock.

2. The method of claim 1, further comprising providing a first lock manager and a second lock manager.

3. The method of claim 1, wherein the first lock is recovered in a first node and the second lock is recovered in a second node.

4. The method of claim 3, wherein the first node is a server.

5. The method of claim 3, wherein the first node and the second node share a resource.

6. The method of claim 3, wherein the shared resource is a shared storage.

7. The method of claim 1, further comprising recovering a third lock after recovering the second lock, wherein the second lock has a higher level of exclusion than the third lock.

8. The method of claim 1, further comprising recovering a third lock after recovering the second lock, wherein the second lock is a write lock and the third lock is a read lock.

9. The method of claim 1, wherein the first lock is recovered from a first lock space to a second lock space.

10. The method of claim 1, wherein the determining the lock recovery should be performed includes recognizing a change in a lock domain membership.

11. The method of claim 1, wherein the determining the lock recovery should be performed occurs when a node fails.

12. The method of claim 1, wherein the determining the lock recovery should be performed occurs when a node is added.

13. The method of claim 1, wherein the determining the lock recovery should be performed occurs when a shared resource fails.

14. The method of claim 1, wherein the determining the lock recovery should be performed occurs when a network interconnect fails.

15. The method of claim 1, wherein the determining the lock recovery should be performed occurs when a shared resource is added.

16. A method for recovering a lock in a node comprising: determining a lock recovery should be performed; recovering a first lock; recovering a second lock after recovering the first lock; wherein the first lock is of a higher level of exclusion than the second lock.

17. The method of claim 16, further comprising providing a lock manager.

18. The method of claim 16, wherein the node is a server.

19. The method of claim 16, wherein the node shares a resource with a second node.

20. The method of claim 19, wherein the shared resource is a shared storage.

21. The method of claim 16, wherein the first lock is a write lock.

22. The method of claim 16, wherein the second lock is a read lock.

23. The method of claim 16, wherein the first lock is recovered from a first lock space to a second lock space.

24. The method of claim 16, wherein the determining the lock recovery should be performed includes recognizing a change in a lock domain membership.

25. A system for recovering a lock comprising: a processor configured to determine whether a lock recovery should be performed; recovering a first lock; recovering a second lock after recovering the first lock; wherein the first lock is of a higher level of exclusion than the second lock; and a memory coupled to the processor, wherein the memory is configured to provide instructions to the processor.

22. A system for recovering locks comprising: a first node configured to recover a first lock; a second node configured to recover a second lock; a resource shared by the first and second nodes; wherein recovering the first lock occurs approximately in parallel to the recovering of the second lock.

23. A method for recovering locks comprising: providing a lock manager in a first node; providing a lock manager in a second node; recognizing a change in a domain lock membership; recovering a first lock in the first node; recovering a second lock in the first node; wherein the first lock is of a higher level of exclusion than the second lock.

24. A method for recovering locks comprising: recognizing a change in membership; providing a first lock space; providing a second lock space; recovering a lock; recovering a second lock; wherein recovering the first lock occurs approximately in parallel to the recovering of the second lock.

25. A computer program product for recovering locks, the computer program product being embodied in a computer readable medium and comprising computer instructions for: determining a lock recovery should be performed; recovering a first lock; recovering a second lock after recovering the first lock; wherein the first lock is of a higher level of exclusion than the second lock.
Description

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Patent Application No. 60/324,196 (Attorney Docket No. POLYP001+) entitled SHARED STORAGE LOCK: A NEW SOFTWARE SYNCHRONIZATION MECHANISM FOR ENFORCING MUTUAL EXCLUSION AMONG MULTIPLE NEGOTIATORS filed Sep. 21, 2001, which is incorporated herein by reference for all purposes.

[0002] This application claims priority to U.S. Provisional Patent Application No. 60/324,226 (Attorney Docket No. POLYP002+) entitled JOURNALING MECHANISM WITH EFFICIENT, SELECTIVE RECOVERY FOR MULTI-NODE ENVIRONMENTS filed Sep. 21, 2001, which is incorporated herein by reference for all purposes.

[0003] This application claims priority to U.S. Provisional Patent Application No. 60/324,224 (Attorney Docket No. POLYP003+) entitled COLLABORATIVE CACHING IN A MULTI-NODE FILESYSTEM filed Sep. 21, 2001, which is incorporated herein by reference for all purposes.

[0004] This application claims priority to U.S. Provisional Patent Application No. 60/324,242 (Attorney Docket No. POLYP005+) entitled DISTRIBUTED MANAGEMENT OF A STORAGE AREA NETWORK filed Sep. 21, 2001, which is incorporated herein by reference for all purposes.

[0005] This application claims priority to U.S. Provisional Patent Application No. 60/324,195 (Attorney Docket No. POLYP006+) entitled METHOD FOR IMPLEMENTING JOURNALING AND DISTRIBUTED LOCK MANAGEMENT filed Sep. 21, 2001, which is incorporated herein by reference for all purposes.

[0006] This application claims priority to U.S. Provisional Patent Application No. 60/324,243 (Attorney Docket No. POLYP007+) entitled MATRIX SERVER: A HIGHLY AVAILABLE MATRIX PROCESSING SYSTEM WITH COHERENT SHARED FILE STORAGE filed Sep. 21, 2001, which is incorporated herein by reference for all purposes.

[0007] This application claims priority to U.S. Provisional Patent Application No. 60/324,787 (Attorney Docket No. POLYP008+) entitled A METHOD FOR EFFICIENT ON-LINE LOCK RECOVERY IN A HIGHLY AVAILABLE MATRIX PROCESSING SYSTEM filed Sep. 24, 2001, which is incorporated herein by reference for all purposes.

[0008] This application claims priority to U.S. Provisional Patent Application No. 60/327,191 (Attorney Docket No. POLYP009+) entitled FAST LOCK RECOVERY: A METHOD FOR EFFICIENT ON-LINE LOCK RECOVERY IN A HIGHLY AVAILABLE MATRIX PROCESSING SYSTEM filed Oct. 1, 2001, which is incorporated herein by reference for all purposes.

[0009] This application is related to co-pending U.S. Patent Application No. ______ (Attorney Docket No. POLYP001) entitled A SYSTEM AND METHOD FOR SYNCHRONIZATION FOR ENFORCING MUTUAL EXCLUSION AMONG MULTIPLE NEGOTIATORS filed concurrently herewith, which is incorporated herein by reference for all purposes; and co-pending U.S. Patent Application No. ______ (Attorney Docket No. POLYP002) entitled SYSTEM AND METHOD FOR JOURNAL RECOVERY FOR MULTINODE ENVIRONMENTS filed concurrently herewith, which is incorporated herein by reference for all purposes; and co-pending U.S. Patent Application No. ______ (Attorney Docket No. POLYP003) entitled A SYSTEM AND METHOD FOR COLLABORATIVE CACHING IN A MULTINODE SYSTEM filed concurrently herewith, which is incorporated herein by reference for all purposes; and co-pending U.S. Patent Application No. ______ (Attorney Docket No. POLYP005) entitled A SYSTEM AND METHOD FOR MANAGEMENT OF A STORAGE AREA NETWORK filed concurrently herewith, which is incorporated herein by reference for all purposes; and co-pending U.S. Patent Application No. ______ (Attorney Docket No. POLYP006) entitled SYSTEM AND METHOD FOR IMPLEMENTING JOURNALING IN A MULTI-NODE ENVIRONMENT filed concurrently herewith, which is incorporated herein by reference for all purposes; and co-pending U.S. Patent Application No. ______ (Attorney Docket No. POLYP007) entitled A SYSTEM AND METHOD FOR A MULTI-NODE ENVIRONMENT WITH SHARED STORAGE filed concurrently herewith, which is incorporated herein by reference for all purposes.
FIELD OF THE INVENTION

[0010] The present invention relates generally to computer systems. In particular, the present invention relates to computer systems that share resources such as storage.

BACKGROUND OF THE INVENTION

[0011] In complex networked systems, multiple nodes may be set up to share data storage. Preferably, in order to share storage, only one node or application is allowed to alter data at any given time. In order to accomplish this synchronization, locks that provide the necessary mutual exclusion may be used.

[0012] Frequently such locks can be managed by a single node of the networked system. This single node can be a separate computer dedicated to managing a lock for the other nodes. In such a configuration, each of the other nodes in the networked system is required to communicate with the lock manager node in order to access the locks. A potential problem occurs if the lock manager node crashes, losing the locking mechanism.

[0013] It would be desirable to be able to recover the lock states after a crash of a node that was managing the locks so that the locks can properly be synchronized. The present invention addresses such a need.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:

[0015] FIG. 1 is a block diagram of a shared storage system suitable for applying the lock recovery mechanism according to an embodiment of the present invention.

[0016] FIG. 2 is a block diagram of a situation in which recovery would be desirable according to an embodiment of the present invention.

[0017] FIG. 3 is a block diagram of software components associated with nodes according to an embodiment of the present invention.

[0018] FIG. 4 is a flow diagram of a method according to an embodiment of the present invention for recovering locks.

[0019] FIGS. 5A-5I are additional flow diagrams of a method according to an embodiment of the present invention for recovering locks.

[0020] FIG. 6 shows flow diagrams for parallel recovery of locks according to an embodiment of the present invention.

[0021] FIG. 7 shows an example of lock spaces used in recovery.

DETAILED DESCRIPTION

[0022] It should be appreciated that the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, or a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. It should be noted that the order of the steps of disclosed processes may be altered within the scope of the invention.

[0023] A detailed description of one or more preferred embodiments of the invention is provided below along with accompanying figures that illustrate by way of example the principles of the invention. While the invention is described in connection with such embodiments, it should be understood that the invention is not limited to any embodiment. On the contrary, the scope of the invention is limited only by the appended claims, and the invention encompasses numerous alternatives, modifications and equivalents. For the purpose of example, numerous specific details are set forth in the following description in order to provide a thorough understanding of the present invention. The present invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the present invention is not unnecessarily obscured.
[0024] FIG. 1 is a block diagram of a shared storage system suitable for applying the lock recovery mechanism according to an embodiment of the present invention. FIG. 1 shows a shared storage system 150. In this example, nodes 102A-102D are coupled together through a network switch 100. The network switch 100 can represent any network infrastructure such as 10 Mb or 100 Mb Ethernet, Gigabit Ethernet, Infiniband, etc. Additionally, the nodes 102A-102D are also shown to be coupled to a data storage interconnect 104. Examples of the data storage interconnect 104 include Infiniband, FibreChannel, iSCSI, shared SCSI, wireless or infra-red communication links, and proprietary interconnects. Examples of nodes 102A-102D include but are not limited to computers, servers, and any other processing units or applications that can share storage or data. The data storage interconnect 104 is shown to be coupled to shared storage 106A-106D. Examples of shared storage 106A-106D include any form of storage such as hard drive disks, compact disks, tape, and random access memory.

[0025] Although the system shown in FIG. 1 is a multiple node system, the present invention can also be used with a single computer system for synchronizing various applications as they share data on shared storage.

[0026] FIG. 2 is a block diagram of a situation in which recovery would be desirable according to an embodiment of the present invention. In this example, nodes 102A′-102D′ are shown to have mounted file systems 106A′-106B′. The term “mounted” is used herein to describe a file system 106A′-106B′ that is accessible to a particular node. In this example, node 102A′ has mounted file system 106A′, node 102B′ has mounted both file systems 106A′ and 106B′, node 102C′ has also mounted both file systems 106A′ and 106B′, and node 102D′ has mounted file system 106B′. File systems 106A′-106B′ correspond, in this example, to file system structures of data storage 106A-106D of FIG. 1.

[0027] Shared storage can be any storage device, such as hard drive disks, compact disks, tape, and random access memory. A filesystem is a logical entity built on the shared storage. Although the shared storage is typically considered a physical device while the filesystem is typically considered a logical structure overlaid on part of the storage, the filesystem is sometimes referred to herein as shared storage for simplicity. For example, when it is stated that shared storage fails, it can be a failure of a part of a filesystem, one or more filesystems, or the physical storage device on which the filesystem is overlaid. Accordingly, shared storage, as used herein, can mean the physical storage device, a portion of a filesystem, a filesystem, filesystems, or any combination thereof.
[0028] In FIG. 2 it is shown that node 102D′ has failed in some form. Any failure to communicate with the other nodes can be a reason to recover locks. For example, node 102D′ may have crashed, it may have been unplugged or suffered a power outage, or a communication interconnect may have been severed either on purpose or accidentally. Other examples of catalysts for lock recovery include a panic or catatonic state of the operating system on a node, failure of network interconnection hardware or software, general software failures, graceful or forced unmount of a shared file system by a node, and failure of a storage interconnect or a storage device. Another possible catalyst for a lock recovery is when a node is added to the shared storage system, such as the shared storage system shown in FIG. 1, and that additional node unbalances the number of nodes currently participating as lock home nodes versus non-home nodes. A heavily imbalanced system will not scale in work load as well as a balanced system, and recovery of the locks may be desirable in this situation. A lock home node, as used herein, is the server that is responsible for granting or denying lock requests for a given DLM lock when there is no sufficient cached lock reference available on the requesting node. In this embodiment, there is one lock home node per lock. The home node does not necessarily hold the lock locked itself, but if other nodes hold the lock locked or cached, then the home node has a description of the lock, since the other nodes that hold the lock locked or cached communicated with the home node of the lock in order to get it locked or cached.

[0029] FIG. 3 is a block diagram of software components associated with nodes according to an embodiment of the present invention. In this example, nodes 102A-102C are shown to be servers. They are connected through a network interconnect 100, as well as the storage interconnect 104 that couples the nodes 102A-102C to the shared storage 106A-106D. Each of the nodes 102A-102C is shown to include a distributed lock manager (DLM) 300A-300C. Accordingly, rather than having a centralized lock manager that is located in a single node and forcing the other nodes to communicate through it, the system shown in FIG. 3 includes more than one node with a lock manager 300A-300C. According to an embodiment of the present invention, each node that is configured to be able to access the shared storage 106A-106D includes a distributed lock manager 300A-300C.

[0030] Each DLM is shown to include lock domains 302A-302D. A lock domain, as used herein, includes the context in which a shared resource, such as a shared file system, is associated with a set of locks used to coordinate access to the resource by potential contenders such as nodes 102A-102C. Locking operations related to a particular file system are contained in the lock domain that is associated with that file system. In this embodiment, there is one lock domain per shared resource requiring coordinated access. For example, node 102A is shown to include lock domains 302A-302D so that node 102A has access to shared storage 106A-106D. Node 102B is shown to include lock domains 302A, 302C, and 302D, which correspond to having access to shared storage 106A, 106C, and 106D. Likewise, node 102C is shown to include lock domains 302A, 302B, and 302C, which allows access to shared storage 106A, 106B, and 106C. Within each lock domain, there is shown a primary lock space, such as lock space 304A, and a recovery lock space, such as lock space 306A. Further details of the primary lock space 304A and recovery lock space 306A will be discussed later in conjunction with the remaining figures. Nodes 102A-102C are also shown to include clients 308A-308J.
[0031] The clients 308A-308J, as used herein, are software or hardware components that can issue lock requests to the DLM and receive lock grants from the DLM. An example of a client 308A for DLM 300A is the file system module software running on node 102A with respect to a specific file system that the node has mounted. In the example shown in FIG. 3, each client 308A-308J represents a mounted file system on a specific node, managed by the logic of the file system module which runs locally on that node. The file system module software issues the necessary lock requests to the DLM instance also running on that node. In turn, each DLM can communicate with its neighboring DLM instances.

[0032] Each DLM instance may have many clients. Nodes 102A-102C are also shown to include a matrix membership service 304A-304C. The matrix membership service (MMS) monitors the membership list of each lock domain 302A-302D, and communicates lock domain membership changes to each of the DLMs 300A-300C. The domain membership list is a list of all nodes, servers in this example, participating in lock negotiation in a given lock domain. This list is maintained and updated by the matrix membership service 304A-304C.
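The per-node organization described above can be pictured with a few simple data structures. The following is a minimal sketch, not an implementation from the patent: all class and field names are illustrative assumptions, and the lock modes are reduced to the read/write example used elsewhere in this description. Each node runs one DLM instance, which holds one lock domain per shared resource it has mounted, and each lock domain carries a primary and a recovery lock space, the current domain membership list, and a generation number.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Lock:
    name: str                                       # resource identifier the lock covers
    mode: str                                       # "write" (exclusive) or "read" (shared)
    holders: Set[str] = field(default_factory=set)  # node ids currently holding the lock
    cached: Set[str] = field(default_factory=set)   # node ids caching a sufficient reference


@dataclass
class LockSpace:
    locks: Dict[str, Lock] = field(default_factory=dict)


@dataclass
class LockDomain:
    resource: str                    # the shared resource (e.g. one shared filesystem)
    membership: List[str]            # domain membership list maintained by the membership service
    generation: int = 0              # lock domain generation number
    primary: LockSpace = field(default_factory=LockSpace)
    recovery: LockSpace = field(default_factory=LockSpace)


@dataclass
class DistributedLockManager:
    node: str
    domains: Dict[str, LockDomain] = field(default_factory=dict)  # one domain per mounted resource
```

In terms of FIG. 3, a DLM instance on node 102A would hold four such domains (one per shared storage 106A-106D), while the instance on node 102B would hold only the three domains for the storage it has mounted.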
[0033] According to an embodiment of the present invention, the nodes of a cluster use a locking mechanism, implemented by a distributed lock manager (DLM), to coordinate access and updates to shared data, e.g., shared file system data residing in a storage subsystem accessible to each of the nodes in the cluster. Nodes obtain locks covering the shared data that they wish to use or modify, and this ensures that other nodes will not modify such data simultaneously or incoherently. Each node of the cluster is thus able to always see a consistent view of the shared data structures even in the presence of updates being performed by other nodes.

[0034] So long as the node membership of the cluster is unchanged, i.e., no nodes are added to or subtracted from the cluster, the locking operations required to provide the coherency of the shared data occur in an environment in which all lock state is known. The unexpected failure of one or more nodes, however, requires the DLM to perform lock recovery, as all locks held by the failed nodes must be released and awarded to other nodes requesting the locks that were held previously by the failed nodes. Depending upon the circumstances, the lock recovery operation may also need to be coordinated with the software or hardware components using the locking services provided by the DLM, to properly repair shared data state rendered incoherent by the presence of incomplete modifications against the shared data that were in progress at the time of the node failure(s).

[0035] A lock recovery may also be required when nodes are administratively removed from, or in some cases added to, the cluster, depending upon certain implementation choices of both the lock manager and the components that use the locking services provided by the DLM.

[0036] FIG. 4 is a flow diagram of a method according to an embodiment of the present invention for recovering locks. In this example, a change to the membership of a lock domain occurs (400). In the example shown in FIG. 3, a change occurs in the domain membership list maintained by the matrix membership service 304A-304C, which lists all the servers participating in lock negotiation in a given lock domain. For example, if node 102C is taken off line, either due to failure or on purpose, then there would be a change to the domain membership list of lock domains 302A, 302B, and 302C. The change to the domain membership list would exclude node 102C in this example. Similarly, the addition of a node to a cluster can be interpreted as a change in membership, when the new node begins to use the shared storage of the cluster.

[0037] A new domain membership list is generated (402). For example, for the new domain membership of lock domain 302A, the members would now include nodes 102A and 102B and would no longer include the failed node 102C of FIG. 3.

[0038] It is then determined whether any locks were held (404) prior to the generation of the new domain membership (402). If locks were held, then these locks are recovered and rebalanced (406). Further details of the recovery and rebalancing of the held locks will be discussed later in conjunction with the remaining figures.

[0039] If no locks were held, or once they have been recovered, it is determined whether there were queued lock requests (408) prior to the generation of the new domain membership (402). If there were queued lock requests, then these lock requests are recovered (410).

[0040] Thereafter, normal lock activities are resumed (416).

[0041] The same method that was used to originally acquire the locks can be used to reacquire the locks at the recovery stage. Accordingly, the same program logic used to acquire the locks can be used at the recovery stage. Methods for acquiring locks during normal DLM operation, when the node membership state of the containing lock domain is not changing, are well known in the art, and these acquisition methods can be used in conjunction with the present invention.
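The ordering of FIG. 4 can be summarized in a short sketch. This is illustrative only: the helper functions are stand-ins with assumed names for the detailed mechanisms of FIGS. 5A-5I described below, and the reference numerals from FIG. 4 are noted in the comments.

```python
def recover_and_rebalance_held_locks(domain):
    """Stand-in for the held-lock recovery and rebalancing detailed in FIGS. 5G-5H."""


def recover_queued_lock_requests(domain):
    """Stand-in for the queued-request recovery detailed in FIG. 5I."""


def resume_normal_lock_activity(domain):
    """Deliver any queued grants or denials and accept new lock traffic."""


def on_membership_change(domain, new_membership, had_held_locks, had_queued_requests):
    """Top-level recovery flow sketched from FIG. 4."""
    domain["membership"] = list(new_membership)    # (402) generate the new domain membership list
    if had_held_locks:                             # (404) were any locks held before the change?
        recover_and_rebalance_held_locks(domain)   # (406) recover and rebalance the held locks
    if had_queued_requests:                        # (408) were any lock requests queued?
        recover_queued_lock_requests(domain)       # (410) recover the queued lock requests
    resume_normal_lock_activity(domain)            # (416) resume normal lock activities
```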
[0042] FIGS. 5A-5I are additional flow diagrams of a method according to an embodiment of the present invention for recovering locks. In this example, assume that a change in membership occurs. For each lock domain affected by the change in membership, the method exemplified by FIGS. 5A-5I can be executed concurrently.

[0043] A new membership set for a specific lock domain is sent to the DLM of all nodes previously in the set as well as all nodes added to the set (500). For each such recipient node's DLM, it is then determined whether a lock recovery for the lock domain was already in progress (502). For example, a node may have been added to the new membership set during a recovery procedure. If a lock recovery for the lock domain was already in progress, then all lock recovery state for the interrupted lock recovery is deleted (504). It is then determined whether the recipient node is present in the new membership set for the lock domain (506). If, however, a lock recovery for the lock domain was not already in progress (502), then it is determined whether the recipient node is present in the new membership set for the lock domain (506). If the recipient node is not present in the new membership set, then the clients of the lock domain on the node of the recipient DLM are informed that they have been removed from the domain membership (508). The lock domain and all its lock states are then deleted (510).

[0044] If the recipient node is present in the new membership set for the lock domain (506), then it is determined whether the lock domain already exists for this DLM (512 of FIG. 5B). If the lock domain already exists for this DLM (512), then lock requests in the lock domain received by the DLM from the local clients after notification of the lock domain's membership change are merely queued for later completion during lock recovery (514). New lock domain membership generation and synchronization are then established for this lock domain between all DLMs in the membership and between each DLM and its clients (518).

[0045] If the lock domain does not already exist for this DLM (512), then the recipient node is being added to the lock domain; an empty lock domain state is created for the lock domain, and the lock domain generation number is set to zero (516). A new lock domain membership synchronization is established for this lock domain between all DLMs in the membership and between each DLM and its clients (518). Further details of this synchronization are discussed later in conjunction with FIG. 5F.

[0046] It is then determined whether the node was a member of the original membership set prior to the start of the process of establishing the new membership set (520 of FIG. 5C). If the node was a member of the original membership set, then its clients are informed to pause (temporarily suspend) the use of the lock domain (522). The DLM then waits for the clients to acknowledge the pause request (524). Any client acknowledgement messages which do not contain the latest lock domain generation number are ignored. The state of all locks in the primary lock space of the lock domain is then examined and all cached locks are discarded (526). Thereafter, the method continues to step 528.

[0047] If the node was not a member of the original membership (520), then the state of all locks in the primary lock space of the lock domain is examined and all cached locks are discarded (526).

[0048] Similarly, all locks in the primary lock space of the lock domain are discarded for which this node is the home node of the lock but does not hold the lock locked itself (528). Steps 526 and 528 can be performed in the reverse order of what is shown in this example, or they can be performed simultaneously.

[0049] All held locks for the lock domain are then recovered and rebalanced (530). This lock recovery and rebalancing may result in a lock being granted or denied. However, the clients are not yet informed about the granting/denial of the locks. Instead, the delivery of such information is queued for step 548.

[0050] When there is a change in the membership, the home node will generally change during the recovery phase, automatically rebalancing the locks. Balance, as used herein, denotes that the home node has approximately equal probability of being any particular node.

[0051] In one embodiment, the home node can be assigned for a particular lock at the time the lock is requested. For example, to identify the new lock home node for each recovered lock, a hash algorithm can be used to convert the existing lock name into an integer, modulo the number of nodes in the new domain membership list. Thus the lock home node can always be determined based solely on the name of the lock. Since all DLMs have the same domain membership list, the same lock names, and the same hash algorithm to discover the lock home node, all DLMs will reach the same conclusion regarding the new lock home node for every lock.
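A minimal sketch of this hash-based home node assignment follows. The patent does not specify a particular hash function; SHA-1 is an illustrative choice here (Python's built-in hash() is avoided because it is randomized per process for strings, whereas every DLM must compute the same value). The function and variable names are assumptions for illustration only.

```python
import hashlib


def home_node(lock_name, membership):
    """Choose the lock home node from the lock name alone, modulo the size of
    the new domain membership list (see paragraph [0051])."""
    digest = hashlib.sha1(lock_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(membership)
    return membership[index]


# Every node evaluates the same function over the same (identically ordered)
# membership list, so all DLMs agree on the new home node of each recovered
# lock, and the home nodes are spread roughly evenly across the membership.
membership = ["node-102A", "node-102B"]
print(home_node("filesystem-106A:inode:4711", membership))
```

Because the computation depends only on the lock name and the new membership list, recomputing it after a membership change is what rebalances the lock homes automatically, as noted in paragraph [0050].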
[0052] Further details of the recovery and rebalancing of locks are discussed later in conjunction with FIGS. 5G-5H.

[0053] All queued lock requests for the lock domain are then recovered (532). As with step 530, this recovery may result in a lock being granted or denied. Again, the clients are not yet informed about the granting of the locks, and the delivery of such information is queued for step 548. Further details of this recovery will be discussed later in conjunction with FIG. 5I.

[0054] The recovery lock space is now made the primary lock space, and all lock state in the old primary lock space is discarded (534). It is then determined whether this node is the recovery coordinator node (536). Further details of the recovery coordinator will be discussed later in conjunction with FIG. 5F. If it is not the recovery coordinator, then the recovery coordinator node is informed that this node has requeued all previously queued requests (538). Clients are then informed of the new membership set for the lock domain (542 of FIG. 5E).

[0055] If this node is the recovery coordinator node (536), then it waits for all other nodes in the membership to report that they have requeued their previously queued lock requests for the lock domain (540). Clients are then informed of the new membership set for the lock domain (542 of FIG. 5E). It is then determined whether to do a client recovery (544). If a client recovery should be performed, the node informs its clients to begin recovery, and it waits for all clients to complete their recovery (546). Details of the client recovery are client specific, and it is up to each client to implement its recovery.

[0056] Normal operation of the lock domain is then resumed and the clients are informed accordingly (548). If no client recovery is necessary, then normal operation of the lock domain is likewise resumed and the clients are informed accordingly (548). Determination of when a client recovery is necessary is client specific. However, client recovery is generally required when an unexpected membership change occurs, such as a node crashing.

[0057] Normal operation includes notifying the clients of any lock grants or denials, or any other queued lock state updates that might have occurred during lock recovery.

[0058] FIG. 5F is a flow diagram illustrating step 518 of FIG. 5D for establishing the new lock domain membership synchronization between the DLMs in the membership and between each DLM and its clients. In this example, a recovery coordinator node is selected among the new membership (550). There are many ways to select a recovery coordinator node. Examples include selecting the first node in the membership set, selecting the last node in the membership set, or selecting the node with the lowest or highest network address among the member nodes. If there is only one node in the new membership set, that one node is the recovery coordinator node.

[0059] It is then determined whether this node is the recovery coordinator (552). If this node is not the recovery coordinator, then this node's concept of the generation number is sent to the recovery coordinator node, and this node awaits assignment of the latest generation number from the coordinator (554).

[0060] If this node is the recovery coordinator (552), then it is determined whether there are multiple nodes present in the new membership (556). If there is only one node, then the generation number is incremented (558). If the lock domain has just been created, then the generation number is now 1.

[0061] If there are multiple nodes present in the new membership (556), then the recovery coordinator receives the other nodes' reports of their concept of the generation number (560). The highest generation number plus 1 is then made the new lock domain generation number (562). The new generation number is then sent to the other nodes in the membership (564). This new generation number is preferably included in all subsequent lock message traffic for the lock domain. In this manner, stale, out-of-date lock traffic messages pertaining to previous lock domain membership sets can be detected and ignored.
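The generation-number agreement of FIG. 5F, and its use for filtering stale traffic, can be sketched as follows. This is an assumed illustration of the coordinator-side arithmetic only; the messaging by which the reports and the new number travel between nodes is left out.

```python
def new_generation(reported_generations):
    """Coordinator side of FIG. 5F: collect every member's concept of the
    current lock domain generation (560) and publish the highest value
    plus one as the new lock domain generation number (562)."""
    return max(reported_generations) + 1


def is_stale(message_generation, current_generation):
    """Lock traffic that does not carry the latest generation number pertains
    to a previous membership set and is ignored (see paragraph [0061])."""
    return message_generation != current_generation


# Example: three surviving members report 4, 4 and 3 (one node lagged behind).
# The coordinator announces generation 5, and any in-flight lock message still
# tagged with an older generation is discarded.
print(new_generation([4, 4, 3]))   # -> 5
print(is_stale(4, 5))              # -> True
```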
[0062] FIGS. 5G-5H are a flow diagram illustrating step 530 of FIG. 5D for recovering and rebalancing all held locks for the lock domain. In this example, M is the lock mode hierarchy level. Assume 1 is the most exclusive mutual exclusion level, 2 is the second most exclusive level, and so on. For example, if there are only two exclusion levels, a read (shared) lock and a write (exclusive) lock, then M=1 would imply a write lock.

[0063] The example shown in FIG. 5G begins with setting M=1 (570). It is then determined whether this node is the recovery coordinator node (572). If this node is not the recovery coordinator node, then this node awaits the recovery coordinator node's command to recover locks held in mode M (574). The locks are then recovered in mode M (578).

[0064] If this node is the recovery coordinator node (572), then a message is sent to the other nodes in the membership to commence recovering locks held in mode M (576). The locks are then recovered in mode M (578). This recovery is accomplished using the same mechanisms and algorithms used to acquire locks when the lock membership is unchanging (i.e., when no lock recovery is occurring).

[0065] It is then determined whether this node is the recovery coordinator node (580 of FIG. 5H). If this node is not the recovery coordinator node, then it informs the recovery coordinator node that this node has recovered its locks in mode M (582).

[0066] If this node is the recovery coordinator node (580), then it waits for all other nodes in the membership to report that they have recovered all of their locks which they hold in mode M (584). It is then determined whether this is the last M (586). If this is not the last M, then M is set to M+1 (588), and the method shown in FIG. 5G is executed for the new M. If this is the last M (586), this portion of the lock recovery is finished.
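The coordinator-driven, mode-ordered loop of FIGS. 5G-5H can be sketched as below. The sketch assumes only the two-level read/write hierarchy used in the example above; the `peers` object and the function names are illustrative stand-ins for the cluster messaging layer and the normal (non-recovery) acquisition path, neither of which is specified here.

```python
def reacquire(lock_name, mode):
    """Stand-in for the ordinary lock acquisition path reused during recovery
    (see paragraph [0064])."""


def recover_held_locks(is_coordinator, my_locks_by_mode, peers, modes=("write", "read")):
    """Mode-ordered recovery of FIGS. 5G-5H: most exclusive level first (M = 1, 2, ...),
    with every node finishing level M before any node starts level M+1."""
    for mode in modes:                                       # M = 1 is the most exclusive level
        if is_coordinator:
            peers.broadcast(("recover", mode))               # (576) command members to recover mode-M locks
        else:
            peers.wait_for(("recover", mode))                # (574) wait for the coordinator's command

        for lock_name in my_locks_by_mode.get(mode, []):     # (578) reacquire each lock this node
            reacquire(lock_name, mode)                       #       held in mode M

        if is_coordinator:
            peers.wait_for_all(("recovered", mode))          # (584) wait until every member reports
        else:
            peers.send_to_coordinator(("recovered", mode))   # (582) report completion for mode M
```

Running the levels strictly from most to least exclusive is what protects against the racing upgrade-or-unlock scenario discussed for FIG. 6 below.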
[0067] FIG. 5I is a flow diagram exemplifying step 532 of FIG. 5D for recovering queued lock requests for the lock domain. In this example, it is determined whether this node is the recovery coordinator node (594). If it is not the recovery coordinator, then it awaits the recovery coordinator node's command to requeue lock requests queued when the old lock domain membership was still in effect (592). The lock requests queued when the old lock domain membership was still in effect are then requeued (597).

[0068] If this node is the recovery coordinator node (594), then it sends a message to the other nodes in the membership to requeue the lock requests queued when the old lock domain membership was still in effect (596). The lock requests queued when the old lock domain membership was still in effect are then requeued (597). This requeueing can be accomplished using the same mechanisms and algorithms used to acquire and queue for locks when the lock membership is unchanging and no lock recovery is occurring.

[0069] FIG. 6 offers further clarification of the lock recovery algorithm. FIG. 6 shows flow diagrams for recovery of locks in each node according to an embodiment of the present invention. For each node on the membership list, the following steps occur (600A-600C) in this embodiment. There may be variations; for example, fewer than all of the nodes on the membership list may perform these steps.

[0070] The locks with the highest level of mutual exclusion, such as write locks, are recovered in the alternate lock space (602A-602C). Write locks are a higher level of mutual exclusion than read locks in this example because, in simple usage, more than one node can read a file but only one node is preferably allowed to write to a file at a given time. Finer grained locking is possible, as is the support of lock mode hierarchies with more states than just read (shared) and write (exclusive). An example of a finer grained lock mode hierarchy can be found in Private Locking and Distributed Cache Management by David Lomet, DEC Cambridge Research Lab, Proceedings of the Third International Conference on Parallel and Distributed Information Systems (PDIS 94), Austin, Tex., Sep. 28-30, 1994.

[0071] Locks with the next highest level of mutual exclusion, such as read locks, are then recovered in the alternate lock space (604A-604C). It is preferred in this embodiment to recover the write locks prior to the read locks. If there are further levels of mutual exclusion, then the recovery of those locks would take place from the highest level to the lowest level of mutual exclusion. Otherwise, lock state recovery can fail if the DLM lock recovery occurred precisely at the time a set of racing "upgrade-from-read-lock-to-write-lock-or-unlock lock X" operations were in progress.

[0072] Consider that in the case of an atomic lock-upgrade-or-unlock operation, a lock may have been the target of a simultaneous upgrade attempt by all of the clients that held the lock in read (shared) mode. At the instant the lock domain membership changes, the target lock may have been upgraded (in this example) to write (exclusive) mode. All other racing lock upgrade attempts by all of the other nodes that held the lock in read (shared) mode fail when the recovery completes, with the lock unlocked by all of the other nodes which lost the race.

[0073] FIG. 7 shows an example of lock spaces used in recovery. In this example, each node contains at least two lock spaces: a primary lock space 700A-700B and a recovery lock space 702A-702B. In this example, a first lock, held in write mode, is recovered into the recovery lock space 702A. Then, a second lock, held in read mode, is recovered into the recovery lock space 702A. Likewise, a third lock, queued for write mode, is recovered into the recovery lock space 702B, and a fourth lock, queued for read mode, is recovered into the recovery lock space 702B.
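The two-lock-space arrangement of FIG. 7, and the promotion of the recovery space to primary, can be sketched as below. This is an assumed illustration (class and method names are not from the patent): recovered state is rebuilt in the recovery space, write-mode locks before read-mode locks, and both spaces are instances of the same data structure so that promotion is just an exchange of references, as the following paragraphs explain.

```python
class LockDomainSpaces:
    """Primary and recovery lock spaces of one lock domain, as in FIG. 7."""

    def __init__(self):
        self.primary = {}    # lock name -> mode; the lock space in active use
        self.recovery = {}   # lock space into which recovered lock state is rebuilt

    def recover_lock(self, name, mode):
        # Recover a lock into the recovery (alternate) lock space; callers
        # recover write-mode locks first, then read-mode locks.
        self.recovery[name] = mode

    def finish_recovery(self):
        # Make the recovery lock space the primary one and discard the old
        # primary state (534). Because both spaces share the same structure,
        # the domain only exchanges which one it treats as current.
        self.primary, self.recovery = self.recovery, {}


# Example mirroring FIG. 7: a write-mode lock is recovered first, then a read-mode lock.
spaces = LockDomainSpaces()
spaces.recover_lock("lock-1", "write")
spaces.recover_lock("lock-2", "read")
spaces.finish_recovery()
print(spaces.primary)    # {'lock-1': 'write', 'lock-2': 'read'}
```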
[0074] Referring now to steps 542-544 of FIG. 5E, recovery can be performed quickly because there is no need to refer to a separate file (journal) for the node to be aware of which locks it had before the crash, as the state necessary to recover the correct lock state after a membership change is contained within each node's DLM. Additionally, each node recovers its own locks approximately simultaneously with the other nodes, and thus the recovery process is fast. A further advantage of the method according to the present invention is that the rebalancing of locks occurs automatically when the new membership is set. The home node is calculated as though it were being calculated for the first time, since this method can be used whether the membership is newly formed or whether there is a change in the membership at some later time. Accordingly, when the home node is recalculated, rebalancing automatically occurs.

[0075] It is then determined whether all nodes have replayed their queued lock requests (540). Once all the nodes have replayed their queued lock requests, the recovery lock space is made into the primary lock space and the states in the old primary lock space are destroyed (534). An example of how to turn the recovery lock space into the primary lock space is to simply implement the recovery and primary lock spaces as instances of the same set of data structures, albeit containing different lock state information. The lock domain's concept of the current active lock space is then switched by simply exchanging the primary and recovery pointers.

[0076] There are several advantages to the present invention. For example, the present invention supports fast automated lock recovery from any type of node or shared resource failure, such as single, multiple concurrent, or sequenced server or storage device failures. It supports on-line insertion of servers into the shared storage system. It also supports on-line insertion of shared storage devices, shared filesystems, and other components and services requiring DLM-coordinated access. It provides automated, on-line recovery of coordinated services (e.g., shared filesystems) after software or hardware failure, without interruption to the applications or coordinated services executing on the rest of the servers. It supports load-balancing of DLM lock service evenly across the nodes of the matrix. Additionally, it supports any number of servers, limited only by the latency and bandwidth of the selected inter-server interconnect, the servers' memory capacity, and the servers' instruction-cycle frequency. It is not dependent on the type of network or storage interconnects. The method works across a matrix of homogeneous as well as heterogeneous operating systems and servers.

[0077] Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. It should be noted that there are many alternative ways of implementing both the process and apparatus of the present invention. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
