How Recovery/Replay of POSIX locks are done in GlusterFS?
=========================================================

Note:- Recovery and Replay are the two words used interchangeably throughout
this document.

How to enable lock replay?
--------------------------

To enable recovery of locks, use the following volume set option: lock-heal

Useful volume set option in case of lock recovery: grace-timeout

General Overview
----------------

libglusterfs [fd-lk.c] maintains the complete list of locks per fd, which is
populated and appended to on every successful lock request processed by the
locks translator residing on the server stack. As a result, both libglusterfs
and the locks translator hold similar in-memory lock state. When an fd is
opened, the client adds it to its open_fd_list, and each fd points to the
corresponding lock_list maintained by libglusterfs so that the client can keep
track of all lock requests for that particular fd. If the client disconnects
with some locks held and reconnects again, it traverses the open_fd_list,
searches for the previous lock_list and tries to reacquire those locks by
replaying them after reopening the file.

How does it work?
-----------------

Initial handshake:

The overall logic for lock recovery in GlusterFS is implemented with the help
of two concepts, namely lk-version and grace-timeout. The client initializes
its lk-version to 1 and the server to 0, indicating a fresh handshake. During a
fresh connect, as a response to client_setvolume, the server sends its
lk-version [which is 0] inside a dictionary to the client side, which checks
whether both lk-versions match [inside client_setvolume_cbk]. If they do not
match, the client delays CHILD_UP until all open fds are reopened and locks are
reacquired as part of client_post_handshake, and then sets the server
lk-version to 1 via client_set_lk_version.

What happens on the client side after a disconnect?

As soon as the client disconnects from the server, it enters the pre-defined
[or re-configured through the volume set command] grace-timeout period from
client_rpc_notify and waits for re-connection. If the server comes back within
this period, the grace-timer is cancelled and everything is fine. If
reconnection does not happen within this interval, the client increases its
lk-version by 1. A new timer cannot be registered for every
RPC_CLNT_DISCONNECT unless and until RPC_CLNT_CONNECT is received. If the
server reconnects afterwards, an lk-version mismatch occurs and the client
will try to reopen the fds and reacquire the locks, because by that time the
server would have flushed all the locks from its side.

What happens on the server side after a disconnect?

If the connection to the client is re-established within the grace timeout,
everything is fine as before. Failure to reconnect triggers grace_time_handler
from server_rpc_notify to clean up the connection [server_connection_cleanup],
flushing all the locks and eventually ending in the destruction of the client
connection. Afterwards the server is fresh again and can reconnect to the
client with lk-version 0.
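
The following is a minimal, self-contained C sketch of the lk-version
bookkeeping described above. It is not the actual protocol/client or
protocol/server code; the structures and helpers here (client_conf,
server_conf, handle_setvolume_reply, grace_timeout_expired) are hypothetical
simplifications of what client_setvolume_cbk, client_post_handshake,
client_rpc_notify and grace_time_handler do.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct client_conf {
    uint32_t lk_version;      /* starts at 1 on the client */
    bool     grace_timer_on;  /* set on RPC_CLNT_DISCONNECT */
};

struct server_conf {
    uint32_t lk_version;      /* starts at 0 on the server */
};

/* Client side: the setvolume reply carries the server's lk-version. */
static void
handle_setvolume_reply(struct client_conf *client, uint32_t server_lk_version)
{
    if (server_lk_version != client->lk_version) {
        /* Mismatch: delay CHILD_UP, reopen saved fds and replay locks
         * (client_post_handshake), then push our lk-version to the server
         * (client_set_lk_version). */
        printf("lk-version mismatch (%u vs %u): reopen fds, replay locks\n",
               (unsigned)server_lk_version, (unsigned)client->lk_version);
    } else {
        printf("lk-version matches (%u): nothing to replay\n",
               (unsigned)client->lk_version);
    }
}

/* Client side: the grace timer fired without a reconnect. */
static void
grace_timeout_expired(struct client_conf *client)
{
    client->grace_timer_on = false;
    client->lk_version++;   /* guarantees a mismatch on the next handshake */
}

int
main(void)
{
    struct client_conf client = { .lk_version = 1, .grace_timer_on = false };
    struct server_conf server = { .lk_version = 0 };

    /* Fresh connect: versions differ (1 vs 0), so the post-handshake path
     * runs (nothing to replay on a truly fresh mount) and the server is
     * brought up to the client's lk-version. */
    handle_setvolume_reply(&client, server.lk_version);
    server.lk_version = client.lk_version;

    /* Disconnect longer than grace-timeout: client bumps its lk-version,
     * server flushes its locks and falls back to 0. */
    client.grace_timer_on = true;
    grace_timeout_expired(&client);
    server.lk_version = 0;

    /* Reconnect: mismatch again (2 vs 0), so fds are reopened and locks
     * replayed. */
    handle_setvolume_reply(&client, server.lk_version);
    server.lk_version = client.lk_version;

    return 0;
}
```
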
Outstanding issues
------------------

[1] As per the current design, a server disconnection is followed by a connect
    to glusterd, which may cancel the registered grace_timer on the client
    side. If the client could connect to glusterd, it changes the port to
    49152 and uses the same info to try to connect to glusterfsd. If the brick
    process stays offline for a long time, it registers the timer again.
    Because of this behaviour the client lk-version goes on increasing after
    each timeout.

    One solution is to cancel this grace-timer only for a connect
    [RPC_CLNT_CONNECT] to a brick process rather than for a generic connect. A
    client can maintain a connection to either glusterd or glusterfsd at any
    given point of time.

[2] Currently the grace-timeout and lock-heal volume set options are common to
    both client and server, which implies that both sides are updated when
    these volume set operations are done. Do we really need this behaviour? We
    have http://review.gluster.org/#/c/10262/ to address this issue.

[3] As of now the client does not clean up the lock state after grace-timeout
    has happened and is thus capable of replaying the locks whenever the
    server comes back.

_____________________________________________________________________________

Discussion on 27/08/2015 (Attendees: Vijay Bellur, Raghavendra Gowdappa and Anoop C S)
======================================================================================

Solutions for outstanding issues
--------------------------------

[1] Distinguish between a connect to glusterd and a connect to glusterfsd in
    client_rpc_notify() and cancel/reset the active/registered grace_timer
    only on a connect to glusterfsd.

[2] Both client and server are tied together, so we do not need to separate
    these options.

[3] Instead of the lock state being cleaned up by each protocol client, the
    following approach is more acceptable:

    * Mark the fd as bad on grace-timeout to prevent further fops.
    * Clean-up should be done by higher translators.
    * On grace-timeout the client notifies higher translators.
    * On receiving this event, decide whether to clean up or not based on
      server-quorum/no. of subvols.

New issues found
----------------

[1] When the disconnect event is received asymmetrically on the client and
    server side, lock-state clean-up on the server may happen before
    grace-timeout occurs on the client side. At that point of time, if a new
    client tries to connect to the same server for which the previous client
    had acquired some locks on a file, it can acquire conflicting locks,
    perform writes and unlock the same regardless of the previous lock state.

Discussion on 11/09/2015 (Attendees: Vijay Bellur, Raghavendra Gowdappa, Pranith Kumar and Anoop C S)
=====================================================================================================

* The client must not replay the locks.
* Remove the client_post_handshake bits from protocol/client.
* General algorithm (see the sketch after this list):

+-----------------------+-------------------------+------------------------+
|                       |                         |                        |
|    Event/Protocol     |         Client          |         Server         |
|                       |                         |                        |
+-----------------------+-------------------------+------------------------+
|                       |                         |                        |
|     Grace timeout     | * Increments lk-version | * Clean-up the locks   |
|                       | * Mark fd bad           |                        |
+-----------------------+-------------------------+------------------------+
|                       |                         |                        |
|                       | * Mark fd bad           | * In server_setvolume, |
|  lk-version mismatch  | * Notify CHILD_UP with  |   clean-up the locks   |
|                       |   a note to replay      |                        |
+-----------------------+-------------------------+------------------------+

* Refer to NLM or NFSv4 logic for client lock recovery:

  https://docs.oracle.com/cd/E19120-01/open.solaris/819-1634/rfsrefer-138/index.html [NFSv4 client lock recovery]
  http://ps-2.kev009.com/tl/techlib/manuals/adoclib/aixbman/commadmn/nfsnetlo.htm [NLM]
  http://docs.cray.com/books/S-2341-22/html-S-2341-22/le23277-parent.html [NFS & NLM]
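
As a rough illustration of the table above, here is a small, self-contained C
sketch of what the two end-points would do on each event. It is not the
GlusterFS implementation; the context structures and handler names
(client_fd_ctx, server_client_ctx, client_handle_event, server_handle_event,
client_submit_fop) are hypothetical, and the EBADF returned for a bad fd is an
assumption about how the failure would surface.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum lk_event { EVENT_GRACE_TIMEOUT, EVENT_LK_VERSION_MISMATCH };

struct client_fd_ctx {
    uint32_t lk_version;
    bool     fd_bad;         /* once set, further fops on this fd fail */
    bool     child_up_noted; /* CHILD_UP sent upwards with a "replay" note */
};

struct server_client_ctx {
    int lock_count;          /* locks held on behalf of this client */
};

static void
client_handle_event(struct client_fd_ctx *ctx, enum lk_event ev)
{
    switch (ev) {
    case EVENT_GRACE_TIMEOUT:
        ctx->lk_version++;          /* row 1, client column */
        ctx->fd_bad = true;
        break;
    case EVENT_LK_VERSION_MISMATCH:
        ctx->fd_bad = true;         /* row 2, client column */
        ctx->child_up_noted = true; /* notify CHILD_UP with a note to replay */
        break;
    }
}

static void
server_handle_event(struct server_client_ctx *ctx, enum lk_event ev)
{
    /* In both rows the server side simply drops the lock state
     * (on grace timeout, or inside server_setvolume on a mismatch). */
    (void)ev;
    ctx->lock_count = 0;
}

/* Any fop issued on a bad fd is refused before it ever reaches the wire. */
static int
client_submit_fop(const struct client_fd_ctx *ctx)
{
    return ctx->fd_bad ? -EBADF : 0;
}

int
main(void)
{
    struct client_fd_ctx     cfd = { .lk_version = 1 };
    struct server_client_ctx scl = { .lock_count = 3 };

    client_handle_event(&cfd, EVENT_GRACE_TIMEOUT);
    server_handle_event(&scl, EVENT_GRACE_TIMEOUT);

    printf("client: lk-version=%u fd_bad=%d, server locks=%d, fop ret=%d\n",
           (unsigned)cfd.lk_version, (int)cfd.fd_bad, scl.lock_count,
           client_submit_fop(&cfd));
    return 0;
}
```
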
Discussion on 12/10/2015 (Attendees: Vijay Bellur, Raghavendra Gowdappa, Pranith Kumar and Anoop C S)
=====================================================================================================

* NFS lock recovery is based on grace_timeout.
* In order to avoid asymmetric connect/disconnect events being received on
  both sides, we get rid of the lk-version and grace_timer logic from both
  sides (client and server).
* The server will clean up the locks as soon as a disconnect is received, and
  the client will mark the fd as bad.
* Anoop to send a patch removing the existing logic from both end-points,
  along with a doc explaining the various scenarios.

Upstream patch link: http://review.gluster.org/#/c/12363/

Uploaded     : 12 Aug, 2015
Last Updated : 12 Oct, 2015
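
With the approach agreed above (no lock replay, server flushes locks on
disconnect, client marks the fd bad), lock recovery effectively becomes the
application's responsibility, much as with NLM/NFS. The sketch below shows,
with plain POSIX fcntl() locks, how an application might reopen and re-lock
after such a failure; the mount path is hypothetical and the exact errno
(EBADF here) seen on a marked-bad fd is an assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Take a blocking write lock on the whole file. */
static int
acquire_write_lock(int fd)
{
    struct flock fl = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,   /* 0 = lock the whole file */
    };
    return fcntl(fd, F_SETLKW, &fl);
}

int
main(void)
{
    const char *path = "/mnt/glustervol/app.lock"; /* hypothetical mount */
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0 || acquire_write_lock(fd) < 0) {
        perror("initial open/lock");
        return 1;
    }

    /* ... normal I/O; if the brick connection is lost and re-established,
     * subsequent calls on this fd may start failing ... */
    if (write(fd, "x", 1) < 0 && errno == EBADF) {
        /* The server has already flushed the lock, so recover explicitly:
         * reopen the file and reacquire the lock. */
        close(fd);
        fd = open(path, O_RDWR);
        if (fd < 0 || acquire_write_lock(fd) < 0) {
            perror("re-open/re-lock");
            return 1;
        }
        fprintf(stderr, "recovered: file reopened and lock reacquired\n");
    }

    close(fd);
    return 0;
}
```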