CSYNC User Guide
================
Andreas Schneider <mail@cynapses.org>
:Author Initials: ADS

csync is a bidirectional file synchronizer for Linux that keeps two copies of
files and directories in sync. It uses widely adopted protocols like smb or
sftp, so there is no need for a csync server component. It is a user-level
program, which means you don't need to be a superuser.

Introduction
------------

It is often the case that we have multiple copies (called replicas) of a
filesystem or part of a filesystem (for example on a notebook and on a desktop
computer). Changes to each replica are often made independently and as a
result they do not contain the same information. In that case a file
synchronizer is used to make them consistent again, without losing any
information.

The goal is to detect conflicting <<X13, updates>> (files which have been
modified) and propagate non-conflicting updates to each replica. If there
are no conflicts left we are done and the replicas are identical.

Basics
------

This section describes some basics you might need to understand how file
synchronization works.

Paths
~~~~~
A path normally refers to a point with a set of files which should be
synchronized. It is specified relative to the root of the replica. The path is
just a sequence of names separated by '/'.

NOTE: The path separator is always a forward slash '/', even for Windows.

csync always uses absolute paths. This could be '/home/gladiac' or,
for sftp, 'sftp://gladiac:secret@myserver/home/gladiac'.


[[X13]]
What is an update?
~~~~~~~~~~~~~~~~~~
The contents of a path could be a file, a directory or a symbolic link
(symbolic links are not supported yet). To be more precise, if the path refers
to:

- a regular file, then the contents of the file are the byte stream and the
  metadata of the file.
- a directory, then the content is the metadata of the directory.
- a symbolic link, then the content is the string the link points to.

csync keeps a record of each path which has been successfully synchronized. The
path gets compared with the record and if it has changed since the last
synchronization, we have an update. This is done by comparing the modification
time or the change time (the modification time of the metadata).

What is a conflict?
~~~~~~~~~~~~~~~~~~~
A path is conflicting if it fulfills the following conditions:

1. it has been updated in one replica,
2. it or any of its descendants has been updated in the other replica too, and
3. its contents in the two replicas are not identical.
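
The three conditions can be read as a single predicate. The following is only
a minimal sketch, not csync's implementation: the function names and the
representation of each replica's updates as a set of paths are assumptions
made for illustration.

```python
from typing import Any, Set


def is_descendant_or_self(candidate: str, path: str) -> bool:
    """True if candidate equals path or lies below it ('/' is the separator)."""
    return candidate == path or candidate.startswith(path.rstrip("/") + "/")


def is_conflicting(path: str,
                   updated_a: Set[str], updated_b: Set[str],
                   content_a: Any, content_b: Any) -> bool:
    """Apply the three conditions to one path (hypothetical helper).

    updated_a / updated_b: paths detected as updated in each replica.
    content_a / content_b: the contents of the path in each replica.
    """
    # 1. the path has been updated in one replica
    touched_in_a = path in updated_a
    # 2. the path or one of its descendants has been updated in the other
    touched_in_b = any(is_descendant_or_self(p, path) for p in updated_b)
    # 3. the contents are not identical
    return touched_in_a and touched_in_b and content_a != content_b
```

Note that a path changed to the same contents in both replicas does not count
as a conflict, because the third condition fails.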

File Synchronization
--------------------

The main goal of a file synchronizer is correctness. It changes all or
separate parts of a user's file system, and the user is not able to monitor the
complete file synchronization process. The synchronizer is therefore in a
position where it can damage the file system. It is important that the
implementation behaves correctly under all conditions, even if there is an
unexpected error (for example, a full disk).

One problem concerning correctness is the handling of conflicts. Each file
synchronizer tries to propagate conflicting changes to the other replica. At
the end both replicas should be identical. There are different strategies to
fulfill these goals.

csync is a 3-phase file synchronizer. This design was chosen so that user
interaction is possible and the process is easy to understand. The 3 phases
are update detection, reconciliation and propagation. These will be described
in the following sections.

Update detection
~~~~~~~~~~~~~~~~
There are different strategies to do update detection. csync uses a state-based
modtime-inode update detector. This means it uses the modification time to
detect updates. It doesn't require many resources. A record of each file is
stored in a database (called statedb) and compared with the current
modification time during update detection. If the file has changed since the
last synchronization, an instruction is set to evaluate it during the
reconciliation phase. If we don't have a record for a file we investigate, it is
marked as new.
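
The comparison against the statedb can be sketched roughly as follows. This is
an illustration only: the dictionary standing in for the statedb, the function
name and the returned labels are assumptions, and treating any difference in
modification time as a change is also an assumption.

```python
from typing import Dict


def classify(path: str, current_mtime: int, statedb: Dict[str, int]) -> str:
    """Classify a path against its recorded modification time.

    statedb is a stand-in for the state database: it maps each path that was
    synchronized before to the modification time recorded at that point.
    """
    recorded_mtime = statedb.get(path)
    if recorded_mtime is None:
        # No record for this path: mark it as new.
        return "new"
    if current_mtime != recorded_mtime:
        # Changed since the last synchronization: evaluate during reconciliation.
        return "evaluate"
    return "unchanged"
```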

Detecting renamed files is a problem. It is also solved by the record we
store in the statedb. If we don't find the file by name in the database,
we search for its inode number. If the inode number is found, the file has
been renamed.
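
The name-then-inode lookup might look like the sketch below. The dictionary
that maps recorded paths to inode numbers and the result labels are
hypothetical; the real statedb layout is not described here.

```python
from typing import Dict, Optional, Tuple


def lookup(path: str, inode: int,
           statedb: Dict[str, int]) -> Tuple[str, Optional[str]]:
    """Look a file up by name first, then by inode number.

    statedb maps each recorded path to the inode number stored with it.
    Returns a label and, where applicable, the path found in the record.
    """
    if path in statedb:
        return ("known", path)
    # Name not found: search for the inode number instead.
    for old_path, old_inode in statedb.items():
        if old_inode == inode:
            return ("renamed", old_path)
    return ("new", None)
```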

Reconciliation
~~~~~~~~~~~~~~
The most important component is the update detector, because the reconciler
depends on it. The correctness of the reconciler is mandatory because it can
damage a filesystem. It decides which file:

* stays untouched
* has a conflict
* gets synchronized
* or gets *deleted*

A wrong decision of the reconciler leads in most cases to a loss of data. So
there are several conditions the file synchronizer has to follow.

Algorithms
^^^^^^^^^^

For conflict resolution several different algorithms could be implemented. The
most common algorithms are the merge and the conflict algorithm. The first
is a batch algorithm and the second is one which needs user interaction.

Merge algorithm
+++++++++++++++

The merge algorithm is an algorithm which doesn't need any user interaction. It
is simple and used for example by Microsoft for Roaming Profiles. If it detects
a conflict (the same file changed on both replicas) then it will use the most
recent file and overwrite the other. This means you can lose some data, but
normally you want the latest file.
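
The "most recent file wins" rule comes down to a comparison of modification
times. The sketch below is illustrative only; the function name is made up and
the handling of equal timestamps (replica "a" wins) is an assumption, since the
text does not say what happens in that case.

```python
def pick_winner(mtime_a: float, mtime_b: float) -> str:
    """Return which replica's copy is kept when both sides changed a file.

    The copy with the more recent modification time wins and overwrites the
    other. On a tie this sketch arbitrarily keeps replica "a".
    """
    return "a" if mtime_a >= mtime_b else "b"
```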

Conflict algorithm
++++++++++++++++++

This is not implemented yet.

If a file has a conflict the user has to decide which file should be used.

Propagation
~~~~~~~~~~~

The next component of the file synchronizer is the propagator. It applies the
calculated records to the current replica.


The propagator uses a 2-phase-commit mechanism to simulate an atomic filesystem
operation.

In the first phase we copy the file to a temporary file on the opposite
replica. This has the advantage that we can check whether the file which has
been copied to the opposite replica has been transferred successfully. If the
connection gets interrupted during the transfer we still have the original
state of the file. This means no data will be lost.
In the second phase the file on the opposite replica will be overwritten by
the temporary file.
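
For two local directories, the two phases could be sketched as below. This is
not csync's code: it assumes both replicas are local paths, the temporary file
prefix is invented, and `os.replace` stands in for whatever overwrite
operation the remote protocol offers.

```python
import os
import shutil
import tempfile


def propagate(src: str, dst: str) -> None:
    """Copy src over dst using a temporary file next to dst."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Phase 1: copy into a temporary file on the opposite replica.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".csync_tmp_")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        # Check the transfer before touching the existing file.
        if os.path.getsize(tmp) != os.path.getsize(src):
            raise OSError("transfer incomplete: size mismatch")
        # Phase 2: overwrite the file with the temporary file.
        os.replace(tmp, dst)
    except BaseException:
        # On any failure the original dst is still untouched.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

If the copy in phase 1 fails, only the temporary file is affected and the
destination keeps its original state.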

After a successful propagation we have to merge the trees to reflect the
current state of the filesystem tree. This updated tree will be written as a
journal into a database. The database is called the state database. It will be
used during the update detection of the next synchronization. See above.

Robustness
~~~~~~~~~~

This is a really important topic. The file synchronizer should not crash, and
if it does crash, there should be no loss of data. To achieve this goal there
are several mechanisms. These mechanisms will be discussed in the following
sections.

Crash resistance
^^^^^^^^^^^^^^^^

The synchronization process can be interrupted by different events:

* the system could be halted due to errors.
* the disk could be full or the quota exceeded.
* the network or power cable could be pulled out.
* the user could force a stop of the synchronization process.
* different communication errors could occur.

To ensure that no data is lost when such an event occurs, we enforce the
following invariant:

IMPORTANT: At every moment of the synchronization each file has either its
original content or its correct final content.

So each interrupted synchronization process is a partial sync and can be
continued and completed by simply running csync again. The only problem could
be an error of the filesystem. So we reach this invariant only approximately.

Transfer errors
^^^^^^^^^^^^^^^

With the Two-Phase-Commit we check the file size after the file has been
transferred. So we can detect transfer errors. Better would be a transfer
protocol with checksums. This could possibly be done in the future.

Future filesystems like btrfs will help to compare checksums instead of the
filesize. This will make the synchronization itself safer.
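
To illustrate why checksums are stronger than a size check, here is a small
sketch of a combined check on two local files. It does not describe an
existing csync feature; the function names and the choice of SHA-256 are
assumptions.

```python
import hashlib
import os


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transferred_intact(src: str, copy: str) -> bool:
    """Compare sizes first (cheap), then checksums (catches same-size damage)."""
    if os.path.getsize(src) != os.path.getsize(copy):
        return False
    return file_digest(src) == file_digest(copy)
```

A copy with the same size but different bytes passes the size check and is
only caught by the checksum comparison.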

Database loss
^^^^^^^^^^^^^

It is possible that the state database gets corrupted. If this happens
all files get evaluated. In this case the file synchronizer won't delete any
file, but it could occur that deleted files will be restored from the other
replica.
To prevent a corruption or loss of the database if an error occurs or the user
forces an abort, the synchronizer works on a copy of the database and will
use a 2-Phase-Commit to save it at the end.

Getting started
---------------

Installing csync
~~~~~~~~~~~~~~~~

See the `README` and `INSTALL` files for install prerequisites and
procedures. Packagers should take a look at <<X90, Appendix A: Packager Notes>>.

Using the commandline client
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The synopsis of the commandline client is

  csync [OPTION...] SOURCE DESTINATION

It synchronizes the content of SOURCE with DESTINATION and vice versa. The
DESTINATION can be a local directory or a remote file server.

  csync /home/csync scheme://user:password@server:port/full/path

The remote destination is supported by plugins. By default csync ships with smb
and sftp support. For more information, see the manpage of csync(1).

The PAM module
~~~~~~~~~~~~~~

pam_csync is a PAM module to provide roaming home directories for a user
session. This module is aimed at environments with central file servers where
a user wishes to store their home directory. The Authentication Module verifies
the identity of a user and triggers a synchronization with the server on the
first login and the last logout. More information can be found in the manpage
of the module pam_csync(8).


[[X90]]
Appendix A: Packager Notes
--------------------------

Read the `README`, `INSTALL` and `FAQ` files (in the distribution root
directory).