[Box Backup-commit] COMMIT r2478 - in box/trunk/docs/api-notes: . backup raidfile

boxbackup-dev at boxbackup.org boxbackup-dev at boxbackup.org
Sat Mar 28 15:51:04 GMT 2009


Author: chris
Date: 2009-03-28 15:51:04 +0000 (Sat, 28 Mar 2009)
New Revision: 2478

Added:
   box/trunk/docs/api-notes/INDEX.txt
   box/trunk/docs/api-notes/Win32_Clients.txt
   box/trunk/docs/api-notes/backup_encryption.txt
   box/trunk/docs/api-notes/bin_bbackupd.txt
   box/trunk/docs/api-notes/bin_bbstored.txt
   box/trunk/docs/api-notes/encrypt_rsync.txt
   box/trunk/docs/api-notes/lib_backupclient.txt
   box/trunk/docs/api-notes/lib_backupstore.txt
   box/trunk/docs/api-notes/raidfile/RaidFileRead.txt
   box/trunk/docs/api-notes/raidfile/RaidFileWrite.txt
   box/trunk/docs/api-notes/win32_build_on_cygwin_using_mingw.txt
   box/trunk/docs/api-notes/win32_build_on_linux_using_mingw.txt
   box/trunk/docs/api-notes/windows_porting.txt
Removed:
   box/trunk/docs/api-notes/backup/INDEX.txt
   box/trunk/docs/api-notes/backup/Win32_Clients.txt
   box/trunk/docs/api-notes/backup/backup_encryption.txt
   box/trunk/docs/api-notes/backup/bin_bbackupd.txt
   box/trunk/docs/api-notes/backup/bin_bbstored.txt
   box/trunk/docs/api-notes/backup/encrypt_rsync.txt
   box/trunk/docs/api-notes/backup/lib_backupclient.txt
   box/trunk/docs/api-notes/backup/lib_backupstore.txt
   box/trunk/docs/api-notes/backup/win32_build_on_cygwin_using_mingw.txt
   box/trunk/docs/api-notes/backup/win32_build_on_linux_using_mingw.txt
   box/trunk/docs/api-notes/backup/windows_porting.txt
   box/trunk/docs/api-notes/raidfile/lib_raidfile/
Log:
Rearrangement of api-notes directory.


Copied: box/trunk/docs/api-notes/INDEX.txt (from rev 2474, box/trunk/docs/api-notes/backup/INDEX.txt)
===================================================================
--- box/trunk/docs/api-notes/INDEX.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/INDEX.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,61 @@
+TITLE Programmers Notes for Box Backup
+
+This directory contains the programmers notes for the Box Backup system. They will be of interest only if you want to review or modify the code.
+
+These notes are intended to be run through a script to produce HTML at some later stage, hence the marks such as 'TITLE'.
+
+
+SUBTITLE Organisation
+
+The project is split up into several modules. The modules within 'lib' are building blocks for creation of the actual executable programs within 'bin'. 'test' contains unit tests for lib and bin modules.
+
+The file modules.txt lists the modules which are to be built, and their dependencies. It also allows for platform differences.
+
+
+SUBTITLE Documentation Organisation
+
+In this directory, the files correspond to modules or areas of interest. Sub directories of the same name contain files documenting specific classes within that module.
+
+
+SUBTITLE Suggested reading order
+
+* common/lib_common.txt
+* common/lib_server.txt
+* bin_bbackupd.txt
+* backup_encryption.txt
+* bin_bbstored.txt
+* raidfile/lib_raidfile.txt
+
+and refer to other sections as required.
+
+
+SUBTITLE Building
+
+The makefiles are generated by makebuildenv.pl. (The top level makefile is generated by makeparcels.pl, but this is for the end user to use, not a programmer.)
+
+To build a module, cd to it and type make. If the -DRELEASE option is specified (RELEASE=1 with GNU make) the release version will be built. The object files and exes are placed in a directory structure under 'release' or 'debug'.
+
+It is intended that a test will be written for everything, so in general make commands will be issued only within the test/* modules. Once it has been built, cd to debug/test/<testname> and run the test with ./t .
+
+
+SUBTITLE Programming style
+
+The code is written to be easy to write. Ease of programming is the primary concern, as this should lead to fewer bugs. Efficiency improvements can be made later when the system as a whole works.
+
+Much use is made of the STL.
+
+There is no common base class.
+
+All errors are reported using exceptions.
+
+Some of the boring code is generated by perl scripts from description files.
+
+There are a lot of modules and classes which can easily be used to build other projects in the future -- there is a lot of "framework" code.
+
+
+SUBTITLE Lots more documentation
+
+The files are extensively commented. Consider these notes an overview, and then read the source files for detailed and definitive information.
+
+Each function and class has a very brief description of its purpose in a standard header, and extensive efforts have been made to comment the code itself.
+

Copied: box/trunk/docs/api-notes/Win32_Clients.txt (from rev 2474, box/trunk/docs/api-notes/backup/Win32_Clients.txt)
===================================================================
--- box/trunk/docs/api-notes/Win32_Clients.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/Win32_Clients.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,13 @@
+The basic client tools now run on Win32 natively. 
+The port was done by nick at omniis.com.
+
+* bbackupd
+* bbackupquery
+* bbackupctl
+
+These tools have been ported. bbackupd runs as an NT-style service.
+
+Known limitations:
+
+* File attributes and permissions are not backed up.
+

Deleted: box/trunk/docs/api-notes/backup/INDEX.txt
===================================================================
--- box/trunk/docs/api-notes/backup/INDEX.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/INDEX.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,61 +0,0 @@
-TITLE Programmers Notes for Box Backup
-
-This directory contains the programmers notes for the Box Backup system. They will be of interest only if you want to review or modify the code.
-
-These notes are intended to be run through a script to produce HTML at some later stage, hence the marks such as 'TITLE'.
-
-
-SUBTITLE Organisation
-
-The project is split up into several modules. The modules within 'lib' are building blocks for creation of the actual executable programs within 'bin'. 'test' contains unit tests for lib and bin modules.
-
-The file modules.txt lists the modules which are to be built, and their dependencies. It also allows for platform differences.
-
-
-SUBTITLE Documentation Organisation
-
-In this directory, the files correspond to modules or areas of interest. Sub directories of the same name contain files documenting specific classes within that module.
-
-
-SUBTITLE Suggested reading order
-
-* lib_common
-* lib_server
-* bin_bbackupd
-* backup_encryption.txt
-* bin_bstored
-* lib_raidfile
-
-and refer to other sections as required.
-
-
-SUBTITLE Building
-
-The makefiles are generated by makebuildenv.pl. (The top level makefile is generated by makeparcels.pl, but this is for the end user to use, not a programmer.)
-
-To build a module, cd to it and type make. If the -DRELEASE option is specified (RELEASE=1 with GNU make) the release version will be built. The object files and exes are placed in a directory structure under 'release' or 'debug'.
-
-It is intended that a test will be written for everything, so in general make commands will be issued only within the test/* modules. Once it has been built, cd to debug/test/<testname> and run the test with ./t .
-
-
-SUBTITLE Programming style
-
-The code is written to be easy to write. Ease of programming is the primary concern, as this should lead to fewer bugs. Efficiency improvements can be made later when the system as a whole works.
-
-Much use is made of the STL.
-
-There is no common base class.
-
-All errors are reported using exceptions.
-
-Some of the boring code is generated by perl scripts from description files.
-
-There are a lot of modules and classes which can easily be used to build other projects in the future -- there is a lot of "framework" code.
-
-
-SUBTITLE Lots more documentation
-
-The files are extensively commented. Consider these notes an overview, and then read the source files for detailed and definitive information.
-
-Each function and class has a very brief description of its purpose in a standard header, and extensive efforts have been made to comment the code itself.
-

Deleted: box/trunk/docs/api-notes/backup/Win32_Clients.txt
===================================================================
--- box/trunk/docs/api-notes/backup/Win32_Clients.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/Win32_Clients.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,13 +0,0 @@
-The basic client tools now run on Win32 natively. 
-The port was done by nick at omniis.com.
-
-* bbackupd
-* bbackupquery
-* bbackupctl
-
-These tools have been ported. bbackupd runs as an NT-style service.
-
-Known limitations:
-
-* File attributes and permissions are not backed up.
-

Deleted: box/trunk/docs/api-notes/backup/backup_encryption.txt
===================================================================
--- box/trunk/docs/api-notes/backup/backup_encryption.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/backup_encryption.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,109 +0,0 @@
-TITLE Encryption in the backup system
-
-This document explains how everything is encrypted in the backup system, and points to the various functions which need reviewing to ensure they do actually follow this scheme.
-
-
-SUBTITLE Security objectives
-
-The crypto system is designed to keep the following things secret from an attacker who has full access to the server.
-
-* The names of the files and directories
-* The contents of files and directories
-* The exact size of files
-
-Things which are not secret are
-
-* Directory hierarchy and number of files in each directory
-* How the files change over time
-* Approximate size of files
-
-
-SUBTITLE Keys
-
-There are four separate keys used:
-
-* Filename
-* File attributes
-* File block index
-* File data
-
-and an additional secret for file attribute hashes.
-
-The Cipher is Blowfish in CBC mode in most cases, except for the file data. All keys are maximum length 448 bit keys, since the key size only affects the setup time and this is done very infrequently.
-
-The file data is encrypted with AES in CBC mode, with a 256 bit key (max length). Blowfish is used elsewhere because the larger block size of AES, while more secure, would be terribly space inefficient. Note that Blowfish may also be used when older versions of OpenSSL are in use, and for backwards compatibility with older versions.
-
-The keys are generated using "openssl rand", and a 1k file of key material is stored in /etc/box/bbackupd. The configuration scripts make this readable only by root.
-
-Code for review: BackupClientCryptoKeys_Setup()
-in lib/backupclient/BackupClientCryptoKeys.cpp
-
-
-SUBTITLE Filenames
-
-Filenames need to be secret from the attacker, but they need to be compared on the server so it can determine whether or not it is a new version of an old file.
-
-So, the same Initialisation Vector is used for every single filename, so the same filename encrypted twice will have the same binary representation.
-
-Filenames use standard PKCS padding implemented by OpenSSL. They are preceded by two bytes of header which describe the length and the encoding.
-
-Code for review: BackupStoreFilenameClear::EncryptClear()
-in lib/backupclient/BackupStoreFilenameClear.cpp
-
-
-SUBTITLE File attributes
-
-These are kept secret as well, since they reveal information. Especially as they contain the target name of symbolic links.
-
-To encrypt, a random Initialisation Vector is chosen. This is stored first, followed by the attribute data encrypted with PKCS padding.
-
-Code for review: BackupClientFileAttributes::EncryptAttr()
-in lib/backupclient/BackupClientFileAttributes.cpp
-
-
-SUBTITLE File attribute hashes
-
-To detect and update file attributes efficiently, the file status change time is not used, as this would give spurious results and cause unnecessary updates to the server. Instead, a hash of user id, group id, and mode is used.
-
-To avoid revealing details about attributes
-
-1) The filename is added to the hash, so that an attacker cannot determine whether or not two files have identical attributes
-
-2) A secret is added to the hash, so that an attacker cannot compare attributes between accounts.
-
-The hash used is the first 64 bits of an MD5 hash.
-
-
-SUBTITLE File block index
-
-Files are encoded in blocks, so that the rsync algorithm can be used on them. The data is compressed first before encryption. These small blocks don't give the best possible compression, but there is no alternative because the server can't see their contents.
-
-The file contains a number of blocks, which contain among other things
-
-* Size of the block when it's not compressed
-* MD5 checksum of the block
-* RollingChecksum of the block
-
-We don't want the attacker to know the size, so the first is bad. (Because of compression and padding, there's uncertainty on the size.)
-
-When the block is only a few bytes long, the latter two reveal its contents with only a moderate amount of work. So these need to be encrypted.
-
-In the header of the index, a 64 bit number is chosen. The sensitive parts of the block are then encrypted, without padding, with an Initialisation Vector of this 64 bit number + the block index.
-
-If a block from a previous file is included in a new version of a file, the same checksum data will be encrypted again, but with a different IV. An eavesdropper will be able to easily find out which data has been re-encrypted, but the plaintext is not revealed.
-
-Code for review: BackupStoreFileEncodeStream::Read() (IV base chosen about half-way through)
-BackupStoreFileEncodeStream::EncodeCurrentBlock() (encrypt index entry)
-in lib/backupclient/BackupStoreFileEncodeStream.cpp
-
-
-SUBTITLE File data
-
-As above, the file is split into chunks and compressed.
-
-Then, a random initialisation vector is chosen, stored first, followed by the compressed file data encrypted using PKCS padding.
-
-Code for review: BackupStoreFileEncodeStream::EncodeCurrentBlock()
-in lib/backupclient/BackupStoreFileEncodeStream.cpp
-
-
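The attribute-hash scheme described in backup_encryption.txt above can be illustrated with a short Python sketch. It is not taken from the Box Backup sources (which are C++): the function name attribute_hash, the fixed-width packing of uid/gid/mode and the order in which the fields are fed to MD5 are all assumptions made for illustration. Only the ingredients (user id, group id, mode, filename, a secret, first 64 bits of MD5) come from the notes.

```python
import hashlib
import struct


def attribute_hash(uid: int, gid: int, mode: int,
                   filename: bytes, secret: bytes) -> int:
    """Return a 64-bit attribute hash (hypothetical layout, for illustration)."""
    md5 = hashlib.md5()
    # Assumed layout: three unsigned 32-bit values in network byte order.
    md5.update(struct.pack("!III", uid, gid, mode))
    # The filename is mixed in so identical attributes on two files differ.
    md5.update(filename)
    # The per-account secret stops comparisons between accounts.
    md5.update(secret)
    # Keep only the first 64 bits (8 bytes) of the 128-bit digest.
    return int.from_bytes(md5.digest()[:8], "big")


if __name__ == "__main__":
    h = attribute_hash(1000, 1000, 0o644, b"notes.txt", b"account-secret")
    print(f"{h:016x}")
```

Because the filename and the secret are part of the input, two files with the same owner and mode produce different values, which is the property the notes ask for.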

Deleted: box/trunk/docs/api-notes/backup/bin_bbackupd.txt
===================================================================
--- box/trunk/docs/api-notes/backup/bin_bbackupd.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/bin_bbackupd.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,88 +0,0 @@
-TITLE bin/bbackupd
-
-The backup client daemon.
-
-This aims to maintain as little information as possible to record which files have been uploaded to the server, while minimising the number of queries which have to be made to the server.
-
-
-SUBTITLE Scanning
-
-The daemon is given a length of time, t; only files older than this are uploaded to the server. This stops recently updated files from being uploaded immediately, which avoids uploading the same file repeatedly (on the assumption that if a file has been written, it is likely to be modified again shortly).
-
-It will scan the files at a configured interval, and connect to the server if it needs to upload files or make queries about files and directories.
-
-The scan interval is actually varied slightly between each run by adding a random number up to a 64th of the configured time. This is to reduce cyclic patterns of load on the backup servers -- otherwise if all the boxes are turned on at about 9am, every morning at 9am there will be a huge spike in load on the server.
-
-Each scan chooses a time interval, which ends at the current time - t. This will be from 0 to current time - t on the first run, then the next run takes the start time as the end time of the previous run. The scan is only performed if the difference between the start and end times is greater than or equal to t.
-
-For each configured location, the client scans the directories on disc recursively.
-
-For each directory
-
-* If the directory has never been scanned before (in this invocation of the daemon) or the modified time on the directory is not that recorded, the listing on the server is downloaded.
-
-* For each file, if its modified time is within the time period, it is uploaded. If the directory has been downloaded, it is compared against that, and only uploaded if it's changed.
-
-* Find all the new files, and upload them if they lie within the time interval.
-
-* Recurse to sub directories, creating them on the server if necessary.
-
-Hence, the first time it runs, it will download and compare the entries on the disc to those on the server, but in future runs it will use the file and directory modification times to work out if there is anything which needs uploading.
-
-If there aren't any changes, it won't even need to connect to the server.
-
-There are some extra details which allow this to work reliably, but they are documented in the source.
-
-
-SUBTITLE File attributes
-
-The backup client will update the file attributes on files as soon as it notices they are changed. It records most of the details from stat(), but only a few can be restored. Attributes will only be considered changed if the user id, group id or mode is changed. Detection is by a 64 bit hash, so detection is strictly speaking probabilistic.
-
-
-SUBTITLE Encryption
-
-All the user data is encrypted. There is a separate file, backup_encryption.txt which describes this, and where in the code to look to verify it works as described.
-
-
-SUBTITLE Tracking files and directories
-
-Renaming files is a difficult problem under this minimal data scanning scheme, because you don't really know whether a file has been renamed, or another file deleted and new one created.
-
-The solution is to keep (on disc) a map of inode numbers to server object IDs for all directories and files over a certain user configurable threshold. Then, when a new file is discovered, it is first checked to see if it's in this map. If so, a rename is considered, which will take place if the local object corresponding to the name of the tracked object doesn't exist any more.
-
-Because of the renaming requirement, deletions of objects from the server are recorded and delayed until the end of the scan.
-
-
-SUBTITLE Running out of space
-
-If the store server indicates on login that the store is out of space, the backup client will scan, but not upload anything nor adjust its internal stored details of the local objects. However, deletions and renames still happen.
-
-This is to allow deletions to still work and reduce the amount of storage space used on the server, in the hope that in the future there will be enough space.
-
-Just not doing anything would mean that one big file created and then deleted at the wrong time would stall the whole backup process.
-
-
-SUBTITLE BackupDaemon
-
-This is the daemon class for the backup daemon. It handles setting up of all the objects, and implements calculation of the time intervals for the scanning.
-
-
-SUBTITLE BackupClientContext
-
-State information for the scans, including maintaining a connection to the store server if required.
-
-
-SUBTITLE BackupClientDirectoryRecord
-
-A record of the state of a directory on the local filesystem. It contains the recursive scanning function, which is long and entertaining, but very necessary. The function contains lots of comments which explain the exact details of what's going on.
-
-
-SUBTITLE BackupClientInodeToIDMap
-
-An implementation of a map of inode number to object ID on the server. If Berkeley DB is available on the platform, it is stored on disc, otherwise there is an in memory version which isn't so good.
-
-
-
-
-
-
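The scan timing rules in bin_bbackupd.txt above (a random addition of up to a 64th of the configured interval, and a window that ends at current time - t) can be sketched in Python as follows. This is an illustration only, not the daemon's C++ code: the function names next_scan_delay and scan_window, the use of seconds as the unit, and the uniform distribution for the random addition are assumptions.

```python
import random


def next_scan_delay(configured_interval: float, rng: random.Random) -> float:
    """Configured interval plus a random extra of up to 1/64 of it."""
    return configured_interval + rng.uniform(0, configured_interval / 64)


def scan_window(previous_end: float, now: float, t: float):
    """Return (start, end) for the next scan, or None if no scan is due.

    previous_end is 0 on the first run; afterwards it is the end time
    of the previous scan's window.
    """
    start = previous_end
    end = now - t
    # Only scan when the window is at least t long.
    if end - start >= t:
        return (start, end)
    return None


if __name__ == "__main__":
    rng = random.Random()
    print(next_scan_delay(3600, rng))
    print(scan_window(0, 1000, 100))
```

With a one hour interval the delay varies by at most about 56 seconds, which is enough to spread out clients that were all started at the same moment.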

Deleted: box/trunk/docs/api-notes/backup/bin_bbstored.txt
===================================================================
--- box/trunk/docs/api-notes/backup/bin_bbstored.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/bin_bbstored.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,54 +0,0 @@
-TITLE bin/bbstored
-
-The backup store daemon.
-
-Maintains a store of encrypted files, and every so often goes through deleting unnecessary data.
-
-Uses an implementation of Protocol to communicate with the backup client daemon. See bin/bbstored/backupprotocol.txt for details.
-
-
-SUBTITLE Data storage
-
-The data is arranged as a set of objects within a RaidFile disc set. Each object has a 64 bit object ID, which is turned into a filename in a mildly complex manner which ensures that directories don't have too many objects in them, but there is a minimal number of nested directories. See StoreStructure::MakeObjectFilename in lib/backupstore/StoreStructure.cpp for more details.
-
-An object can be a directory or a file. Directories contain files and other directories.
-
-Files in directories are superseded by new versions when uploaded, but the old versions are flagged as such. A new version has a different object ID to the old version.
-
-Every so often, a housekeeping process works out what can be deleted, and deletes unnecessary files to take them below the storage limits set in the store info file.
-
-
-SUBTITLE Note about file storage and downloading
-
-There's one slight entertainment to file storage, in that the format of the file streamed depends on whether it's being downloaded or uploaded.
-
-The problem is that it contains an index of all the blocks. For efficiency in managing these blocks, they all need to be in the same place.
-
-Files are encoded and decoded as they are streamed to and from the server. With encoding, the index is only completely known at the end of the process, so it's sent last, and lives in the filesystem last.
-
-When it's downloaded, it can't be decoded without knowing the index. So the index is sent first, followed by the data.
-
-
-SUBTITLE BackupContext
-
-The context of the current connection, and the object which modifies the store.
-
-Maintains a cache of directories, to avoid reading them continuously, and keeps track of a BackupStoreInfo object which is written back periodically.
-
-
-SUBTITLE BackupStoreDaemon
-
-A ServerTLS daemon which forks off a separate housekeeping process as it starts up.
-
-Handling connections is delegated to a Protocol implementation.
-
-
-SUBTITLE BackupCommands.cpp
-
-Implementation of all the commands. Work which requires writing is handled in the context, read only commands mainly in this file.
-
-
-SUBTITLE HousekeepStoreAccount
-
-A class which performs housekeeping on a single account.
-

Deleted: box/trunk/docs/api-notes/backup/encrypt_rsync.txt
===================================================================
--- box/trunk/docs/api-notes/backup/encrypt_rsync.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/encrypt_rsync.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,66 +0,0 @@
-TITLE Encrypted rsync algorithm
-
-The backup system uses a modified version of the rsync algorithm. A description of the plain algorithm can be found here:
-
-	http://samba.anu.edu.au/rsync/tech_report/
-
-The algorithm is modified to allow the server side to be encrypted, yet still benefit from the reduced bandwidth usage. For a single file transfer, the result will be only slightly less efficient than plain rsync. For a backup of a large directory, the overall bandwidth may be less due to the way the backup client daemon detects changes.
-
-This document assumes you have read the rsync document.
-
-The code is in lib/backupclient/BackupStoreFile*.*.
-
-
-SUBTITLE Blocks
-
-Each file is broken up into small blocks. These are individually compressed and encrypted, and have an entry in an index which contains, encrypted, its weak and strong checksums and decoded plaintext size. This is all done on the client.
-
-Why not just encrypt the file, and use the standard rsync algorithm?
-
-1) Compression cannot be used, since encryption turns the file into essentially random data. This is not very compressible.
-
-2) Any modification to the file will result in all data after that in the file having different ciphertext (in any cipher mode we might want to use). Therefore the rsync algorithm will only be able to detect "same" blocks up until the first modification.  This significantly reduces the effectiveness of the process.
-
-Note that blocks are not all the same size. The last block in the file is unlikely to be a full block, and if data is inserted which is not an integral multiple of the block size, odd sized blocks need to be created. This is because the server cannot reassemble the blocks, because the contents are opaque to the server.
-
-
-SUBTITLE Modified algorithm
-
-To produce a list of the changes to send the new version, the client requests the block index of the file. This is the same step as requesting the weak and strong checksums from the remote side with rsync.
-
-The client then decrypts the index, and builds a list of the 8 most used block sizes above a certain threshold size.
-
-The new version of the file is then scanned in exactly the same way as rsync for these 8 block sizes. If a block is found, then it is added to a list of found blocks, sorted by position in the file. If a block has already been found at that position, then the old entry is only replaced by the new entry if the new entry is a "better" (bigger) match.
-
-The block size covering the biggest file area is searched first, so that most of the file can be skipped over after the first pass without expensive checksumming.
-
-A "recipe" is then built from the found list, by trivially discarding overlapping blocks. Each entry consists of a number of bytes of "new" data, a block start number, and a number of blocks from the old file. The data is stored like this as a memory optimisation, assuming that files mostly stay the same rather than having all their blocks reordered.
-
-The file is then encoded, with new data being sent as blocks of data, and references to blocks in the old file. The new index is built completely, as the checksums and size need to be re-encrypted to match their position in the index.
-
-
-SUBTITLE Combination on server
-
-The "diff" which is sent from the client is assembled into a full file on the server, simply by adding in blocks from the old file where they are specified in the block index.
-
-
-SUBTITLE Storage on server
-
-Given that the server will in general store several versions of a file, combining old and new files to form a new file is not terribly efficient on storage space. Particularly for large multi-Gb database files.
-
-An alternative scheme is outlined below, however, it is significantly more complex to implement, and so is not implemented in this version.
-
-1) In the block index of the files, store the file ID of the file which each block is sourced from. This allows a single file to reference blocks from many files.
-
-2) When the file is downloaded, the server combines the blocks from all the files into a new file as it is streamed to the client. (This is not particularly complicated to do.)
-
-This all sounds fine, until housekeeping is considered. Old versions need to be deleted, without losing any blocks necessary for future versions.
-
-Instead of just deleting a file, the server works out which blocks are still required, and rebuilds the file omitting those blocks which aren't required.
-
-This complicates working out how much space a file will release when it is "deleted", and indeed, adds a whole new level of complexity to the housekeeping process. (And the tests!)
-
-The directory structure will need an additional flag, "Partial file", which specifies that the entry cannot be built as previous blocks are no longer available. Entries with this flag should never be sent to the client.
-
-
-
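The block size selection step in encrypt_rsync.txt above (the 8 most used block sizes above a threshold, with the size covering the biggest file area searched first) can be sketched in Python. This is a simplified illustration, not the code in lib/backupclient: the name choose_block_sizes, the strict "greater than" comparison against the threshold, and the use of size times count as the measure of file area are assumptions.

```python
from collections import Counter


def choose_block_sizes(index_sizes, min_size, max_sizes=8):
    """Pick the block sizes to scan for, in search order.

    index_sizes: plaintext sizes of the blocks in the decrypted index.
    min_size: sizes at or below this threshold are ignored (assumed strict).
    """
    counts = Counter(size for size in index_sizes if size > min_size)
    # The most frequently used sizes, at most max_sizes of them.
    most_used = counts.most_common(max_sizes)
    # Search the size covering the largest area (size * count) first.
    most_used.sort(key=lambda item: item[0] * item[1], reverse=True)
    return [size for size, _count in most_used]


if __name__ == "__main__":
    sizes = [4096] * 10 + [2048] * 30 + [100] * 50 + [512] * 3
    print(choose_block_sizes(sizes, min_size=256))
```

In the example the 100 byte blocks are dropped by the threshold, and 2048 is searched before 4096 because thirty 2048 byte blocks cover more of the file than ten 4096 byte blocks.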

Deleted: box/trunk/docs/api-notes/backup/lib_backupclient.txt
===================================================================
--- box/trunk/docs/api-notes/backup/lib_backupclient.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/lib_backupclient.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,46 +0,0 @@
-TITLE lib/backupclient
-
-Classes used on the client and on the server.
-
-See documentation in the files for more details.
-
-
-SUBTITLE BackupStoreDirectory
-
-The directory listing class, containing a number of entries, representing files.
-
-
-SUBTITLE BackupStoreFile
-
-Handles compressing and encrypting files, and decoding files downloaded from the server.
-
-
-SUBTITLE BackupStoreFilename
-
-An encrypted filename.
-
-
-SUBTITLE BackupStoreFilenameClear
-
-Derived from BackupStoreFilename, but with the ability to encrypt and decrypt filenames. Client side only.
-
-
-SUBTITLE BackupClientFileAttributes
-
-Only used on the client -- the server treats attributes as blocks of opaque data.
-
-This reads attributes from files on discs, stores them, encrypts them, and applies them to new files.
-
-Also has a static function to generate filename attribute hashes given a struct stat and the filename.
-
-
-SUBTITLE BackupClientRestore
-
-Routines to restore files from the server onto the client filesystem.
-
-
-SUBTITLE BackupClientCryptoKeys
-
-This reads the key material from disc, and sets up the crypto for storing files, attributes and directories.
-
-

Deleted: box/trunk/docs/api-notes/backup/lib_backupstore.txt
===================================================================
--- box/trunk/docs/api-notes/backup/lib_backupstore.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/lib_backupstore.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,30 +0,0 @@
-TITLE lib/backupstore
-
-Classes which are shared amongst the server side store utilities, bbstored and bbstoreaccounts. Note that this also depends on lib/backupclient, as a lot of code is shared between the client and server.
-
-
-SUBTITLE BackupStoreAccountDatabase
-
-A simple implementation of an account database. This will be replaced with a more suitable implementation.
-
-
-SUBTITLE BackupStoreAccounts
-
-An interface to the account database, and knowledge of how to initialise an account on disc.
-
-
-SUBTITLE BackupStoreConfigVerify
-
-The same configuration file is used by all the utilities. This is the Configuration verification structure for this file.
-
-
-SUBTITLE BackupStoreInfo
-
-The "header" information about an account, specifying current disc usage, space limits, etc.
-
-
-SUBTITLE StoreStructure
-
-Functions specifying how the files are laid out on disc in the store.
-
-

Deleted: box/trunk/docs/api-notes/backup/win32_build_on_cygwin_using_mingw.txt
===================================================================
--- box/trunk/docs/api-notes/backup/win32_build_on_cygwin_using_mingw.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/win32_build_on_cygwin_using_mingw.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,53 +0,0 @@
-How to build Box Backup on Win32 using Cygwin and MinGW
-By Chris Wilson, 2007-05-26
-
-(To read this document online with better formatting, browse to:
-http://www.boxbackup.org/trac/wiki/CompileWithMinGW)
-
-Start by installing Cygwin on your Windows machine [http://www.cygwin.org].
-Make sure to select the following packages during installation:
-
-* Devel/gcc-mingw
-* Devel/gcc-mingw-core
-* Devel/gcc-mingw-g++
-* Mingw/mingw-zlib
-
-If you already have Cygwin installed, please re-run the installer and
-ensure that those packages are installed.
-
-Download OpenSSL from 
-[http://www.openssl.org/source/openssl-0.9.7i.tar.gz]
-
-Open a Cygwin shell, and unpack OpenSSL:
-
-	tar xzvf openssl-0.9.7i.tar.gz
-
-Configure OpenSSL for MinGW compilation, and build and install it:
-
-	cd openssl-0.9.7i
-	./Configure --prefix=/usr/i686-pc-mingw32/ mingw
-	make
-	make install
-
-Download PCRE from 
-[http://prdownloads.sourceforge.net/pcre/pcre-6.3.tar.bz2?download]
-
-Open a Cygwin shell, and unpack PCRE:
-
-	tar xjvf pcre-6.3.tar.bz2
-
-Configure PCRE for MinGW compilation, and build and install it:
-	
-	cd pcre-6.3
-	export CFLAGS="-mno-cygwin"
-	./configure
-	make winshared
-	cp .libs/libpcre.a .libs/libpcreposix.a /usr/lib/mingw
-	cp pcreposix.h /usr/include/mingw
-
-Now unpack the Box Backup sources, enter the source directory,
-and configure like this:
-
-	./infrastructure/mingw/configure.sh
-	make
-

Deleted: box/trunk/docs/api-notes/backup/win32_build_on_linux_using_mingw.txt
===================================================================
--- box/trunk/docs/api-notes/backup/win32_build_on_linux_using_mingw.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/win32_build_on_linux_using_mingw.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,108 +0,0 @@
-How to build Box Backup for Windows (Native) on Linux using MinGW
-By Chris Wilson, 2005-12-07
-
-Install the MinGW cross-compiler for Windows:
-
-- Debian and Ubuntu users can "apt-get install mingw32"
-- Fedora and SuSE users can download RPM packages from 
-  [http://mirzam.it.vu.nl/mingw/]
-
-You will need to know the prefix used by the cross-compiler executables.
-It will usually be something like "ix86-mingw32*-". All the binaries in the
-cross-compiler package will start with this prefix. The documentation below
-assumes that it is "i386-mingw32-". Adjust to taste.
-
-You will also need to install Wine and the Linux kernel "binary formats"
-(binfmt) support, so that you can run Windows executables on Linux,
-otherwise the configure scripts will not work properly with a cross-compiler.
-On Ubuntu, run:
-
-	apt-get install wine binfmt-support
-	/etc/init.d/binfmt-support start
-
-Start by downloading Zlib from [http://www.zlib.net/], unpack and enter
-source directory:
-
-	export CC=i386-mingw32-gcc 
-	export AR="i386-mingw32-ar rc" 
-	export RANLIB="i386-mingw32-ranlib"
-	./configure
-	make
-	make install prefix=/usr/local/i386-mingw32
-
-Download OpenSSL 0.9.8b from 
-[http://www.openssl.org/source/openssl-0.9.8b.tar.gz]
-
-Unpack and configure:
-
-	tar xzvf openssl-0.9.8b.tar.gz
-	cd openssl-0.9.8b
-	./Configure --prefix=/usr/local/i386-mingw32 mingw
-	make makefile.one
-	wget http://www.boxbackup.org/svn/box/chris/win32/support/openssl-0.9.8b-mingw-cross.patch
-	patch -p1 < openssl-0.9.8b-mingw-cross.patch
-	make -f makefile.one
-	make -f makefile.one install
-
-Download PCRE from 
-[http://prdownloads.sourceforge.net/pcre/pcre-6.3.tar.bz2?download]
-
-Unpack:
-
-	tar xjvf pcre-6.3.tar.bz2
-	cd pcre-6.3
-
-Configure and make:
-
-	export AR=i386-mingw32-ar
-	./configure --host=i386-mingw32 --prefix=/usr/local/i386-mingw32/
-	make winshared
-
-If you get this error:
-
-	./dftables.exe pcre_chartables.c
-	/bin/bash: ./dftables.exe: cannot execute binary file
-	make: *** [pcre_chartables.c] Error 126
-
-then run:
-
-	wine ./dftables.exe pcre_chartables.c
-	make winshared
-
-to complete the build. Finally:
-
-	cp .libs/libpcre.a /usr/local/i386-mingw32/lib
-	cp .libs/libpcreposix.a /usr/local/i386-mingw32/lib
-	cp pcreposix.h /usr/local/i386-mingw32/include
-
-You will need to find a copy of mingwm10.dll that matches your cross-compiler.
-Most MinGW distributions should come with it. On Debian and Ubuntu, for some
-bizarre reason, you'll find it compressed as
-/usr/share/doc/mingw32-runtime/mingwm10.dll.gz, in which case you'll
-have to un-gzip it with "gzip -d". Copy it to a known location, e.g.
-/usr/local/i386-mingw32/bin.
-
-Download and extract Box Backup, and change into the base directory,
-e.g. boxbackup-0.11rc2. Change the path to mingwm10.dll in parcels.txt to
-match where you found or installed it.
-
-Now configure Box with:
-
-	./configure --host=i386-mingw32 \
-		CXXFLAGS="-mthreads -I/usr/local/i386-mingw32/include" \
-		LDFLAGS=" -mthreads -L/usr/local/i386-mingw32/lib" \
-		LIBS="-lcrypto -lws2_32 -lgdi32"
-	make
-
-or, if that fails, try this:
-
-	export CXX="i386-mingw32-g++"
-	export AR=i386-mingw32-ar
-	export RANLIB=i386-mingw32-ranlib
-	export CFLAGS="-mthreads"
-	export CXXFLAGS="-mthreads"
-	export LDFLAGS="-mthreads"
-	export LIBS="-lcrypto -lws2_32 -lgdi32"
-	(if you don't have a "configure" file, run "./bootstrap")
-	./configure --target=i386-mingw32
-	make CXX="$CXX" AR="$AR" RANLIB="$RANLIB" WINDRES="i386-mingw32-windres"

Deleted: box/trunk/docs/api-notes/backup/windows_porting.txt
===================================================================
--- box/trunk/docs/api-notes/backup/windows_porting.txt	2009-03-28 15:41:09 UTC (rev 2477)
+++ box/trunk/docs/api-notes/backup/windows_porting.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -1,100 +0,0 @@
-TITLE Notes on porting the backup client to Windows
-
-It should be relatively easy to port the backup client (bbackupd) to Windows. However, the server relies on unlink() behaviour of the UNIX filesystem which is different on the Windows platform, so it will not be so easy to port.
-
-An installation of perl is required to build the system. The ActiveState port is the easiest to install.
-
-
-SUBTITLE Build environment
-
-The build environment generation script, makebuildenv.pl, uses perl to scan all the files and generate makefiles. It's rather orientated towards UNIX systems with gcc.
-
-Probably the easiest way to handle this is to detect the Windows platform, set up a few variables appropriately (in BoxPlatform.pm) and then post-process the generated Makefiles to mould them into something more handy for the MS Windows compiler and toolchain.
-
-The script itself is a bit messy. It was fine at first, but then the multi-platform thing got in the way. I do intend to rewrite it at some point in the future.
-
-Make sure your new version defines PLATFORM_WIN32 on the compile line.
-
-All files #include "Box.h" as the first include file. Use this for pre-compiled headers. Edit BoxPlatform.h to include the Windows headers required, and include a PLATFORM_WIN32 section. The easiest start will be to leave this blank, apart from typedefs for the basic types and any "not supported" #defines you can find.
-
-Boring bits of the code, such as exceptions and protocol definitions, are autogenerated using perl scripts. This code should be portable without modification.
-
-I have tried to avoid the things I know won't work with the MS compiler, so hopefully the code will be fairly clean. However, it might be a little easier to use the MinGW compiler [ http://www.mingw.org/ ] just to be consistent with the UNIX version. But I hope this won't be necessary.
-
-You'll need the latest version of OpenSSL. This was slightly difficult to get to compile last time I tried -- especially if you're determined to use the optimised assembler version. The main difficulty was getting a version which would link properly with the options used in my project, the default libraries selected got in the way.
-
-
-SUBTITLE Porting as UNIX emulation
-
-Since the daemon uses so few UNIX system calls and with a limited set of options, it seems to make sense to port it by writing emulations of these functions. It's probably nicest to create a lib/win32 directory, and populate this with header files corresponding to the UNIX header files used. These just contain inline functions which map the UNIX calls to Win32 calls.
-
-File/socket handles may have to be translated. -1 is used as a failure return value and by the code internally to mark an invalid socket handle. (0 is a valid socket handle)
-
-Of course, some bits of code aren't relevant, so will just be #ifdefed out, or replaced. But this should be minimal. (Only perhaps the small bit relating to filesystem structure -- there aren't really mount points as such.)
-
-
-SUBTITLE File tracking
-
-The daemon uses the inode number of a file to keep track of files and directories, so when they're renamed they can be moved efficiently on the store. Some unique (per filesystem) number will have to be found and used instead.
-
-It uses the Berkeley DB to store these on disc. It's likely another storage system will need to be used. (It just has to map the file's unique number to an 8 byte struct.)
-
-There is an in-memory implementation for platforms which don't support Berkeley DB, but this isn't so good when the daemon has to be restarted as all the tracking is lost. But it's an easy start.
-
-
-SUBTITLE Expected filesystem behaviour
-
-Files and directories have (at least) two modification times, for contents and attributes.
-
-For files, the contents modification time must change when the contents change, and the attributes time when the attributes change (and may change when the contents change too.)
-
-For directories, the contents modification time must change when files or directories are deleted or added. If it changes any more frequently than this, then the client will be slightly less efficient -- it will download the store's directory listing whenever this time changes. The attributes modification time is less important, as the actual attributes are compared and only uploaded if different.
-
-
-SUBTITLE Attributes
-
-Attributes means file modification times, flags, and filesystem permissions.
-
-The BackupClientFileAttribute class will need to be extended. Allocate another "attribute type" for the Win32 attributes, and then serialise it in a compatible way -- put your new attribute type in the header, and then a serialised network byte order structure in the rest. The different size of block is handled for you, and the server never looks inside.
-
-Add code so that under UNIX, Win32 attributes are ignored, and UNIX attributes under Win32.
-
-It's probably not necessary to worry too much about these for the first version. Not many people seem to use these attributes anyway.
-
-
-SUBTITLE Times
-
-The system uses its own 64 bit time type -- see BoxTime.h. Everything is translated to this from the various different system time types, and calculated and stored internally in this form.
-
-
-SUBTITLE Daemon as a Service
-
-The client is derived from the Daemon class, which implements a daemon. The interface is simple, and it shouldn't be hard to write a compatible class which implements a Windows Service instead.
-
-Or cheat and run it as a Win32 application.
-
-Note that the daemon expects to be able to read every file it wants, and will abort a scan and upload run if it gets an error. The daemon must therefore be run with sufficient privileges. It runs as root under UNIX.
-
-
-SUBTITLE Command Socket
-
-The backup daemon accepts commands from bbackupctl through a UNIX domain socket. When a connection is made, the user ID of the connecting process is checked to see if it's the same user ID as the daemon is running under.
-
-This may not have any exact analogue under Win32, so another communications scheme may have to be devised.
-
-This is only actually necessary if the client is to be run in snapshot mode. It can be safely left unimplemented if snapshot mode is not required, or the prompts for it to sync with the server are implemented some other way.
-
-
-SUBTITLE NTFS streams
-
-If you want to back up NTFS streams, then a generic solution should probably be defined, so that the Mac OS X resource forks can be backed up with the same mechanism.
-
-
-SUBTITLE Source code
-
-I work on a slightly different version of the source files. A make distribution script adds the license header and removes private sections of code. This means submitted diffs need a slight bit of translation.
-
-
-
-
-

Copied: box/trunk/docs/api-notes/backup_encryption.txt (from rev 2474, box/trunk/docs/api-notes/backup/backup_encryption.txt)
===================================================================
--- box/trunk/docs/api-notes/backup_encryption.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/backup_encryption.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,109 @@
+TITLE Encryption in the backup system
+
+This document explains how everything is encrypted in the backup system, and points to the various functions which need reviewing to ensure they do actually follow this scheme.
+
+
+SUBTITLE Security objectives
+
+The crypto system is designed to keep the following things secret from an attacker who has full access to the server.
+
+* The names of the files and directories
+* The contents of files and directories
+* The exact size of files
+
+Things which are not secret are
+
+* Directory hierarchy and number of files in each directory
+* How the files change over time
+* Approximate size of files
+
+
+SUBTITLE Keys
+
+There are four separate keys used:
+
+* Filename
+* File attributes
+* File block index
+* File data
+
+and an additional secret for file attribute hashes.
+
+The Cipher is Blowfish in CBC mode in most cases, except for the file data. All keys are maximum length 448 bit keys, since the key size only affects the setup time and this is done very infrequently.
+
+The file data is encrypted with AES in CBC mode, with a 256 bit key (max length). Blowfish is used elsewhere because the larger block size of AES, while more secure, would be terribly space inefficient. Note that Blowfish may also be used when older versions of OpenSSL are in use, and for backwards compatibility with older versions.
+
+The keys are generated using "openssl rand", and a 1k file of key material is stored in /etc/box/bbackupd. The configuration scripts make this readable only by root.
+
+Code for review: BackupClientCryptoKeys_Setup()
+in lib/backupclient/BackupClientCryptoKeys.cpp
+
+
+SUBTITLE Filenames
+
+Filenames need to be secret from the attacker, but they need to be compared on the server so it can determine whether or not it is a new version of an old file.
+
+So, the same Initialisation Vector is used for every single filename, so the same filename encrypted twice will have the same binary representation.
+
+Filenames use standard PKCS padding implemented by OpenSSL. They are preceded by two bytes of header which describe the length, and the encoding.
+
+Code for review: BackupStoreFilenameClear::EncryptClear()
+in lib/backupclient/BackupStoreFilenameClear.cpp
+
+
+SUBTITLE File attributes
+
+These are kept secret as well, since they reveal information. Especially as they contain the target name of symbolic links.
+
+To encrypt, a random Initialisation Vector is chosen. This is stored first, followed by the attribute data encrypted with PKCS padding.
+
+Code for review: BackupClientFileAttributes::EncryptAttr()
+in lib/backupclient/BackupClientFileAttributes.cpp
+
+
+SUBTITLE File attribute hashes
+
+To detect and update file attributes efficiently, the file status change time is not used, as this would give spurious results and result in unnecessary updates to the server. Instead, a hash of user id, group id, and mode is used.
+
+To avoid revealing details about attributes
+
+1) The filename is added to the hash, so that an attacker cannot determine whether or not two files have identical attributes
+
+2) A secret is added to the hash, so that an attacker cannot compare attributes between accounts.
+
+The hash used is the first 64 bits of an MD5 hash.
+
+
+SUBTITLE File block index
+
+Files are encoded in blocks, so that the rsync algorithm can be used on them. The data is compressed first before encryption. These small blocks don't give the best possible compression, but there is no alternative because the server can't see their contents.
+
+The file contains a number of blocks, which contain among other things
+
+* Size of the block when it's not compressed
+* MD5 checksum of the block
+* RollingChecksum of the block
+
+We don't want the attacker to know the size, so the first is bad. (Because of compression and padding, there's uncertainty on the size.)
+
+When the block is only a few bytes long, the latter two reveal its contents with only a moderate amount of work. So these need to be encrypted.
+
+In the header of the index, a 64 bit number is chosen. The sensitive parts of the block are then encrypted, without padding, with an Initialisation Vector of this 64 bit number + the block index.
+
+If a block from a previous file is included in a new version of a file, the same checksum data will be encrypted again, but with a different IV. An eavesdropper will be able to easily find out which data has been re-encrypted, but the plaintext is not revealed.
+
+Code for review: BackupStoreFileEncodeStream::Read() (IV base chosen about half-way through)
+BackupStoreFileEncodeStream::EncodeCurrentBlock() (encrypt index entry)
+in lib/backupclient/BackupStoreFileEncodeStream.cpp
+
+
+SUBTITLE File data
+
+As above, the file is split into chunks and compressed.
+
+Then, a random initialisation vector is chosen, stored first, followed by the compressed file data encrypted using PKCS padding.
+
+Code for review: BackupStoreFileEncodeStream::EncodeCurrentBlock()
+in lib/backupclient/BackupStoreFileEncodeStream.cpp
+
+

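As an illustration of two points in backup_encryption.txt above, here is a minimal Python sketch of the attribute hash ("first 64 bits of an MD5 hash" over user id, group id, mode, filename and a secret) and of the per-entry IV for the block index ("this 64 bit number + the block index"). The function names, the field order and widths, the big-endian encoding and the wrap-around at 64 bits are all assumptions made for the example; the code named under "Code for review" is the authority on the real layout.

```python
import hashlib
import struct

MASK64 = (1 << 64) - 1


def attribute_hash(uid, gid, mode, filename, secret):
    """Return a 64 bit hash of the attributes, mixed with filename and secret.

    Assumption: uid, gid and mode are packed as three 32 bit network byte
    order integers, followed by the filename bytes and then the secret.
    """
    packed = struct.pack("!III", uid, gid, mode)
    digest = hashlib.md5(packed + filename + secret).digest()
    # Keep only the first 64 bits (8 bytes) of the 128 bit MD5 digest.
    return int.from_bytes(digest[:8], "big")


def block_entry_iv(iv_base, block_index):
    """Return the 8 byte IV for one block index entry.

    Assumption: the sum wraps at 64 bits and is stored big-endian.
    """
    return ((iv_base + block_index) & MASK64).to_bytes(8, "big")
```

Because the filename and the secret are part of the hashed data, two files with identical attributes hash differently, and so does the same file under two accounts with different secrets. Every entry in one index gets a distinct IV as long as the index has fewer than 2^64 entries.
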
Copied: box/trunk/docs/api-notes/bin_bbackupd.txt (from rev 2474, box/trunk/docs/api-notes/backup/bin_bbackupd.txt)
===================================================================
--- box/trunk/docs/api-notes/bin_bbackupd.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/bin_bbackupd.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,88 @@
+TITLE bin/bbackupd
+
+The backup client daemon.
+
+This aims to maintain as little information as possible to record which files have been uploaded to the server, while minimising the number of queries which have to be made to the server.
+
+
+SUBTITLE Scanning
+
+The daemon is given a length of time, t. Only files last modified more than t ago are uploaded to the server. This stops recently updated files from being uploaded immediately, which avoids uploading the same file repeatedly (on the assumption that if a file has been written, it is likely to be modified again shortly).
+
+It will scan the files at a configured interval, and connect to the server if it needs to upload files or make queries about files and directories.
+
+The scan interval is actually varied slightly between each run by adding a random number up to a 64th of the configured time. This is to reduce cyclic patterns of load on the backup servers -- otherwise if all the boxes are turned on at about 9am, every morning at 9am there will be a huge spike in load on the server.
+
+Each scan chooses a time interval, which ends at the current time - t. This will be from 0 to current time - t on the first run, then the next run takes the start time as the end time of the previous run. The scan is only performed if the difference between the start and end times is greater than or equal to t.
+
+For each configured location, the client scans the directories on disc recursively.
+
+For each directory
+
+* If the directory has never been scanned before (in this invocation of the daemon) or the modified time on the directory is not that recorded, the listing on the server is downloaded.
+
+* For each file, if its modification time is within the time period, it is uploaded. If the directory has been downloaded, it is compared against that, and only uploaded if it's changed.
+
+* Find all the new files, and upload them if they lie within the time interval.
+
+* Recurse to sub directories, creating them on the server if necessary.
+
+Hence, the first time it runs, it will download and compare the entries on the disc to those on the server, but in future runs it will use the file and directory modification times to work out if there is anything which needs uploading.
+
+If there aren't any changes, it won't even need to connect to the server.
+
+There are some extra details which allow this to work reliably, but they are documented in the source.
+
+
+SUBTITLE File attributes
+
+The backup client will update the file attributes on files as soon as it notices they are changed. It records most of the details from stat(), but only a few can be restored. Attributes will only be considered changed if the user id, group id or mode is changed. Detection is by a 64 bit hash, so detection is strictly speaking probabilistic.
+
+
+SUBTITLE Encryption
+
+All the user data is encrypted. There is a separate file, backup_encryption.txt which describes this, and where in the code to look to verify it works as described.
+
+
+SUBTITLE Tracking files and directories
+
+Renaming files is a difficult problem under this minimal data scanning scheme, because you don't really know whether a file has been renamed, or another file deleted and new one created.
+
+The solution is to keep (on disc) a map of inode numbers to server object IDs for all directories and files over a certain user configurable threshold. Then, when a new file is discovered, it is first checked to see if it's in this map. If so, a rename is considered, which will take place if the local object corresponding to the name of the tracked object doesn't exist any more.
+
+Because of the renaming requirement, deletions of objects from the server are recorded and delayed until the end of the scan.
+
+
+SUBTITLE Running out of space
+
+If the store server indicates on login that the account has run out of space, the backup client will still scan, but will not upload anything or adjust its internal stored details of the local objects. However, deletions and renames still happen.
+
+This is to allow deletions to still work and reduce the amount of storage space used on the server, in the hope that in the future there will be enough space.
+
+Just not doing anything would mean that one big file created and then deleted at the wrong time would stall the whole backup process.
+
+
+SUBTITLE BackupDaemon
+
+This is the daemon class for the backup daemon. It handles setting up of all the objects, and implements calculation of the time intervals for the scanning.
+
+
+SUBTITLE BackupClientContext
+
+State information for the scans, including maintaining a connection to the store server if required.
+
+
+SUBTITLE BackupClientDirectoryRecord
+
+A record of the state of a directory on the local filesystem. It contains the recursive scanning function, which is long and entertaining, but very necessary. The function contains lots of comments which explain the exact details of what's going on.
+
+
+SUBTITLE BackupClientInodeToIDMap
+
+An implementation of a map of inode number to object ID on the server. If Berkeley DB is available on the platform, it is stored on disc, otherwise there is an in memory version which isn't so good.
+
+
+
+
+
+

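The timing rules in the "Scanning" section of bin_bbackupd.txt above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the use of plain numbers for times are invented for the example, and the real calculation lives in the BackupDaemon class.

```python
import random


def next_scan_delay(configured_interval, rng=random):
    """Configured interval plus a random amount of up to a 64th of it."""
    return configured_interval + rng.uniform(0, configured_interval / 64)


def choose_sync_period(now, min_age, previous_end=0):
    """Return (start, end) for the next scan, or None if it is too soon.

    min_age is the time t from the notes. The period ends at now - t and
    starts where the previous period ended (0 on the first run).
    """
    start = previous_end
    end = now - min_age
    if end - start < min_age:
        return None  # period shorter than t, so no scan is performed
    return (start, end)
```

With a configured interval of 3600 seconds the delay falls between 3600 and 3656.25 seconds, which is enough to stop a group of clients started at the same moment from staying in step indefinitely.
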
Copied: box/trunk/docs/api-notes/bin_bbstored.txt (from rev 2474, box/trunk/docs/api-notes/backup/bin_bbstored.txt)
===================================================================
--- box/trunk/docs/api-notes/bin_bbstored.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/bin_bbstored.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,54 @@
+TITLE bin/bbstored
+
+The backup store daemon.
+
+Maintains a store of encrypted files, and every so often goes through deleting unnecessary data.
+
+Uses an implementation of Protocol to communicate with the backup client daemon. See bin/bbstored/backupprotocol.txt for details.
+
+
+SUBTITLE Data storage
+
+The data is arranged as a set of objects within a RaidFile disc set. Each object has a 64 bit object ID, which is turned into a filename in a mildly complex manner which ensures that directories don't have too many objects in them, but there is a minimal number of nested directories. See StoreStructure::MakeObjectFilename in lib/backupstore/StoreStructure.cpp for more details.
+
+An object can be a directory or a file. Directories contain files and other directories.
+
+Files in directories are superseded by new versions when uploaded, but the old versions are flagged as such. A new version has a different object ID to the old version.
+
+Every so often, a housekeeping process works out what can be deleted, and deletes unnecessary files to take them below the storage limits set in the store info file.
+
+
+SUBTITLE Note about file storage and downloading
+
+There's one slight entertainment to file storage, in that the format of the file streamed depends on whether it's being downloaded or uploaded.
+
+The problem is that it contains an index of all the blocks. For efficiency in managing these blocks, they all need to be in the same place.
+
+Files are encoded and decoded as they are streamed to and from the server. With encoding, the index is only completely known at the end of the process, so it's sent last, and lives in the filesystem last.
+
+When it's downloaded, it can't be decoded without knowing the index. So the index is sent first, followed by the data.
+
+
+SUBTITLE BackupContext
+
+The context of the current connection, and the object which modifies the store.
+
+Maintains a cache of directories, to avoid reading them continuously, and keeps track of a BackupStoreInfo object which is written back periodically.
+
+
+SUBTITLE BackupStoreDaemon
+
+A ServerTLS daemon which forks off a separate housekeeping process as it starts up.
+
+Handling connections is delegated to a Protocol implementation.
+
+
+SUBTITLE BackupCommands.cpp
+
+Implementation of all the commands. Work which requires writing is handled in the context, read only commands mainly in this file.
+
+
+SUBTITLE HousekeepStoreAccount
+
+A class which performs housekeeping on a single account.
+

Copied: box/trunk/docs/api-notes/encrypt_rsync.txt (from rev 2474, box/trunk/docs/api-notes/backup/encrypt_rsync.txt)
===================================================================
--- box/trunk/docs/api-notes/encrypt_rsync.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/encrypt_rsync.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,66 @@
+TITLE Encrypted rsync algorithm
+
+The backup system uses a modified version of the rsync algorithm. A description of the plain algorithm can be found here:
+
+	http://samba.anu.edu.au/rsync/tech_report/
+
+The algorithm is modified to allow the server side to be encrypted, yet still benefit from the reduced bandwidth usage. For a single file transfer, the result will be only slightly less efficient than plain rsync. For a backup of a large directory, the overall bandwidth may be less due to the way the backup client daemon detects changes.
+
+This document assumes you have read the rsync document.
+
+The code is in lib/backupclient/BackupStoreFile*.*.
+
+
+SUBTITLE Blocks
+
+Each file is broken up into small blocks. These are individually compressed and encrypted, and have an entry in an index which contains, encrypted, its weak and strong checksums and decoded plaintext size. This is all done on the client.
+
+Why not just encrypt the file, and use the standard rsync algorithm?
+
+1) Compression cannot be used, since encryption turns the file into essentially random data. This is not very compressible.
+
+2) Any modification to the file will result in all data after that in the file having different ciphertext (in any cipher mode we might want to use). Therefore the rsync algorithm will only be able to detect "same" blocks up until the first modification.  This significantly reduces the effectiveness of the process.
+
+Note that blocks are not all the same size. The last block in the file is unlikely to be a full block, and if data is inserted which is not an integral multiple of the block size, odd sized blocks need to be created. This is because the server cannot reassemble the blocks, because the contents are opaque to the server.
+
+
+SUBTITLE Modified algorithm
+
+To produce a list of the changes to send the new version, the client requests the block index of the file. This is the same step as requesting the weak and strong checksums from the remote side with rsync.
+
+The client then decrypts the index, and builds a list of the 8 most used block sizes above a certain threshold size.
+
+The new version of the file is then scanned in exactly the same way as rsync for these 8 block sizes. If a block is found, then it is added to a list of found blocks, sorted by position in the file. If a block has already been found at that position, then the old entry is only replaced by the new entry if the new entry is a "better" (bigger) match.
+
+The block size covering the biggest file area is searched first, so that most of the file can be skipped over after the first pass without expensive checksumming.
+
+A "recipe" is then built from the found list, by trivially discarding overlapping blocks. Each entry consists of a number of bytes of "new" data, a block start number, and a number of blocks from the old file. The data is stored like this as a memory optimisation, assuming that files mostly stay the same rather than having all their blocks reordered.
+
+The file is then encoded, with new data being sent as blocks of data, and references to blocks in the old file. The new index is built completely, as the checksums and size need to be re-encrypted to match their position in the index.
+
+
+SUBTITLE Combination on server
+
+The "diff" which is sent from the client is assembled into a full file on the server, simply by adding in blocks from the old file where they are specified in the block index.
+
+
+SUBTITLE Storage on server
+
+Given that the server will in general store several versions of a file, combining old and new files to form a new file is not terribly efficient on storage space. Particularly for large multi-Gb database files.
+
+An alternative scheme is outlined below, however, it is significantly more complex to implement, and so is not implemented in this version.
+
+1) In the block index of the files, store the file ID of the file which each block is sourced from. This allows a single file to reference blocks from many files.
+
+2) When the file is downloaded, the server combines the blocks from all the files into a new file as it is streamed to the client. (This is not particularly complicated to do.)
+
+This all sounds fine, until housekeeping is considered. Old versions need to be deleted, without losing any blocks necessary for future versions.
+
+Instead of just deleting a file, the server works out which blocks are still required, and rebuilds the file omitting those blocks which aren't required.
+
+This complicates working out how much space a file will release when it is "deleted", and indeed, adds a whole new level of complexity to the housekeeping process. (And the tests!)
+
+The directory structure will need an additional flag, "Partial file", which specifies that the entry cannot be built as previous blocks are no longer available. Entries with this flag should never be sent to the client.
+
+
+

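The found-block list and "recipe" described in the modified algorithm section of encrypt_rsync.txt above can be illustrated with a short Python sketch. The data layout here is an assumption made for the example (a dict keyed by byte position, recipe entries as tuples, and a final entry with a block count of zero for trailing new data); the real structures are in lib/backupclient/BackupStoreFile*.*.

```python
def record_match(found, position, size, old_block):
    """Remember a matching old block at a byte position in the new file.

    An existing entry at the same position is only replaced by a bigger one.
    """
    current = found.get(position)
    if current is None or size > current[0]:
        found[position] = (size, old_block)


def build_recipe(found, file_size):
    """Turn the found list into (new_bytes, first_old_block, block_count)."""
    recipe = []
    cursor = 0  # first byte of the new file not yet covered by the recipe
    for position in sorted(found):
        size, old_block = found[position]
        if position < cursor:
            continue  # overlaps a block already used, so discard it
        gap = position - cursor  # bytes of new data before this block
        last = recipe[-1] if recipe else None
        if gap == 0 and last is not None and last[1] + last[2] == old_block:
            last[2] += 1  # extends the previous run of consecutive old blocks
        else:
            recipe.append([gap, old_block, 1])
        cursor = position + size
    if cursor < file_size:
        # Assumed convention: trailing new data with no old blocks after it.
        recipe.append([file_size - cursor, 0, 0])
    return [tuple(entry) for entry in recipe]
```

A file that is mostly unchanged collapses into a handful of entries, each covering a long run of old blocks, which is the memory optimisation the notes describe.
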
Copied: box/trunk/docs/api-notes/lib_backupclient.txt (from rev 2474, box/trunk/docs/api-notes/backup/lib_backupclient.txt)
===================================================================
--- box/trunk/docs/api-notes/lib_backupclient.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/lib_backupclient.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,46 @@
+TITLE lib/backupclient
+
+Classes used on the client and on the server.
+
+See documentation in the files for more details.
+
+
+SUBTITLE BackupStoreDirectory
+
+The directory listing class, containing a number of entries, representing files.
+
+
+SUBTITLE BackupStoreFile
+
+Handles compressing and encrypting files, and decoding files downloaded from the server.
+
+
+SUBTITLE BackupStoreFilename
+
+An encrypted filename.
+
+
+SUBTITLE BackupStoreFilenameClear
+
+Derived from BackupStoreFilename, but with the ability to encrypt and decrypt filenames. Client side only.
+
+
+SUBTITLE BackupClientFileAttributes
+
+Only used on the client -- the server treats attributes as blocks of opaque data.
+
+This reads attributes from files on discs, stores them, encrypts them, and applies them to new files.
+
+Also has a static function to generate filename attribute hashes given a struct stat and the filename.
+
+
+SUBTITLE BackupClientRestore
+
+Routines to restore files from the server onto the client filesystem.
+
+
+SUBTITLE BackupClientCryptoKeys
+
+This reads the key material from disc, and sets up the crypto for storing files, attributes and directories.
+
+

Copied: box/trunk/docs/api-notes/lib_backupstore.txt (from rev 2474, box/trunk/docs/api-notes/backup/lib_backupstore.txt)
===================================================================
--- box/trunk/docs/api-notes/lib_backupstore.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/lib_backupstore.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,30 @@
+TITLE lib/backupstore
+
+Classes which are shared amongst the server side store utilities, bbstored and bbstoreaccounts. Note that it also depends on lib/backupclient, as a lot of code is shared between the client and server.
+
+
+SUBTITLE BackupStoreAccountDatabase
+
+A simple implementation of an account database. This will be replaced with a more suitable implementation.
+
+
+SUBTITLE BackupStoreAccounts
+
+An interface to the account database, and knowledge of how to initialise an account on disc.
+
+
+SUBTITLE BackupStoreConfigVerify
+
+The same configuration file is used by all the utilities. This is the configuration verification structure for this file.
+
+
+SUBTITLE BackupStoreInfo
+
+The "header" information about an account, specifying current disc usage, space limits, etc.
+
+
+SUBTITLE StoreStructure
+
+Functions specifying how the files are laid out on disc in the store.
+
+

Copied: box/trunk/docs/api-notes/raidfile/RaidFileRead.txt (from rev 2474, box/trunk/docs/api-notes/raidfile/lib_raidfile/RaidFileRead.txt)
===================================================================
--- box/trunk/docs/api-notes/raidfile/RaidFileRead.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/raidfile/RaidFileRead.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,14 @@
+CLASS RaidFileRead
+
+Read a raid file.
+
+IOStream interface, plus a few extras, including reading directories and checking that files exist.
+
+
+FUNCTION RaidFileRead::Open
+
+Open a given raid file -- returns a pointer to a new RaidFileRead object.
+
+Note that one of two types could be returned, depending on the representation of the file.
+
+

Copied: box/trunk/docs/api-notes/raidfile/RaidFileWrite.txt (from rev 2474, box/trunk/docs/api-notes/raidfile/lib_raidfile/RaidFileWrite.txt)
===================================================================
--- box/trunk/docs/api-notes/raidfile/RaidFileWrite.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/raidfile/RaidFileWrite.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,36 @@
+CLASS RaidFileWrite
+
+Interface to writing raidfiles.
+
+See IOStream interface.
+
+Some other useful functions are available; see the .h and .cpp files.
+
+
+FUNCTION RaidFileWrite::RaidFileWrite()
+
+The constructor takes the disc set number and filename of the file you're interested in.
+
+
+FUNCTION RaidFileWrite::Open()
+
+Open() opens the file for writing, and takes an "allow overwrite" flag.
+
+
+FUNCTION RaidFileWrite::Commit()
+
+Commit the file, and make it visible to RaidFileRead. If ConvertToRaidNow == true, it will be converted to raid file representation immediately.
+
+Setting it to false is not a good idea. In future, false will tell a daemon to convert the file in the background, but for now the file simply won't be converted.
+
+
+FUNCTION RaidFileWrite::Discard()
+
+Abort the creation/update. Equivalent to just deleting the object without calling Commit().
+
+
+FUNCTION RaidFileWrite::Delete()
+
+Delete a file -- there is no need to Open() it first.
+
+

Copied: box/trunk/docs/api-notes/win32_build_on_cygwin_using_mingw.txt (from rev 2474, box/trunk/docs/api-notes/backup/win32_build_on_cygwin_using_mingw.txt)
===================================================================
--- box/trunk/docs/api-notes/win32_build_on_cygwin_using_mingw.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/win32_build_on_cygwin_using_mingw.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,53 @@
+How to build Box Backup on Win32 using Cygwin and MinGW
+By Chris Wilson, 2007-05-26
+
+(To read this document online with better formatting, browse to:
+http://www.boxbackup.org/trac/wiki/CompileWithMinGW)
+
+Start by installing Cygwin on your Windows machine [http://www.cygwin.org].
+Make sure to select the following packages during installation:
+
+* Devel/gcc-mingw
+* Devel/gcc-mingw-core
+* Devel/gcc-mingw-g++
+* Mingw/mingw-zlib
+
+If you already have Cygwin installed, please re-run the installer and
+ensure that those packages are installed.
+
+Download OpenSSL from 
+[http://www.openssl.org/source/openssl-0.9.7i.tar.gz]
+
+Open a Cygwin shell, and unpack OpenSSL:
+
+	tar xzvf openssl-0.9.7i.tar.gz
+
+Configure OpenSSL for MinGW compilation, and build and install it:
+
+	cd openssl-0.9.7i
+	./Configure --prefix=/usr/i686-pc-mingw32/ mingw
+	make
+	make install
+
+Download PCRE from 
+[http://prdownloads.sourceforge.net/pcre/pcre-6.3.tar.bz2?download]
+
+Open a Cygwin shell, and unpack PCRE:
+
+	tar xjvf pcre-6.3.tar.bz2
+
+Configure PCRE for MinGW compilation, and build and install it:
+	
+	cd pcre-6.3
+	export CFLAGS="-mno-cygwin"
+	./configure
+	make winshared
+	cp .libs/libpcre.a .libs/libpcreposix.a /usr/lib/mingw
+	cp pcreposix.h /usr/include/mingw
+
+Now unpack the Box Backup sources, enter the source directory,
+and configure like this:
+
+	./infrastructure/mingw/configure.sh
+	make
+

Copied: box/trunk/docs/api-notes/win32_build_on_linux_using_mingw.txt (from rev 2474, box/trunk/docs/api-notes/backup/win32_build_on_linux_using_mingw.txt)
===================================================================
--- box/trunk/docs/api-notes/win32_build_on_linux_using_mingw.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/win32_build_on_linux_using_mingw.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,108 @@
+How to build Box Backup for Windows (Native) on Linux using MinGW
+By Chris Wilson, 2005-12-07
+
+Install the MinGW cross-compiler for Windows:
+
+- Debian and Ubuntu users can "apt-get install mingw32"
+- Fedora and SuSE users can download RPM packages from 
+  [http://mirzam.it.vu.nl/mingw/]
+
+You will need to know the prefix used by the cross-compiler executables.
+It will usually be something like "ix86-mingw32*-". All the binaries in the
+cross-compiler package will start with this prefix. The documentation below
+assumes that it is "i386-mingw32-". Adjust to taste.
+
+You will also need to install Wine and the Linux kernel "binary formats"
+(binfmt) support, so that you can run Windows executables on Linux,
+otherwise the configure scripts will not work properly with a cross-compiler.
+On Ubuntu, run:
+
+	apt-get install wine binfmt-support
+	/etc/init.d/binfmt-support start
+
+Start by downloading Zlib from [http://www.zlib.net/], then unpack it and enter
+the source directory:
+
+	export CC=i386-mingw32-gcc 
+	export AR="i386-mingw32-ar rc" 
+	export RANLIB="i386-mingw32-ranlib"
+	./configure
+	make
+	make install prefix=/usr/local/i386-mingw32
+
+Download OpenSSL 0.9.8b from 
+[http://www.openssl.org/source/openssl-0.9.8b.tar.gz]
+
+Unpack and configure:
+
+	tar xzvf openssl-0.9.8b.tar.gz
+	cd openssl-0.9.8b
+	./Configure --prefix=/usr/local/i386-mingw32 mingw
+	make makefile.one
+	wget http://www.boxbackup.org/svn/box/chris/win32/support/openssl-0.9.8b-mingw-cross.patch
+	patch -p1 < openssl-0.9.8b-mingw-cross.patch
+	make -f makefile.one
+	make -f makefile.one install
+
+Download PCRE from 
+[http://prdownloads.sourceforge.net/pcre/pcre-6.3.tar.bz2?download]
+
+Unpack:
+
+	tar xjvf pcre-6.3.tar.bz2
+	cd pcre-6.3
+
+Configure and make:
+
+	export AR=i386-mingw32-ar
+	./configure --host=i386-mingw32 --prefix=/usr/local/i386-mingw32/
+	make winshared
+
+If you get this error:
+
+	./dftables.exe pcre_chartables.c
+	/bin/bash: ./dftables.exe: cannot execute binary file
+	make: *** [pcre_chartables.c] Error 126
+
+then run:
+
+	wine ./dftables.exe pcre_chartables.c
+	make winshared
+
+to complete the build. Finally:
+
+	cp .libs/libpcre.a /usr/local/i386-mingw32/lib
+	cp .libs/libpcreposix.a /usr/local/i386-mingw32/lib
+	cp pcreposix.h /usr/local/i386-mingw32/include
+
+You will need to find a copy of mingwm10.dll that matches your cross-compiler.
+Most MinGW distributions should come with it. On Debian and Ubuntu, for some
+bizarre reason, you'll find it compressed as
+/usr/share/doc/mingw32-runtime/mingwm10.dll.gz, in which case you'll
+have to un-gzip it with "gzip -d". Copy it to a known location, e.g.
+/usr/local/i386-mingw32/bin.
+
+Download and extract Box Backup, and change into the base directory,
+e.g. boxbackup-0.11rc2. Change the path to mingwm10.dll in parcels.txt to
+match where you found or installed it.
+
+Now configure Box with:
+
+	./configure --host=i386-mingw32 \
+		CXXFLAGS="-mthreads -I/usr/local/i386-mingw32/include" \
+		LDFLAGS=" -mthreads -L/usr/local/i386-mingw32/lib" \
+		LIBS="-lcrypto -lws2_32 -lgdi32"
+	make
+
+or, if that fails, try this:
+
+	export CXX="i386-mingw32-g++"
+	export AR=i386-mingw32-ar
+	export RANLIB=i386-mingw32-ranlib
+	export CFLAGS="-mthreads"
+	export CXXFLAGS="-mthreads"
+	export LDFLAGS="-mthreads"
+	export LIBS="-lcrypto -lws2_32 -lgdi32"
+	(if you don't have a "configure" file, run "./bootstrap")
+	./configure --target=i386-mingw32
+	make CXX="$CXX" AR="$AR" RANLIB="$RANLIB" WINDRES="i386-mingw32-windres"

Copied: box/trunk/docs/api-notes/windows_porting.txt (from rev 2474, box/trunk/docs/api-notes/backup/windows_porting.txt)
===================================================================
--- box/trunk/docs/api-notes/windows_porting.txt	                        (rev 0)
+++ box/trunk/docs/api-notes/windows_porting.txt	2009-03-28 15:51:04 UTC (rev 2478)
@@ -0,0 +1,100 @@
+TITLE Notes on porting the backup client to Windows
+
+It should be relatively easy to port the backup client (bbackupd) to Windows. However, the server relies on unlink() behaviour of the UNIX filesystem which is different on the Windows platform, so it will not be so easy to port.
+
+An installation of perl is required to build the system. The ActiveState port is the easiest to install.
+
+
+SUBTITLE Build environment
+
+The build environment generation script, makebuildenv.pl, uses perl to scan all the files and generate makefiles. It's rather orientated towards UNIX systems with gcc.
+
+Probably the easiest way to handle this is to detect the Windows platform, set up a few variables appropriately (in BoxPlatform.pm) and then post-process the generated Makefiles to mould them into something more handy for the MS Windows compiler and toolchain.
+
+The script itself is a bit messy. It was fine at first, but then the multi-platform thing got in the way. I do intend to rewrite it at some point in the future.
+
+Make sure your new version defines PLATFORM_WIN32 on the compile line.
+
+All files #include "Box.h" as the first include file. Use this for pre-compiled headers. Edit BoxPlatform.h to include the Windows headers required, and include a PLATFORM_WIN32 section. The easiest start will be to leave this blank, apart from typedefs for the basic types and any "not supported" #defines you can find.
+
+Boring bits of the code, such as exceptions and protocol definitions, are autogenerated using perl scripts. This code should be portable without modification.
+
+I have tried to avoid the things I know won't work with the MS compiler, so hopefully the code will be fairly clean. However, it might be a little easier to use the MinGW compiler [ http://www.mingw.org/ ] just to be consistent with the UNIX version. But I hope this won't be necessary.
+
+You'll need the latest version of OpenSSL. This was slightly difficult to get to compile last time I tried -- especially if you're determined to use the optimised assembler version. The main difficulty was getting a version which would link properly with the options used in my project, as the default libraries selected got in the way.
+
+
+SUBTITLE Porting as UNIX emulation
+
+Since the daemon uses so few UNIX system calls and with a limited set of options, it seems to make sense to port it by writing emulations of these functions. It's probably nicest to create a lib/win32 directory, and populate this with header files corresponding to the UNIX header files used. These just contain inline functions which map the UNIX calls to Win32 calls.
+
+File/socket handles may have to be translated. -1 is used as a failure return value and by the code internally to mark an invalid socket handle. (0 is a valid socket handle.)
+
+Of course, some bits of code aren't relevant, so will just be #ifdefed out, or replaced. But this should be minimal. (Only perhaps the small bit relating to filesystem structure -- there aren't really mount points as such.)
+
+
+SUBTITLE File tracking
+
+The daemon uses the inode number of a file to keep track of files and directories, so when they're renamed they can be moved efficiently on the store. Some unique (per filesystem) number will have to be found and used instead.
+
+It uses the Berkeley DB to store these on disc. It's likely another storage system will need to be used. (It just has to map the file's unique number to an 8 byte struct.)
+
+There is an in-memory implementation for platforms which don't support Berkeley DB, but this isn't so good when the daemon has to be restarted, as all the tracking is lost. But it's an easy start.
+
+
+SUBTITLE Expected filesystem behaviour
+
+Files and directories have (at least) two modification times, for contents and attributes.
+
+For files, the contents modification time must change when the contents change, and the attributes time when the attributes change (and may change when the contents change too.)
+
+For directories, the contents modification time must change when files or directories are deleted or added. If it changes any more frequently than this, then the client will be slightly less efficient -- it will download the store's directory listing whenever this time changes. The attributes modification time is less important, as the actual attributes are compared and only uploaded if different.
+
+
+SUBTITLE Attributes
+
+Attributes means file modification times, flags, and filesystem permissions.
+
+The BackupClientFileAttributes class will need to be extended. Allocate another "attribute type" for the Win32 attributes, and then serialise it in a compatible way -- put your new attribute type in the header, and then a serialised network byte order structure in the rest. The different size of block is handled for you, and the server never looks inside.
+
+Add code so that under UNIX, Win32 attributes are ignored, and under Win32, UNIX attributes are ignored.
+
+It's probably not necessary to worry too much about these for the first version. Not many people seem to use these attributes anyway.
+
+
+SUBTITLE Times
+
+The system uses its own 64 bit time type -- see BoxTime.h. Everything is translated to this from the various different system time types, and calculated and stored internally in this form.
+
+
+SUBTITLE Daemon as a Service
+
+The client is derived from the Daemon class, which implements a daemon. The interface is simple, and it shouldn't be hard to write a compatible class which implements a Windows Service instead.
+
+Or cheat and run it as a Win32 application.
+
+Note that the daemon expects to be able to read every file it wants, and will abort a scan and upload run if it gets an error. The daemon must therefore be run with sufficient privileges. It runs as root under UNIX.
+
+
+SUBTITLE Command Socket
+
+The backup daemon accepts commands from bbackupctl through a UNIX domain socket. When a connection is made, the user ID of the connecting process is checked to see if it's the same user ID as the daemon is running under.
+
+This may not have any exact analogue under Win32, so another communications scheme may have to be devised.
+
+This is only actually necessary if the client is to be run in snapshot mode. It can be safely left unimplemented if snapshot mode is not required, or the prompts for it to sync with the server are implemented some other way.
+
+
+SUBTITLE NTFS streams
+
+If you want to back up NTFS streams, then a generic solution should probably be defined, so that the Mac OS X resource forks can be backed up with the same mechanism.
+
+
+SUBTITLE Source code
+
+I work on a slightly different version of the source files. A make distribution script adds the license header and removes private sections of code. This means submitted diffs need a slight bit of translation.
+
+
+
+
+
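The "Times" section of windows_porting.txt says every system time type is translated into one internal 64 bit form. For a Windows port, the system type to translate would be FILETIME. The following is a minimal sketch of that translation, not code from Box Backup: the function name is hypothetical, and it assumes the internal type counts microseconds since the UNIX epoch, which should be checked against BoxTime.h. It takes the FILETIME value as a single 64 bit integer (the two 32 bit halves already combined), so it does not need the Windows headers.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only; not part of Box Backup.
//
// A Win32 FILETIME counts 100 nanosecond ticks since 1601-01-01 UTC.
// ASSUMPTION: the internal 64 bit time counts microseconds since
// 1970-01-01 UTC. Check BoxTime.h before relying on this.

// Seconds between 1601-01-01 and 1970-01-01.
constexpr std::uint64_t kEpochDifferenceSeconds = 11644473600ULL;
// Number of 100 ns ticks in one second and in one microsecond.
constexpr std::uint64_t kTicksPerSecond = 10000000ULL;
constexpr std::uint64_t kTicksPerMicrosecond = 10ULL;

std::uint64_t FileTimeTicksToMicroseconds(std::uint64_t ticks)
{
	const std::uint64_t epochDifferenceTicks =
		kEpochDifferenceSeconds * kTicksPerSecond;

	// Times before 1970 cannot be represented in an unsigned count
	// from the UNIX epoch, so clamp them to zero.
	if (ticks < epochDifferenceTicks)
	{
		return 0;
	}

	return (ticks - epochDifferenceTicks) / kTicksPerMicrosecond;
}
```

A caller on Windows would combine dwHighDateTime and dwLowDateTime into one 64 bit value before calling the function. Clamping pre-1970 times to zero is a design choice of this sketch; a real port might prefer to report an error instead.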



