[Box Backup-dev] Future work

Sun Feb 26 19:20:01 GMT 2006

On 25 Feb 2006, at 13:11, Martin Ebourne wrote:

> On Fri, 2006-02-24 at 14:37 +0000, Ben Summers wrote:
>> I have written up some notes about how I'd like to see the project
>> develop:
>>
>>     http://boxbackup.hostworks.ca/index.php/0.20_redesign
>>
>> Comments welcomed!
>
> File selection
> - Agreed
>
> Upload engine
> - Agreed
>
> Streams in files
> - I prefer properly supporting multiple streams, although I'm not
> against the multiple files idea if we can demonstrate it is
> significantly simpler.

I thought it best to support streams "properly". Whether they're done  
as multiple references to files within a single entry, multiple  
entries, or streams within the file, is a difficult choice. Multiple  
entries is a bit nasty though.

[snip]
>
>
>         Filenames are no longer encrypted as separate objects, as this
>         was shown to be a pain and annoyingly inefficient. Instead, the
>         directory is stored as a single encrypted stream. This is
>         possible as the store no longer needs to modify them. However,
>         the referred object IDs need to be stored in the clear so the
>         server can use them, but of course, these are included in the
>         signed data so the server can't modify them without detection.
> - Mostly agreed, though I'd be concerned about the use case where there
> is a directory with many entries which has small changes to it. Think
> news spool, Maildir directories, etc. Ideally wouldn't have to reupload
> the whole thing on every change.

Good point -- maybe difficult to achieve though.

I wonder if there's a neat way to do this by using the existing  
diffing infrastructure for the files themselves? Or whether this just  
becomes nasty.
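
To make the concern concrete, here is a minimal sketch of an entry-level
directory delta. It is not Box Backup's existing diffing code, which works
on file contents; the function names and the (object_id, mtime, size) tuple
layout are assumptions made purely for illustration.

```python
def diff_directory(old, new):
    """Return (changed, removed) between two directory listings.

    Each listing maps a filename to a tuple of attributes, for example
    (object_id, mtime, size). This layout is hypothetical.
    """
    # Entries that are new, or whose attributes differ from the old listing.
    changed = {name: attrs for name, attrs in new.items()
               if old.get(name) != attrs}
    # Entries that existed before but are gone now.
    removed = sorted(name for name in old if name not in new)
    return changed, removed


def apply_delta(old, changed, removed):
    """Rebuild the new listing from the old listing plus a delta."""
    result = {name: attrs for name, attrs in old.items()
              if name not in removed}
    result.update(changed)
    return result
```

With a Maildir of thousands of messages and one new arrival, the delta
holds a single entry. If the directory is stored as one encrypted stream,
the delta would presumably have to be applied by the client at restore
time, since the server could not read either side.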

>
>         Attributes are only stored in the directory, never in the file.
>         (Although, what about xattrs, which could be "big"? But
>         relevant, because they can store ACLs.)
> - Attributes includes things such as mtime and size?

Yes.

> Presumably we don't
> back up atime?

For completeness, I think we do.

> I'd be concerned that we'd upload the whole directory
> object just for one frequently changing file.

Yes...

>
>         Each time bbackupd connects, it makes a new backup set, marked
>         with the current time. Rules are specified to the server (or
>         client?) to say when a backup set can be automatically deleted.
>         By default bbackupquery will use the latest backup set, but can
>         be instructed to use a different one.
> - Don't really like this. Doesn't sound like it works well for lazy
> mode, which you confirmed below.
>
> Lazy mode
>         This was the only mode the original supported. It has its pros
>         and cons. The original decision was made to ensure a slow
>         trickle of data across a broadband connection.
>         The above may make supporting lazy mode a bit tricky, and we
>         should decide whether we want to keep it, and if so, in what
>         form. For example, directory entries may need to be marked as
>         "changed but not included", prompting the restore to look in
>         future versions.
> - Lazy mode is essential for me. It's my favourite feature of Box and is
> what makes it ideal for backing up over a DSL link. With the addition of
> inotify (or equivalent file notification support) and restart resume it
> will finally work perfectly. I can't see the point in going to all the
> trouble of adding inotify support if we then just end up batching it up
> at the other end of bbackupd instead!
>
> Rather than making it messy, I think we'd be better off coming up with a
> design whereby lazy mode worked very well. I was thinking of something
> like your grafts, but each would have a time range, i.e. when you fetch
> an object it has multiple possibilities, all with a valid from/valid to
> range.
>
> You could go into the store and enter any time as view time and then for
> every object you saw you'd get the version valid at that moment.
>
> Snapshot mode would be identical, with the addition that every time it
> did an upload it would record the time in a list. Then it would be
> simple to select the view time from the list of snapshots.

I've been thinking about this (which is why I didn't reply very quickly)
and it seems like there are a few conflicting design goals here.

Basically, the snapshots and the lazy mode stuff conflict, and make  
the reference counting store more difficult. To do them both well  
seems to require a conflicting infrastructure.

I suppose you could have a sort of transient "current" hierarchy,  
which records the lazy uploads and work in progress of the snapshots,  
then the client "solidifies" it at intervals into a snapshot of the  
directory structure, which is signed and everything.

Maybe we could keep the existing directory entries, and then have  
separate signed versions for the backup sets? We could still go  
reference counted with this.

Or we could simply make backup sets as cheap as possible, and not
worry about it. Lazy mode just means you get one set every hour.
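
For reference, the time-range idea quoted above might look roughly like
this. It is only a sketch under assumptions: the class names, integer
timestamps and half-open [valid_from, valid_to) ranges are all invented
here, and deletions and reference counting are left out.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    object_id: int
    valid_from: int                  # upload time, seconds since epoch
    valid_to: Optional[int] = None   # None means "still current"


@dataclass
class Entry:
    versions: List[Version] = field(default_factory=list)

    def upload(self, object_id, now):
        """Record a new version, closing the range of the previous one."""
        if self.versions:
            self.versions[-1].valid_to = now
        self.versions.append(Version(object_id, now))

    def at(self, view_time):
        """Return the version valid at view_time, or None if there is none."""
        for v in self.versions:
            if v.valid_from <= view_time and (
                    v.valid_to is None or view_time < v.valid_to):
                return v
        return None


def snapshot_at_or_before(snapshots, view_time):
    """Pick the newest recorded snapshot time not later than view_time.

    snapshots must be a sorted list of timestamps.
    """
    i = bisect_right(snapshots, view_time)
    return snapshots[i - 1] if i else None
```

In this picture lazy mode just calls upload() whenever a file is sent,
and snapshot mode additionally appends the upload time to the snapshots
list, so a restore picks a view time from that list and calls at() for
every entry.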

>
>> Also, there's a mini-project suitable for one developer to do
>> independently of everything else, making the raidfile support better
>> and efficient in a cluster of three store servers:
>>
>>    http://boxbackup.hostworks.ca/index.php/Raidfile_improvements
>>
>> Anyone up for it?
>
> I'm not convinced of the advantage of building in raid support. I prefer
> to back up to two stores on different machines rather than raid.
> Obviously for companies providing the service raid would be well worth
> having, but raid is so available these days that you can guarantee
> they've got it already.

It's not wonderful on all platforms, though.

>
> Maybe the three server cluster is a bonus, but there might be a much
> less complicated way of achieving that rather than using raid.
>
>
> What I'm up for:
> - inotify support in linux, as I already stated. Hence I'd be happy to
> join in with that section of redesign to abstract the file searching
> interface out. If we decide this would also benefit from a db backend
> then I guess I'll help on that too.

Good!

>
> - Switching to ostreams to remove the hundreds of warnings and increase
> robustness. This will probably also include changing the logging as
> Chris has already suggested. Ideally this should be in 0.11.

Ah yes, that would be a nice thing to have for the 0.11 tidy-up  
release. Also helps Win32, I think, with type safety.

>
> - Making the underlying box libraries into shared libs so that other
> projects (eg. boxi) can use them easily without having to bastardise the
> box source. Other possible build things such as supporting PREFIX etc
> properly.

That might make using it as a generic framework a bit nicer.

>
> - Changing the store to allow simple point-in-time retrieval, as
> discussed above. This could be quite a big job though and like everyone
> else I don't have a whole lot of time, but I'll try and help out with
> this one.

I think this one just needs a cunning plan. It's probably possible  
now, with clever algorithms to select which is the current file, but  
messy.

But signed directories would be a really big win.
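
As a rough sketch of what "signed" could mean for the directory format in
the design note quoted earlier: the object IDs stay in the clear for the
server, the listing itself is an opaque encrypted blob, and one
authentication tag covers both. HMAC-SHA256 is used here only as a
stand-in, since the notes do not say which signing scheme is intended,
and all function names are hypothetical.

```python
import hashlib
import hmac
import struct


def _signed_payload(object_ids, encrypted_listing):
    # Prefix the ID count so the boundary between the cleartext IDs and
    # the ciphertext is unambiguous.
    header = struct.pack(">I", len(object_ids))
    ids = b"".join(struct.pack(">Q", i) for i in object_ids)
    return header + ids + encrypted_listing


def sign_directory(key, object_ids, encrypted_listing):
    """Return a tag covering both the cleartext IDs and the ciphertext."""
    payload = _signed_payload(object_ids, encrypted_listing)
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify_directory(key, object_ids, encrypted_listing, tag):
    """Check the tag in constant time; False means tampering or wrong key."""
    expected = sign_directory(key, object_ids, encrypted_listing)
    return hmac.compare_digest(expected, tag)
```

The server can read object_ids for housekeeping, but if it alters either
the IDs or the encrypted listing, the client's verify_directory() call
fails on the next fetch.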

Interesting that no one has posted any thoughts on the licensing
question.

Ben