A discussion of the new dependency code, version 0.3 by Michael Meeks and Jody Goldberg 1. Overview of the Dependencies The dependency code is designed to determine which objects depend on which cells, or regions. The main use of this is triggering recomputation for things that depend on something that has changed. The majority of the code related to dependencies can be found in module dependent.c. 1.1 Data structures and their meaning Dependency data is anchored on a per-sheet basis using GnmDepContainer. This stores all the dependencies for the sheet in two hash tables. In the future other objects may also become containers, such as graphs in their own tab. The two main types of dependencies, are single and range. Loosely these describe single cells (ie. = A1) vs. regions (ie. = SUM (A1:Z500)) The range_hash and single_hash tables use DependencyRange and DependencySingle to store lists of Dependents (not necessarily in this container) that depend on things in the container. The range_hash table is hashed on the normalized range, and the other hashes on the GnmCellPos of the singleton. Sheet does not come into play because it is implicit in the container. DependencySingle stores stores the dependents that rely on individual cells. This makes lookup easy. Only 1 DependencySingle will exist for a given cell. We can find that quickly via the hash. DependencyRange cannot be searched via the hash. A target cell may be contained by several regions. Finding which regions requires traversing the list of regions looking for intersections. See below. 2. GnmDependent Evaluation The routine dependent_eval will evaluate a dependent and if its value changes it will queue cells that depend on it, these are obtained via cell_foreach_dep, which looks up in the single_hash and traverses the range_hash to obtain its result. Often cell recalculation happens through workbook_recalc which works through each sheets dependents, re-evaluating any that are marked for calculation. 2.1 Evaluation queue vs. recursion There are two ways in which a cell's recalculation can occur. Firstly simply traversing the GnmExpr (expr.h) will result in many cells being evaluated. Essentially each dereference in the GnmExpr (eg. =A1) will cause the re-calculation of A1's GnmExpr before we can continue (recursively). This is actually fairly expensive when the expression contains a reference like =sum(a1:a65535) 3. Dependencies the bottleneck Since dependencies tend to fan out, and massively increase the area that needs to be recomputed, it is clearly advantageous to spend time culling the dependency tree as intelligently as possible. Furthermore since for each cell that changes it is necessary to determine its dependencies the lookup of a cell's dependencies needs to be fast. 3.1 Why two methods First, consider the case where only range dependencies are used, this is a fairly simple first implementation. If we have N cells that have random dependencies on one other cell, then we will have approx N ranges in the range hash. For each cell we re-calculate we need to iterate over the entire range hash to determine its dependencies. Hence we have a fundamentally O(N^2) algorithm, this is very bad news. This scheme spends almost all of its time in search_cell_deps. To overcome this problem we partition dependencies into single cell dependencies and range dependencies. This way for the common =A1 case, we don't add an entry in the range_hash, we simply add an entry in the simple_hash. Hence for the cell_foreach_dep we have one less entry in the range hash to iterate over, which saves N iterations of search_cell_deps. Another common case is having a lot of formulae sharing the same range in the sheet as a dependency (eg. cutting and pasting = SUM($A$1:$A$100)). In this case there is a single dependency range entry with many cells listed as its dependencies. 3.1.1 bucketing range depends Searching the list of range depends is still expensive if there are a lot of them. eg in the situation where an expression like =sum(b1:d1) is duplicated for the entire 1st column. We end up with a huge number of non-overlapping regions to search through. That can be ameliorated by bucketing the range depends. Giving only a few columns to each bucket. FUTURE : it would be useful to notice that a larger range contains a smaller one, and to build up a tree as necessary. This would also decrease the size of the list. 3.2 Inter-sheet dependencies Inter sheet dependencies are managed simply by inserting the dependency information into the sheet in which the cells that are dependent on reside. This is essentially exactly what is expected, given that cell's are linked to the cell they depend on. Removing inter-sheet dependencies is also identical to normal dependencies, excepting that it is more likely to throw up formulae that have not been correctly unlinked in a sheet destroy. As an optimization we flag inter-sheet and inter-book dependencies. If we are destroying an entire sheet or workbook there is no need to be neat and tidy unlinking individual depends when the whole structure is going to be deleted. 3.3 What is hashed While the two hashes (range_hash, simple_hash) are both GHashTables what is hashed is quite different. 3.3.1 What does the range hash do ? The hashing on the range_hash is merely used to determine if there is already a range in the hash with the same dimensions as a new dependency range being added. This accelerates insertion of dependencies, the hash is traversed as a simple un-ordered list at all other times. 3.3.2 Why not a direct Cell * -> set mapping for DependencySingle ? This is not done since there is no guarantee that cells that have dependencies are in existence yet. Hence it is quite valid for A1 to be '=A2' before A2 exists. If A2 does not exist then A2 has no Cell structure yet. This could be obviated by creating depended cells, but this would be inelegant. 3.3.3 How are depends stored As of 1.1.x we enabled switched from storing lists of dependents in the DependentSingle and DependentRange. That was a bottleneck when ensuring uniqueness if a lot of things depended on the same region directly. The list was replaced with a 'microHash'. A simple hash table that looks like a list for small sizes. It is derived from GHashTable in glib, but simplified to make it lighter weight. 3.4 3D dependencies Some expressions depend on the order of the sheets in a workbook. When linking the expressions we notice this and flag the dependent before adding it to a collection in the workbook. The workbook is responsible of linking and unlinking those dependents when the sheet order changes. 3.5 Named expressions Named expressions are special. In the current implementation we do not differentiate between names that use relative references and those that don't. Names cannot be dependents because relative references change value depending on where they are evaluated. However, if a sheet is removed we need to know to invalidate a name that references it. The 'referencing_names' member of the collection tracks that. 3.6 Dynamic depends Functions like INDEX, INDIRECT and some of the range operators produce new ranges at run time. As a result these are not registered when an dependent is linked. The DynamicDep structure is a special form of GnmDependent that is used to collect and remove the dependencies as they get generated. When a dependent with dynamic depends is evaluated or unlinked the list is cleared. The process of evaluation then repopulates the set. 4. The life cycle of dependency records A GnmDependent is 'linked' if its dependencies on other things have been registered with its container, and 'unlinked' if not. Linking takes place by traversing the expression tree and registering the single, range, and 3d references. 4.3.1 Implicit intersection This is as yet unimplemented, but will further reduce the number of ranges to clip against. Essentially an implicit intersection reduces a range to an adjacent single reference under certain circumstances. 4.3.2 Array Formula These luckily have a simple dependency structure, since the formula stored is identical in each cell, the cells may all depend on the corner cell using a fast single mapping. 5 Future Expansion There are several avenues open for future expansion. 5.1 DependencyRange trees or quad-trees Accelerating the range search will give big speedups on large sheets. The current row based bucketing approach seems to work nicely, but could be augmented by using trees or quad trees to handle containment better. 5.2 Multi-threading, With the current structure, it might well be possible to add multi- threading support to the evaluation engine. The structure of this would take advantage of the partitioning already provided by the sheet boundary. To do this it would be necessary to move the eval_queue to a per-sheet entity, and putting a locking / signaling mechanism on the queue such that inter-sheet dependencies could be prepended to the queue (thus ensuring rapid evaluation), and waited on safely. Since each cell is evaluated but once per re-calc, it would then be safe to read the Cell's value when it dissapeared from the eval_queue. 5.3 GnmExpr recursion Whether it is always entirely necessary to re-evaluate a cell solely on the basis that it is in the GnmExpr is non-obvious to me. Clearly if this cell is in the dependency queue it would make perfect sense, however if there is as yet no chance that this cell has been changed, it makes little sense to re-calculate it (and its tree'd dependencies). The only problem here is determining whether any of the currently queued dependencies would alter this cell's dependencies.