Interfacing with R

Background on R

For the official R language definition, see this document. Much of the information below, however, comes from our experience with the current implementation of R (version 2.8.1), the document on R Internals, and discussions on the R-devel mailing list.

Objects

R provides a number of data structures, called objects, for accessing the data stored in memory. These data objects are referred to by symbols, which are themselves objects. So there are two categories of objects: symbol objects and data objects. As an example, when R executes a <- 1:10, a data object will be created for 1:10 while a symbol object will be created for a.

Binding

Data objects are bound to symbol objects, and R stores this binding information in an internal table. In the a <- 1:10 example, the data object for 1:10 is bound to the symbol object for a. If a[1] is accessed later, R will find the binding for a and retrieve the first element of the associated data object.

Copy on Write

R uses a technique called copy on write to optimize copying data. For example, b <- a should, conceptually, make copy of a and name it b. However, at this point, R does not make any physical copy of the data; instead, it just binds the symbol b to the same data object. A special field inside the data object, named, is set to 2 to indicate that the object is shared by multiple symbols. Later, when a or b is modified, e.g., b[1] <- 0, the named field of the data object is consulted. If its value is 2, R duplicates the data object before modification.

R uses this copy on write trick on function arguments as well. R's functions are call-by-value, and R evaluates function arguments lazily. For example, a vector argument would conceptual behave as if it is a local copy of the input vector. In reality, however, R simply binds the symbol object for this argument to the same data object. R will create a new copy of the data object if the function modifies this argument.

Note that the value of the named field is not a reference counter. Hence, the value 2 simply means that the object is possibly shared; R does not track how many symbol objects share this data object.

To summarize (using an example), the actions caused by the modification b[1] <- 0 are:

  1. Check the named field of the data object to be modified. If named=2, make a copy of the data object, and set named=0 for the copy (the original object's named remains 2).
  2. Modify the first element of the copied data object in place.
  3. Bind the symbol b to the modified data object.

Note: In the current version of R, the named field of the original copy is never be set back to 0 once it becomes 2, even if the data object is no longer shared after copying. Therefore, R might end up making more copies than necessary (though they are harmless from the point of view of correctness). For a specific example, consider:

a <- 1:10
b <- a
b[1] <- 0
a[1] <- 0

Ideally, since b[1] <- 0 already makes a copy of a's underlying data object, it would be safe not to copy it again when executing a[1] <- 0. In the current version of R, however, another copy will be created and associated with a, whereas the original copy will be orphaned.

De-Allocating Data Objects

One might wonder how R decides when to de-allocate a data object. Not just reference counting, but some sort of garbage collection.

Implications for RIOT

Things become more complicated for RIOT, where it is possible for data objects to refer to things externally managed by RIOT. We illustrate the problem using an example. Again, consider the following code:

a <- 1:n
# if n is very large, a will be a RIOT object
# instead of a regular R object
b <- a
b[1] <- 0

The above code works well on regular R objects. The following figure shows the object binding status as R executes the code line by line, as described in the background section above.

symbol_a        symbol_a   symbol_b        symbol_a   symbol_b
   |                \         /               |          |
   |      =>         \       /        =>      |          |
 data_a                data_a               data_a     data_b

RIOT complicates the picture, however. In RIOT, large arrays, matrices, etc., are no longer placed in memory, but stored on disk. For example, In RIOT-DB (an implementation of RIOT using a database backend), array data is stored in external database tables. In-memory R data objects no longer hold the actual data, but only metadata---information on where to locate the database tables. Accesses to the data objects are forwarded to the database tables. The intricacies caused by this one more level of indirection can be illustrated by the following figure.

When b <- a is executed, R's copy-on-write mechanism binds symbol_b to the same data object data_a, which points to the database table. The trouble comes when either a or b is modified, e.g., b[1] <- 0. We cannot use the same implementation for modification described above. Not knowing that data_a actually contains an external reference, this implementation would make an exact (and shallow) copy of data_a, so both data objects will point to the same database table table_a. This shallow copying will lead to wrong behavior: the modification to b will be forwarded to table_a, which causes a to be incorrectly modified.

symbol_a        symbol_a   symbol_b        symbol_a   symbol_b
   |                \         /               |          |
   |                 \       /                |          |
 data_a     =>         data_a        =>     data_a     data_b
   |                     |                     \         / 
   |                     |                      \       /
table_a               table_a                    table_a

The RIOT Solution

From the above discussion it is clear that RIOT must implement in-place modification differently in order to correctly handle copying of data objects with external references. To this end, RIOT provides its own implementations for in-place modification methods for RIOT data types. Specifically, in-place modification methods include subset assignment (e.g., b[1] <- 0) and resize (e.g., dim(M) <- c(5, 5)). Such implementations are readily introduced by the RIOT extension package using R's generics mechanism (more specifically, the setMethod function).

Assuming that R only copies a data object when making in-place modifications (true in the current implementation), RIOT maintains the following invariant: There is a one-to-one correspondence between RIOT data objects and external objects. In other words, there is no "aliasing" at the level of external objects---an external object cannot be shared by two data objects.

As an optimization, when "copying" an external object for an in-place modification, we do not need to literally copy the external object. Instead, the new external object can, for example, be represented as the composition of the old external object and some "delta." In particular, we can represent the modified version of a database table as a database view over the table storing the previous state and a "delta table" logging the modifications. This optimization does introduce dependencies among external objects, which are tracked by RIOT. Such dependencies are completely orthogonal to and should not be confused with reference counting for data objects.

Another issue is the de-allocation of external objects. Since external objects (e.g., tables and views in databases) may need to be explicitly de-allocated, RIOT also implements the finalizer method for RIOT data types. This method is automatically called when R destroys a data object. Inside this method, RIOT deletes the external object associated with the data object to be destroyed. The dependencies introduced by the optimization above, however, means that we might have external objects that cannot be de-allocated because they are still used in defining external objects associated with live data objects. To properly handle de-allocation of external objects, RIOT implements reference counting among external objects. This reference counting is completely orthogonal to and should not be confused with reference counting for data objects.

Finally, as discussed earlier, R does not have reference counting for data objects; the named field is not really a reference counter. As an extension to R, RIOT would inherit this problem---it is possible that RIOT creates more copies of data objects and (therefore) external objects than really necessary. This problem does not affect correctness, but does introduce some inefficiency. To address this issue, RIOT also provides a patch to R that performs exact reference counting to eliminate unnecessary copying of data objects and (therefore) external objects. This patch changes the internals of R, and therefore cannot be done at the package level. Note, however, that this patch is strictly for performance; it is not required for correctness.