For the official R language definition, see this document. Much of the information below, however, comes from our experience with the current implementation of R (version 2.8.1), the document on R Internals, and discussions on the R-devel mailing list.
R provides a number of data structures, called objects, for accessing the data stored in memory. These data objects are referred to by symbols, which are themselves objects. So there are two categories of objects: symbol objects and data objects. As an example, when R executes
Data objects are bound to symbol objects, and R stores this binding information in an internal table. In the
R uses a technique called copy on write to optimize copying data. For example,
R uses this copy on write trick on function arguments as well. R's functions are call-by-value, and R evaluates function arguments lazily. For example, a vector argument conceptually behaves as if it were a local copy of the input vector. In reality, however, R simply binds the symbol object for this argument to the same data object. R creates a new copy of the data object only if the function modifies this argument.
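The behavior described above can be observed from R code. The following is a minimal sketch, not taken from the R sources: the function names are made up for illustration, and the optional tracemem() part assumes an R build with memory profiling enabled, which is why it is guarded.

```r
# Call-by-value semantics: modifying the argument inside the function
# does not affect the caller's vector.
set_first_to_zero <- function(v) {
  v[1] <- 0L   # the data object is copied here, on the first write
  v
}

a <- 1:10
b <- set_first_to_zero(a)
a[1]  # still 1
b[1]  # 0

# Lazy evaluation: an argument that is never used is never evaluated,
# so the stop() call below is never executed.
ignore_arg <- function(x) "argument not touched"
result <- ignore_arg(stop("this error is never raised"))

# Optional: tracemem() reports when a copy is actually made. It is only
# available if R was built with memory profiling support.
if (capabilities("profmem")) {
  tracemem(a)
  b2 <- a        # no copy yet: both symbols share one data object
  b2[1] <- 0L    # a copy is reported at this point
  untracemem(a)
}
```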
Note that the value of the
To summarize (using an example), the actions caused by the modification
Note: In the current version of R, the
a <- 1:10
b <- a
b <- 0
a <- 0
One might wonder how R decides when to de-allocate a data object. The answer is not simple reference counting but some form of garbage collection.
Things become more complicated for RIOT, where it is possible for data objects to refer to things externally managed by RIOT. We illustrate the problem using an example. Again, consider the following code:
a <- 1:n   # if n is very large, a will be a RIOT object
           # instead of a regular R object
b <- a
b <- 0
The above code works well on regular R objects. The following figure shows the object binding status as R executes the code line by line, as described in the background section above.
symbol_a        symbol_a  symbol_b        symbol_a  symbol_b
   |                 \      /                 |         |
   |        =>        \    /        =>        |         |
 data_a               data_a               data_a    data_b
RIOT complicates the picture, however. In RIOT, large arrays, matrices, etc., are no longer placed in memory, but stored on disk. For example, in RIOT-DB (an implementation of RIOT using a database backend), array data is stored in external database tables. In-memory R data objects no longer hold the actual data, but only metadata: information on where to locate the database tables. Accesses to the data objects are forwarded to the database tables. The intricacies caused by this additional level of indirection can be illustrated by the following figure.
symbol_a        symbol_a  symbol_b        symbol_a  symbol_b
   |                 \      /                 |         |
   |                  \    /                  |         |
 data_a     =>        data_a        =>     data_a    data_b
   |                    |                      \      /
   |                    |                       \    /
 table_a             table_a                   table_a
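The last panel of the figure is where things go wrong: two data objects end up sharing one table. The sketch below imitates this situation in plain R. It is only an illustration under stated assumptions: the environment db stands in for the database, the lists stand in for RIOT metadata objects, and naive_set is a hypothetical write routine, not RIOT code.

```r
# Hypothetical stand-in for the database: an environment mapping
# table names to their contents.
db <- new.env()
assign("table_a", c(1, 2, 3), envir = db)

# In-memory "data objects" hold only metadata: the name of the table.
data_a <- list(table = "table_a")
data_b <- data_a   # R may copy the metadata, but never the table itself

# A naive in-place write that goes straight through to the table.
naive_set <- function(obj, i, value) {
  tbl <- get(obj$table, envir = db)
  tbl[i] <- value
  assign(obj$table, tbl, envir = db)
  obj
}

data_b <- naive_set(data_b, 1, 0)

# The write through data_b is visible through data_a as well,
# which violates R's copy semantics.
get(data_a$table, envir = db)  # 0 2 3
```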
From the above discussion it is clear that RIOT must implement
in-place modification differently in order to correctly handle copying
of data objects with external references. To this end, RIOT provides
its own implementations for in-place modification methods for RIOT data
types. Specifically, in-place modification methods include subset assignment (e.g.,
Assuming that R only copies a data object when making in-place modifications (true in the current implementation), RIOT maintains the following invariant: There is a one-to-one correspondence between RIOT data objects and external objects. In other words, there is no "aliasing" at the level of external objects---an external object cannot be shared by two data objects.
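One way to picture this invariant is an in-place modification routine that always gives the resulting data object a fresh external object. The sketch below is an assumption-laden illustration, not RIOT's implementation: db, ids, new_table, and riot_set are hypothetical names, and an environment stands in for the database.

```r
db  <- new.env()   # hypothetical database: table name -> contents
ids <- new.env()   # holds a counter used to generate table names
assign("last", 0L, envir = ids)

# Create a new external object and return its name.
new_table <- function(values) {
  id <- get("last", envir = ids) + 1L
  assign("last", id, envir = ids)
  name <- paste0("table_", id)
  assign(name, values, envir = db)
  name
}

# In-place modification: the result is a new data object that owns a
# new external object, so no table is shared by two data objects.
riot_set <- function(obj, i, value) {
  tbl <- get(obj$table, envir = db)
  tbl[i] <- value
  list(table = new_table(tbl))
}

data_a <- list(table = new_table(c(1, 2, 3)))
data_b <- data_a                    # still a single data object
data_b <- riot_set(data_b, 1, 0)    # data_b now has its own table

get(data_a$table, envir = db)  # 1 2 3
get(data_b$table, envir = db)  # 0 2 3
```

This version literally copies the table on every modification, which is simple but expensive for large data.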
As an optimization, when "copying" an external object for an in-place modification, we do not need to literally copy the external object. Instead, the new external object can, for example, be represented as the composition of the old external object and some "delta." In particular, we can represent the modified version of a database table as a database view over the table storing the previous state and a "delta table" logging the modifications. This optimization does introduce dependencies among external objects, which are tracked by RIOT. Such dependencies are completely orthogonal to and should not be confused with reference counting for data objects.
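The idea of a view over a base table plus a delta table can be sketched in a few lines of R. Everything here is hypothetical and chosen for illustration: the store environment, the kind/parent/delta fields, and read_object are not RIOT names, and a real backend would express the view in SQL rather than resolve it in R.

```r
# Hypothetical external store: base tables hold values; views hold a
# reference to a parent object plus a small "delta table" of changes.
store <- new.env()
assign("table_a",
       list(kind = "table", values = c(10, 20, 30, 40)),
       envir = store)

# "Copy" table_a for an in-place modification without copying its data:
# the view records only which positions changed.
assign("view_b",
       list(kind = "view", parent = "table_a",
            delta = data.frame(index = 2L, value = 0)),
       envir = store)

# Resolve an external object to its current contents.
read_object <- function(name) {
  obj <- get(name, envir = store)
  if (obj$kind == "table") return(obj$values)
  values <- read_object(obj$parent)            # state before the change
  values[obj$delta$index] <- obj$delta$value   # apply the logged changes
  values
}

read_object("table_a")  # 10 20 30 40
read_object("view_b")   # 10 0 30 40
```

Note that view_b cannot be read once table_a is gone, which is exactly the kind of dependency that has to be tracked.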
Another issue is the de-allocation of external objects. Since external objects (e.g., tables and views in databases) may need to be explicitly de-allocated, RIOT also implements the finalizer method for RIOT data types. This method is automatically called when R destroys a data object. Inside this method, RIOT deletes the external object associated with the data object to be destroyed. The dependencies introduced by the optimization above, however, mean that we might have external objects that cannot be de-allocated because they are still used in defining external objects associated with live data objects. To properly handle de-allocation of external objects, RIOT implements reference counting among external objects. This reference counting is completely orthogonal to and should not be confused with reference counting for data objects.
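The interplay of finalizers and reference counts among external objects can be sketched with R's reg.finalizer(). This is a simplified model under several assumptions: the catalog environment and the functions create_external, add_ref, release, and new_data_object are hypothetical, data objects are modelled as environments so that a finalizer can be attached, and the exact moment a finalizer runs is decided by R's garbage collector.

```r
# Hypothetical catalog of external objects with reference counts.
# An external object is referenced by its owning data object and by
# every view that is defined over it.
catalog <- new.env()

add_ref <- function(name, by) {
  obj <- get(name, envir = catalog)
  obj$refs <- obj$refs + by
  assign(name, obj, envir = catalog)
}

create_external <- function(name, parent = NULL) {
  assign(name, list(parent = parent, refs = 1L), envir = catalog)
  if (!is.null(parent)) add_ref(parent, 1L)   # the view depends on parent
}

release <- function(name) {
  add_ref(name, -1L)
  obj <- get(name, envir = catalog)
  if (obj$refs == 0L) {
    rm(list = name, envir = catalog)          # drop the table or view
    if (!is.null(obj$parent)) release(obj$parent)
  }
}

# Finalizer: called by R when the data object is garbage collected.
finalize_data_object <- function(e) release(e$external)

new_data_object <- function(name, parent = NULL) {
  create_external(name, parent)
  obj <- new.env()
  obj$external <- name
  reg.finalizer(obj, finalize_data_object)
  obj
}

a <- new_data_object("table_a")
b <- new_data_object("view_b", parent = "table_a")

rm(a); invisible(gc())    # table_a must survive: view_b depends on it
after_a <- ls(catalog)
rm(b); invisible(gc())    # now view_b and table_a can both be dropped
after_b <- ls(catalog)
```

In this model, after_a still lists both external objects, while after_b is empty once the last dependent data object has been collected.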
Finally, as discussed earlier, R does not have reference counting for data objects; the