Test Data Generator

Introduction

Software projects often benefit from testing a product with realistic data before it goes into production. This is particularly true for systems that manage large quantities of data, such as those in health services, finance, and other industries. Artificially generated data may also be useful to the scientific community for visualizing the output of equations, testing data analysis programs, and other applications. Creating a dataset that is large and complex enough to mimic real-world situations can take considerable effort. The goal of this project is to automate data generation through scripted configuration and customization.

Testdata generates datasets. A dataset is a collection of values that are taken from one or more source collections. There is no restriction on how values are taken. The dataset could contain a subset of values from the source collections, or possibly the same elements repeated many times.

The data in a dataset are internally consistent. For example, a dataset generated for insertion into a database would have foreign key values in some tables that matched primary key values in other tables. To ensure consistency, datasets are built in memory, in a database based on XML and XPath.

A generation is a dataset that is one sample from one or more collections of values. Values within a dataset can be associated in 1:1, 1:N, and M:N relationships. Recursive data structures, such as trees, are generated by associating elements from different generations. Time series and other functions can be implemented with generations.

Testdata does not analyze a problem domain in order to generate data that tests the domain more effectively. Rather, it could be driven by a tool that does.

Here is a link to the project page on SourceForge.

Collections

Testdata works by iterating over collections of data. Some of these collections may be concrete (e.g. a set of strings read from a file), and some may be virtual. Virtual collections model numbers, dates, and other datatypes, where the collection does not ordinarily contain instances of its elements. Standard datatypes (boolean, char, short, int, float, double, string, date, etc.) are supported.

For example:

Script                                    Generates
<ints/>                                   an unlimited number of integers, starting at 0 and incrementing by 1
<ints constant="22"/>                     an unlimited number of copies of the value 22
<ints size="10"/>                         10 integers: 0 through 9
<ints size="10" step="2"/>                5 integers: 0, 2, 4, 6, 8
<ints first="-10" last="10" step="5"/>    4 integers: -10, -5, 0, 5
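
The rows above can be mimicked with a short Python sketch. The function `ints` is hypothetical and is not Testdata's implementation. It assumes that `last` is exclusive and that `size` bounds the range of values relative to `first`, which is how the table reads.

```python
from itertools import count, islice, repeat, takewhile

def ints(first=0, last=None, step=1, size=None, constant=None):
    """Hypothetical model of the <ints/> element as a lazy iterator."""
    if constant is not None:
        return repeat(constant)           # unlimited copies of one value
    if size is not None and last is None:
        last = first + size               # assumption: size spans the range
    values = count(first, step)           # unlimited unless bounded below
    if last is None:
        return values
    # assumption: last is exclusive, and step is positive
    return takewhile(lambda v: v < last, values)

print(list(islice(ints(), 3)))                    # [0, 1, 2]
print(list(ints(size=10, step=2)))                # [0, 2, 4, 6, 8]
print(list(ints(first=-10, last=10, step=5)))     # [-10, -5, 0, 5]
```

Modeling the collection as an iterator keeps the unlimited cases cheap: nothing is computed until a value is requested.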

Collections are managed in a tree structure -- a collection can contain a list of collections. When iterating, a pass over the outer collection causes a pass over each of the inner collections. For example:

    <collection size="7">
      <ints first="1" step="2"/>
      <ints first="10" step="10"/>
    </collection>
generates 7 pairs of integers: (1, 10), (3, 20), (5, 30), (7, 40), (9, 50), (11, 60), (13, 70). Note that the two inner collections (ints) have attributes that define their values but not their size -- this parameter is provided by the outer collection.
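
The way an outer collection drives its inner collections can be sketched in Python as follows. The helper `collection` is a hypothetical stand-in, not part of Testdata: the inner collections are unbounded iterators, and the outer size decides how many passes are made.

```python
from itertools import count, islice

def collection(size, *inner):
    """Hypothetical model: each pass over the outer collection
    advances every inner collection by one value."""
    return list(islice(zip(*inner), size))

pairs = collection(7, count(1, 2), count(10, 10))
print(pairs)
# [(1, 10), (3, 20), (5, 30), (7, 40), (9, 50), (11, 60), (13, 70)]
```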

A collection can reference another collection. This is useful for implementing primary/foreign key relationships, for data that is stored in a database. Another use is for applications that need to generate values out of a common collection. For example:

    <ints id="ref1"/>
    <collection size="5">
      <ints ref="ref1"/>
      <ints ref="ref1"/>
    </collection>
generates the pairs: (0,1), (2,3), (4,5), (6,7), (8,9).
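
The effect of two references to one collection can be illustrated with a shared Python iterator. This is only a sketch of the observable behavior. The name `ref1` mirrors the id in the script, and nothing here reflects how Testdata resolves references internally.

```python
from itertools import count

ref1 = count(0)   # stands in for <ints id="ref1"/>

# Both slots of each pair pull from the same iterator,
# so no value is handed out twice.
pairs = [(next(ref1), next(ref1)) for _ in range(5)]
print(pairs)   # [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
```

Had each slot owned its own iterator, the pairs would have been (0, 0), (1, 1), and so on. Sharing is what makes the values unique across both columns.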

Associations

Collections are associated with the association element, which contains two or more association ends (end elements). An end contains three elements: a collection, a multiplicity, and a distribution (the last two are also collections). Any of these elements can be a reference to a collection.

Multiplicity has the same meaning as in UML: it describes how many instances of one collection are connected with an instance of another collection. For example,

A[0..3]<->B[1..5]

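means that each instance of A is associated with between 1 and 5 instances of B, and each instance of B is associated with between 0 and 3 instances of A.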

Distribution describes how instances are selected for association. In general, each association end has a pool of instances to choose from. The pool is an ordered list, whose size is configurable with the end's poolSize attribute. The values generated by distribution are used as indexes into the pool. By default, distribution values range in order from zero to the size of the pool, but it is straightforward to define random or other distributions.
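
The pool and distribution mechanism described above might be pictured with the following Python sketch. All names (`sequential`, `randoms`, `select`) and the pool contents are hypothetical, and the sketch assumes only what the paragraph states: the pool is an ordered list, and distribution values are used as indexes into it.

```python
import random

pool = ["item%d" % i for i in range(1000)]   # hypothetical pool of 1000 instances

def sequential(pool_size):
    """Default-style distribution: indexes 0, 1, 2, ... in order."""
    return iter(range(pool_size))

def randoms(pool_size, seed=None):
    """Random distribution: uniformly chosen indexes into the pool."""
    rng = random.Random(seed)
    while True:
        yield rng.randrange(pool_size)

def select(pool, distribution, n):
    """Use the next n distribution values as indexes into the pool."""
    return [pool[next(distribution)] for _ in range(n)]

print(select(pool, sequential(len(pool)), 3))   # ['item0', 'item1', 'item2']
print(select(pool, randoms(len(pool), seed=1), 3))
```

Separating the index source from the pool means the same pool can be sampled in order, at random, or by any other rule without changing the selection code.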

The next example defines an association between movies and actors. Movie titles come from a file named "titles", and actor names from a file named "names". Multiplicities are movies[1..4]<->actors[1..3]: a movie has 1 to 3 actors, and an actor has acted in between 1 and 4 movies. On any given iteration, the value for movie multiplicity is chosen randomly (the maximum value is 4, one less than the value of the max attribute). For any iteration, actor associations are chosen randomly from a pool of 1000.

    <!-- association: movies-actors -->
    <association>

      <!-- movies -->
      <end>
        <strings id="movies" file="titles"/>
        <multiplicity>
          <randoms min="1" max="5"/>
        </multiplicity>
      </end>

      <!-- actors -->
      <end poolSize="1000">
        <strings id="actors" file="names"/>
        <multiplicity>
          <ints first="1" last="3"/>
        </multiplicity>
        <distribution>
          <randoms/>
        </distribution>
      </end>
    </association>

This example can be rewritten using references:

    <!-- movie collections -->
    <strings id="movies" file="titles"/>
    <randoms id="mmult" min="1" max="5"/>

    <!-- actor collections -->
    <strings id="actors" file="names"/>
    <ints id="amult" first="1" last="3"/>
    <randoms id="adist"/>

    <!-- association: movies-actors -->
    <association>
      <end collectionRef="movies" multiplicityRef="mmult" />
      <end collectionRef="actors" multiplicityRef="amult" distributionRef="adist" poolSize="1000"/>
    </association>

Functions

Operations can be performed on generated values. Several types of unary and binary operations are available in the distribution, including trigonometric functions, and the arithmetic operators addition, subtraction, multiplication, and division. Custom operators can be created.

For example,

    <add size="5">
      <ints first="1" step="2"/>
      <ints first="10" step="10"/>
    </add>
generates 5 integers: 11, 23, 35, 47, 59. Note that the add element takes a size attribute, and so looks like a collection. This is true in general for all types of elements.
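
The element-wise behavior of the add example can be checked with a short Python sketch. The helper `apply_op` is hypothetical and only models the arithmetic, not Testdata's operator classes.

```python
from itertools import count, islice
from operator import add

def apply_op(op, size, *operands):
    """Hypothetical model of an operator element:
    combine the operand collections value by value."""
    return list(islice(map(op, *operands), size))

sums = apply_op(add, 5, count(1, 2), count(10, 10))
print(sums)   # [11, 23, 35, 47, 59]
```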

In this example,

    <sin>
      <doubles last="2pi" step="0.01"/>
    </sin>
values are generated for sin(x), where 0<=x<PI*2, and x increments in steps of 0.01.
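
For comparison, the same series can be produced in plain Python. This sketch assumes the upper bound is exclusive, as stated above. It computes each x from its index rather than by repeated addition, which avoids accumulating floating-point error.

```python
import math

step = 0.01
n = math.ceil(2 * math.pi / step)     # number of x values with 0 <= x < 2*pi
xs = [i * step for i in range(n)]     # 0.00, 0.01, ..., 6.28
ys = [math.sin(x) for x in xs]

print(n)   # 629
```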

Operators can be nested:

    <add>
      <multiply>
        <ints constant="3"/>
        <ints first="-10" last="10"/>
      </multiply>
      <ints constant="-22"/>
    </add>
generates a straight line with slope 3 and y-intercept -22, whose x values range from -10 to 9. A line operator is provided in the distribution, so this example could be rewritten as:
    <line slope="3" intercept="-22">
      <ints first="-10" last="10"/>
    </line>
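
The equivalence of the nested form and the line operator can be verified with a few lines of Python. The function `line` is a hypothetical model of the operator, and the range assumes that last is exclusive, in keeping with the x values of -10 to 9 quoted above.

```python
xs = range(-10, 10)                       # first=-10, last=10 (exclusive)

# <add><multiply>3, x</multiply>-22</add>
nested = [(3 * x) + (-22) for x in xs]

def line(slope, intercept, values):
    """Hypothetical model of the <line> operator."""
    return [slope * x + intercept for x in values]

assert nested == line(3, -22, xs)
print(nested[0], nested[-1])   # -52 5
```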

Output

Output occurs by default after each iteration. After the iterator for each collection has generated its next value, an output visitor visits each iterator and writes the value to a given stream.

The output data structure is independent of the structure of the collections. As a result, a given set of collections can support many different output structures. The output structure is described by a set of element and attribute nodes, and a container that describes the stream, visitor, etc. For example:

    <!-- output description -->
    <output stream="stdout" visitor="xml">
      <element name="data">
        <attribute name="a1" ref="1"/>
        <attribute name="a2" ref="2"/>
      </element>
    </output>

    <!-- collection -->
    <collection size="5">
      <ints id="1" first="1" step="2"/>
      <ints id="2" first="10" step="10"/>
    </collection>
generates:
    <data a1="1" a2="10"/>
    <data a1="3" a2="20"/>
    <data a1="5" a2="30"/>
    <data a1="7" a2="40"/>
    <data a1="9" a2="50"/>
This same collection, with the output description:
    <output stream="stdout" visitor="xml">
      <element name="foo">
        <attribute name="x" ref="1"/>
        <element name="bar">
          <attribute name="x" ref="2"/>
        </element>
      </element>
    </output>
generates:
    <foo x="1">
      <bar x="10"/>
    </foo>
    <foo x="3">
      <bar x="20"/>
    </foo>
    <foo x="5">
      <bar x="30"/>
    </foo>
    <foo x="7">
      <bar x="40"/>
    </foo>
    <foo x="9">
      <bar x="50"/>
    </foo>
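
The separation between collections and output structure can be sketched in Python. The functions `xml_visitor` and `csv_visitor` are hypothetical and do not reproduce Testdata's visitor classes. They only show one set of generated rows rendered in two different shapes.

```python
from itertools import count, islice
from xml.sax.saxutils import quoteattr

# The generated values: 5 passes over two integer collections.
rows = list(islice(zip(count(1, 2), count(10, 10)), 5))

def xml_visitor(element, attributes, rows):
    """One empty element per iteration; attributes maps name -> column index."""
    lines = []
    for row in rows:
        attrs = " ".join(
            f"{name}={quoteattr(str(row[i]))}" for name, i in attributes.items()
        )
        lines.append(f"<{element} {attrs}/>")
    return lines

def csv_visitor(rows):
    """One comma-separated line per iteration."""
    return [",".join(str(v) for v in row) for row in rows]

xml_lines = xml_visitor("data", {"a1": 0, "a2": 1}, rows)
csv_lines = csv_visitor(rows)
print("\n".join(xml_lines))
print("\n".join(csv_lines))
```

The first call reproduces the `<data a1="1" a2="10"/>` lines shown earlier. The second renders the same rows as `1,10`, `3,20`, and so on, without touching the code that generated them.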

Floating-point number formatting is supported through java.text.DecimalFormat. Output visitors are provided in the distribution for plain text, comma-separated values (CSV), tab-separated values, and XML formats. A custom stream or visitor can be added by deriving from a base class and passing the class name in the stream or visitor attribute.