Test Data Generation

Version 1.1

March 2006

Geoffrey Slinker

Maverick Development

Introduction

Test data generation is a core part of any development process. It moves to the forefront when the process follows TDD (Test Driven Development) or an Agile, unit-test-first methodology.

Tests require mocked objects, stubs, and test harnesses. These "replacement" parts are trusted to behave correctly and often are referred to as reference implementations or reference objects.

Data is needed to drive the test. Unit tests and contract tests can be used in a reference/mock environment or be used in "full stack" or "real" configuration testing. Reuse of tests is a desired goal because of the large amount of effort needed to create accurate test code, reference objects, test harnesses, and data generators.

Test Data

Test data is generated to exercise a specific code path, and each test has an expected outcome. If the test is meant to show that the system rejects malformed data, the call is expected to fail, and the test passes when that failure occurs. If the test is meant to show that the system returns a specific value for well-formed data, the test fails when the expected value is not returned. Both sets of data (the malformed and the well-formed) are reference sets of data used to obtain expected results.
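The two kinds of expected outcome can be sketched as a pair of tests. This is only an illustration under stated assumptions: `create_person` and its rule that a name must be non-empty are hypothetical and are not part of the system described in this paper.

```python
def create_person(name):
    """Hypothetical system call: accepts a person only if the name is non-empty."""
    if not isinstance(name, str) or not name.strip():
        raise ValueError("malformed person: name is required")
    return {"name": name.strip()}


# Reference sets assumed for this sketch: one well-formed, one malformed.
WELL_FORMED_NAMES = ["Ada Lovelace", "Alan Turing"]
MALFORMED_NAMES = ["", "   ", None]


def test_well_formed_data_returns_expected_value():
    for name in WELL_FORMED_NAMES:
        # The test fails if the expected value is not returned.
        assert create_person(name) == {"name": name}


def test_malformed_data_is_rejected():
    for name in MALFORMED_NAMES:
        try:
            create_person(name)
        except ValueError:
            continue  # The call failed as expected, so the test passes.
        raise AssertionError(f"malformed name {name!r} was accepted")


test_well_formed_data_returns_expected_value()
test_malformed_data_is_rejected()
```

Both tests are driven by a reference set: the first by well-formed data with a known result, the second by malformed data with a known failure.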

Test data is a subset of a reference data set. If the reference data set is very large, the test data is a proper subset of it. Membership in the reference data set is defined by some criteria. If those criteria change, the test data must be updated so that it remains a subset of the reference data set.

Unit tests call mock objects. I propose that they call a subset of mock objects called reference objects. Reference objects operate on test data (remember that test data is a subset of reference data).

An example product is a database that stores a person. A person has a name. Once the person is persisted into the system the person also has an id.

Suppose a unit test calls the mock persistence layer and the person is returned with an id = 5. This will be referred to as test one. The returned person is serialized and saved to drive other tests.

Another test (test two) is written and the test is linked to the real persistence layer. This test uses the serialized results of test one to drive the test. This test queries the persistence manager for a person with id = 5. The real persistence layer doesn't have a person with id = 5 and the persistence manager returns something like ID_NOT_FOUND. The test expected the data to be found by the persistence layer.

Test one's data is not valid in the context of test two and the real persistence manager. The problem stems from how the person data was generated. The mock object existed before the real object, as will often be the case, and an object returned from the mock persistence layer was serialized to drive other tests. The person data returned from the mock object is not correct in that it is not "reference" data.

A mock object is a reference object when it behaves correctly, that is, the same as the real object. The mock object will be limited in the amount of data it works with (for example, a hash table of 15 persons where the real object is a database with 100,000 person entries), but that limited data should be correct. If correct values cannot be known until the real object is implemented, the mock object should be refactored to return reference data as soon as it is known how to do so. At that point the mock object finally becomes a reference object.
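The failure described in tests one and two can be reproduced with a small sketch. Everything here is assumed for illustration: `InMemoryPersistence` is a hypothetical class, the "real" layer is only a stand-in with a different ID sequence, and JSON stands in for whatever serialization the tests actually use.

```python
import json

ID_NOT_FOUND = "ID_NOT_FOUND"


class InMemoryPersistence:
    """Hypothetical persistence layer that hands out IDs from its own sequence."""

    def __init__(self, first_id):
        self._rows = {}
        self._next_id = first_id

    def save(self, name):
        person = {"id": self._next_id, "name": name}
        self._rows[person["id"]] = person
        self._next_id += 1
        return person

    def find(self, person_id):
        return self._rows.get(person_id, ID_NOT_FOUND)


# Test one: run against the mock layer and serialize the returned person.
mock_layer = InMemoryPersistence(first_id=5)
saved = json.dumps(mock_layer.save("Pat"))

# Test two: replay the serialized person against the stand-in for the real layer.
real_layer = InMemoryPersistence(first_id=14567)
real_layer.save("Pat")
stale_id = json.loads(saved)["id"]
result = real_layer.find(stale_id)  # ID_NOT_FOUND: the real layer never issued this ID
```

The person exists in both layers, but the serialized ID belongs to the mock layer only, so the replayed query cannot succeed.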

Since the data was generated programmatically from a mock object (in test one), the problem is how to change the mock data into reference data when the definition of reference data emerges or changes.

Possible solutions to this problem are:

1) Don't serialize values that are used for queries, such as IDs. Write the test to create the person on the fly and use the ID of the returned person to test the system (based on my person example).

2) Edit the test data to have correct values when the definition of correctness changes or is finally specified. This means refactoring test data generation as the real behavior of the system emerges or changes.

3) Populate the real data stores with values identical to the test data.

4) Populate the test data with values from the real data store.

Solution 1 should always be considered. If you are testing a method that gets an object by ID, first let the system create and persist the object and return its ID. Any value that is system generated or nondeterministic should not be persisted for testing if at all possible.
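A minimal sketch of solution 1 follows. The layer class and the `check_find_by_id` helper are hypothetical names introduced for this example; the point is only that the test asks the layer for an ID instead of carrying one in its data, so the same test can run against a mock layer or a real one.

```python
import itertools

ID_NOT_FOUND = "ID_NOT_FOUND"


class InMemoryPersistence:
    """Hypothetical layer; `first_id` mimics each implementation's own ID sequence."""

    def __init__(self, first_id):
        self._rows = {}
        self._ids = itertools.count(first_id)

    def save(self, name):
        person = {"id": next(self._ids), "name": name}
        self._rows[person["id"]] = person
        return person

    def find(self, person_id):
        return self._rows.get(person_id, ID_NOT_FOUND)


def check_find_by_id(layer):
    """Reusable test: create the person first, then query with the returned ID."""
    created = layer.save("Pat")
    found = layer.find(created["id"])
    assert found != ID_NOT_FOUND
    assert found["name"] == "Pat"
    return created["id"]


# The same test runs unchanged against layers that generate different IDs.
mock_id = check_find_by_id(InMemoryPersistence(first_id=1))
real_id = check_find_by_id(InMemoryPersistence(first_id=14567))
```

Only the name is test data here. The ID is system generated, so it is never stored with the test.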

Solution 2 should always be applied. If the definition of correct or well-formed data changes, the test data must be updated. If the test data has sequential person IDs from 1 to 10 and the Oracle implementation generates person IDs from 14567 to 14577, update the test data to be correct. Don't try to force the mock ID into the system, thereby converting the mock value into a reference value. Always convert the mock value to the system value; never change the system value to match the mock value just to make the mock value correct.

Solution 3 should be avoided. It amounts to making the system value the same as the mock value. The mock value should be driven by the correct system value, not the other way around.

Solution 4 is the "opposite" of solution 3. The system value drives the mock value. If a value is inserted into a database to initialize the database with some valid entries then of course those values had better be valid. Valid system values are then reproduced in the mock data.

I think that the best solution to the problem is to create mock objects that become reference objects as the real behavior emerges. The data returned from mock objects will be changed to match the real/correct data, thus becoming test data (a subset of reference data). Data stores for the system will be populated with the reference data necessary to test the component and the system. Reference objects will operate on test data (a subset of the reference data).

An example is this:

NamedPersons is the set of all people with a name.

The real system is initialized with 1000 members of the NamedPersons set.

The reference object is initialized with 15 members of the real system's set.
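The NamedPersons example can be sketched as follows. The dictionary standing in for the real system, the ID range, and the `seed_reference_store` helper are all assumptions made for illustration.

```python
import random

# Assumed stand-in for the real system: 1000 members of NamedPersons, keyed by ID.
real_store = {pid: {"id": pid, "name": f"Person {pid}"} for pid in range(14567, 15567)}


def seed_reference_store(real_rows, size, seed=0):
    """Copy `size` rows from the real store so the reference object holds real values."""
    rng = random.Random(seed)  # A fixed seed keeps the chosen subset reproducible.
    chosen = rng.sample(sorted(real_rows), size)
    return {pid: dict(real_rows[pid]) for pid in chosen}


# The reference object's data is a 15-member subset of the real system's set.
reference_store = seed_reference_store(real_store, 15)
```

Because every row is copied from the real store, any ID the reference object returns is also valid in the real system.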

As mock data evolves into test data, the data has to change. Therefore the data should be easy to regenerate or to change, and there are many ways to meet either criterion. To regenerate the data, a description of the data could feed the data generator. To change the data, a text editor or a script would do if the data is in a format such as XML or a properties file. I am not proposing a format for the data (XML, properties, serialized objects); that is not the problem this paper addresses. This paper addresses mock data, reference data, test data, and real data, and how to generate data so that tests can be reused as unit tests, contract tests, and full stack tests. There shouldn't be data specific to each layer of a system or to each object in the system (mock or real).
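The regeneration option mentioned above, a description that feeds a generator, might look like the sketch below. The description keys, the `generate_persons` function, and the use of JSON are assumptions chosen only to keep the example short; they are not a recommendation for a data format.

```python
import json
import random

# Hypothetical data description: edit it and rerun to regenerate the test data.
DESCRIPTION = {
    "count": 15,
    "first_id": 14567,
    "names": ["Pat", "Lee", "Sam", "Kim"],
    "seed": 42,
}


def generate_persons(description):
    """Build the person records the description asks for, deterministically."""
    rng = random.Random(description["seed"])
    return [
        {"id": description["first_id"] + offset, "name": rng.choice(description["names"])}
        for offset in range(description["count"])
    ]


persons = generate_persons(DESCRIPTION)

# Plain text output, so a script or a text editor can also change it directly.
as_text = json.dumps(persons, indent=2)
```

When the definition of reference data changes, for example when the real system starts IDs at a different value, only the description changes and the data is regenerated.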

Conclusion

Generated test data can work for unit tests, component tests, contract tests, and any other test of the system if the data meets the definition of reference data. Mocked data stores will hold a duplicated subset of the system data store. The system data store will hold a subset of the reference data, defined as the set of all data that meets the business rules of the system. By meeting these criteria the mock data becomes a subset of the reference data (test data), and the mock objects that use the test data become reference objects. As the real data emerges, differences will be refactored into the mock data so that it continually meets the definition of being a member of that subset of the reference data.

Appendix A

There are two types of data in a software system.

Business Data

System Data

Business data and System data can exist in the following forms:

Reference Data

Actual Data

Mock Data

Business Data is data that is part of the business the software addresses. An example would be a Person Data Object in a business that models a person. There are business rules that define business data.

System data is created by the system for internal use. This data could be a container implementation or a wrapper around a kernel resource. System data is not part of the business data but is necessary for computer processing and software implementations. Metadata is often system data. There are constraints that define system data.

The three forms are defined as follows.

Reference data is data that conforms to the validation constraints for that data. For example, a person can have a birth date and a death date, and the death date cannot be before the birth date. All persons meeting these criteria are in the reference data set. There can also be a reference data set for testing errors and exceptions: reference data for testing the business rules could include a person with a death date before the birth date. All persons with death before birth are in the reference data set of invalid persons.
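The membership rule for the two reference sets can be written as a predicate. This is a sketch under assumptions: the dictionary layout of a person and the sample records are invented for the example, and only the birth/death constraint comes from the text.

```python
from datetime import date


def is_well_formed(person):
    """Constraint from the text: a death date may not precede the birth date."""
    death = person.get("death")
    return death is None or death >= person["birth"]


# Invented sample records used to show how membership is decided.
candidates = [
    {"name": "Pat", "birth": date(1900, 1, 1), "death": date(1980, 6, 1)},
    {"name": "Lee", "birth": date(1950, 3, 2), "death": None},
    {"name": "Sam", "birth": date(1970, 5, 5), "death": date(1960, 5, 5)},
]

# Each record lands in exactly one of the two reference sets.
valid_persons = [p for p in candidates if is_well_formed(p)]
invalid_persons = [p for p in candidates if not is_well_formed(p)]
```

If the constraint changes, rerunning the predicate over the candidates shows which test data no longer belongs to its set.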

Actual data is the data that is used by the system.

Mock data is data that is used to drive the system for such things as testing.

I propose that the use of Mock data cease and that it be replaced with Test Data.

With this proposal we end up with:

Business Data and System Data

Reference Data

Actual Data

Test Data

Test Data is defined as a finite subset of Reference Data.

So we end up with:

Business Reference Data

Business Actual Data

Business Test Data

System Reference Data

System Actual Data

System Test Data

Finally there are two types of Reference Data.

Well-formed Reference Data (sometimes called Correct Reference Data)

Malformed Reference Data

Well-formed Reference Data is data that should not produce errors or exceptions. Well-formed data meets business rules, meets assertions, and meets preconditions.

Malformed Reference Data is data that is not well-formed. It is data that will produce errors and exceptions. It does not meet all of the business rules, nor all of the assertions, nor all of the preconditions. The set of malformed reference data has rules for membership in the set. One example would be all persons that have a death date before their birth date.

Actual Data that is persisted in the "real" system should never be malformed. For example, a database that is populated with values to facilitate testing should contain only well-formed data, just as the deployed and shipped system is expected to. Malformed data must be stored outside of the real system. The real system is designed to store correct data. Nobody wants a database with 300 invalid persons in it. Imagine if your test data actually got shipped by accident.
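One way to keep malformed data outside the real system is to hold it with the tests and let the store reject it. The sketch below assumes a hypothetical `ActualStore` class and an invented malformed record; it is not a description of any particular database.

```python
from datetime import date


class ActualStore:
    """Hypothetical store for actual data; it refuses anything malformed."""

    def __init__(self):
        self.rows = []

    def save(self, person):
        death = person.get("death")
        if death is not None and death < person["birth"]:
            raise ValueError("death date precedes birth date")
        self.rows.append(person)


# Malformed reference data lives with the tests, never inside the store.
MALFORMED_PERSONS = [
    {"name": "Sam", "birth": date(1970, 5, 5), "death": date(1960, 5, 5)},
]

store = ActualStore()
rejected = 0
for person in MALFORMED_PERSONS:
    try:
        store.save(person)
    except ValueError:
        rejected += 1  # Expected outcome: the store stays free of invalid persons.
```

The malformed set still drives the error-path tests, but nothing from it can be shipped inside the store by accident.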