
Testing for Machine Learning


In machine-learning-based applications, we have to deal with more than unit testing, integration testing, or regression testing. Data debugging, feature selection, model debugging, and model optimization become a more important part of the development life cycle. Finally, any software system that uses machine learning models has other modules that must be integrated correctly and tested: we must ensure that the model gets the proper input from the system and that its output is properly processed.


Before we get to ML testing, let's briefly discuss software testing.

Software testing is the process of evaluating the functionality of a software application to determine whether the developed software meets the specified requirements, and of identifying defects so that a quality, defect-free product can be delivered.
Software testing also helps identify errors, gaps, or missing requirements relative to the actual requirements. It can be done manually or with automated tools, and it is often divided into white-box and black-box testing.

We will look into functional testing: unit testing and integration testing.

Unit testing vs Integration testing

Unit testing checks whether a small piece of software does what it is meant to do. Integration testing checks how the different modules of a system fit together.
Unit testing exercises a single component of an application; integration testing exercises the behaviour of modules working together.
The scope of a unit test is narrow: it covers only the unit or small piece of code under test, so unit tests are short and typically address a single class or function. The scope of integration testing is broad: it covers the system (or a subsystem) as a whole, so integration tests generally take much longer to run.
Python has its own unit-testing library called unittest. A simple unittest test case looks like this:

import unittest

class TestStringMethods(unittest.TestCase):

    def setUp(self):
        pass

    # Passes if the string contains four a's.
    def test_1(self):
        self.assertEqual('a' * 4, 'aaaa')

    # Passes if the string is in upper case.
    def test_2(self):
        self.assertEqual('foo'.upper(), 'FOO')

    # isupper() returns True if the string is in upper case,
    # else returns False.
    def test_3(self):
        self.assertTrue('FOO'.isupper())
        self.assertFalse('Foo'.isupper())

    # Passes if the stripped string matches the given output.
    def test_4(self):
        s = 'aunittest'
        self.assertEqual(s.strip('a'), 'unittest')

if __name__ == '__main__':
    unittest.main()
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s
OK


Integration tests verify that the various moving parts and gears inside the clock fit together well. Where unit tests check the individual gears, integration tests look at the position of the hands to determine whether the clock tells the time correctly. We look at the whole system or at its subsystems. Integration tests therefore work at a conceptually higher level than unit tests, and writing them happens at a higher level as well. Sometimes the exact behaviour of an integration test cannot be determined in advance, particularly in probabilistic or stochastic code. In those cases, validating general behaviour is the appropriate approach for integration tests.
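As a rough illustration, here is a minimal unittest-style integration check for a stochastic component. The run_pipeline function is a hypothetical stand-in for an end-to-end pipeline; since its exact output cannot be pinned down, the test only validates general behaviour such as output size and value range.

import random
import unittest

def run_pipeline(seed):
    # Hypothetical stand-in for an end-to-end pipeline with some
    # stochastic behaviour (e.g. random initialisation or sampling).
    random.seed(seed)
    return [random.random() for _ in range(100)]

class TestPipelineIntegration(unittest.TestCase):
    def test_output_within_expected_bounds(self):
        # We cannot assert an exact output, so we validate general
        # behaviour: the expected number of scores, all in [0, 1].
        scores = run_pipeline(seed=42)
        self.assertEqual(len(scores), 100)
        self.assertTrue(all(0.0 <= s <= 1.0 for s in scores))

if __name__ == '__main__':
    unittest.main()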
There is also regression testing. Regression tests differ in kind from both unit tests and integration tests. Rather than assuming that the test author knows what the expected result should be, regression tests look to the past: the expected outcome for a given input is whatever was previously measured for that same input. Regression testing assumes that the earlier behaviour was "right." It is useful for letting developers know when and how a code base has changed, but it is not good at telling anyone why the change took place. The difference between what a program actually outputs and what it historically produced is called a regression.
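A minimal sketch of a regression test, assuming a hypothetical transform function: the expected values stand in for a snapshot captured from an earlier, trusted run, and the test simply checks that today's output still matches it.

import unittest

def transform(records):
    # Hypothetical function whose historical behaviour we want to preserve.
    return [round(r * 2.5, 4) for r in records]

class TestTransformRegression(unittest.TestCase):
    def test_matches_previous_output(self):
        inputs = [1.0, 2.0, 3.5]
        # In practice this snapshot would be loaded from a file
        # produced by an earlier, trusted run of the same code.
        expected_snapshot = [2.5, 5.0, 8.75]
        self.assertEqual(transform(inputs), expected_snapshot)

if __name__ == '__main__':
    unittest.main()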
In machine-learning-based applications, these conventional tests are not enough: data debugging, feature selection, model debugging, and model optimization become a more important part of the development life cycle.
Low-quality data will have a huge impact on the performance of your model. Detecting low-quality data at input time is much better than guessing at its presence after the model makes poor predictions. Follow the advice in this section to monitor your data.

  • Continually check your data against the expected statistical values by writing rules that the data must satisfy. Such a set of rules is called a data schema. To create a data schema, understand the range and distribution of your feature data, and understand the set of possible values for your categorical features (a validation sketch follows this list).
  • Encode your understanding as the rules that make up the schema.
  • Test the data against the schema. The schema catches errors such as unexpected feature values, anomalous data distributions, and unknown categories for categorical variables.
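A minimal sketch of schema validation, assuming a toy dataset with hypothetical age and country columns: the schema encodes expected types, ranges, and allowed categories, and a small validator reports any violations.

import unittest

# A toy schema: expected types, value ranges, and allowed categories,
# written down from our understanding of the data.
SCHEMA = {
    'age': {'type': (int, float), 'min': 0, 'max': 120},
    'country': {'type': str, 'allowed': {'US', 'IN', 'DE'}},
}

def validate_row(row, schema=SCHEMA):
    """Return a list of schema violations for a single record."""
    errors = []
    for column, rules in schema.items():
        value = row.get(column)
        if not isinstance(value, rules['type']):
            errors.append(f'{column}: unexpected type {type(value).__name__}')
            continue
        if 'min' in rules and not (rules['min'] <= value <= rules['max']):
            errors.append(f'{column}: value {value} out of range')
        if 'allowed' in rules and value not in rules['allowed']:
            errors.append(f'{column}: unknown category {value!r}')
    return errors

class TestDataSchema(unittest.TestCase):
    def test_valid_row_passes(self):
        self.assertEqual(validate_row({'age': 31, 'country': 'IN'}), [])

    def test_invalid_row_is_flagged(self):
        errors = validate_row({'age': -4, 'country': 'XX'})
        self.assertEqual(len(errors), 2)

if __name__ == '__main__':
    unittest.main()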

Your training and evaluation splits must be equally representative of your input data. If the splits are statistically different, the training data will not help predict the test data. See the Sampling and Splitting Data section in the ML course Data Preparation and Feature Engineering to learn how to sample and split data. Track the splits' statistical properties and raise a flag if they diverge. In addition, check that the proportion of examples in each split remains constant; for example, if your data is split 80:20, that ratio should not shift. A sketch of such checks follows.
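A minimal sketch of these split checks, using synthetic data in place of a real pipeline: one test asserts the 80:20 ratio, another that the means of the two splits do not diverge wildly.

import random
import unittest
from statistics import mean

class TestTrainTestSplit(unittest.TestCase):
    def setUp(self):
        # Hypothetical dataset and 80:20 split; in a real project these
        # would come from your data pipeline.
        random.seed(0)
        data = [random.gauss(50, 10) for _ in range(1000)]
        random.shuffle(data)
        self.train, self.test = data[:800], data[800:]

    def test_split_ratio_is_80_20(self):
        total = len(self.train) + len(self.test)
        self.assertAlmostEqual(len(self.train) / total, 0.8, places=2)
        self.assertAlmostEqual(len(self.test) / total, 0.2, places=2)

    def test_splits_have_similar_means(self):
        # A coarse sanity check on the splits' statistical properties;
        # real projects might compare full distributions instead.
        self.assertLess(abs(mean(self.train) - mean(self.test)), 5.0)

if __name__ == '__main__':
    unittest.main()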


Figure: Machine learning system CI/CD. Courtesy: Wikipedia

Although your raw data may be valid, your model only ever sees engineered feature data. Because engineered data looks quite different from the raw input data, you need to check it independently. Write unit tests based on your understanding of the engineered features. For example, you can write unit tests to check conditions such as the following (a sketch appears after this list):
  • All numeric features are scaled between 0 and 1.
  • One-hot encoded vectors contain a single 1 and N-1 zeros.
  • Missing data is replaced by mean or default values.
  • Data distributions after transformation conform to expectations. For example, if you normalized using z-scores, the mean of the z-scores is 0.
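A minimal sketch of such unit tests for engineered features, using simple hand-rolled scaling, one-hot, and z-score helpers as hypothetical stand-ins for your feature-engineering code.

import unittest
from statistics import mean, pstdev

def min_max_scale(values):
    # Scale values linearly into the range [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(index, size):
    # Build a one-hot vector of the given size.
    return [1 if i == index else 0 for i in range(size)]

def z_score(values):
    # Standardize values to zero mean and unit variance.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

class TestEngineeredFeatures(unittest.TestCase):
    def test_numeric_features_scaled_to_unit_range(self):
        scaled = min_max_scale([3.0, 7.0, 11.0])
        self.assertTrue(all(0.0 <= v <= 1.0 for v in scaled))

    def test_one_hot_has_single_one(self):
        vec = one_hot(2, 5)
        self.assertEqual(sum(vec), 1)       # exactly one 1
        self.assertEqual(vec.count(0), 4)   # N-1 zeros

    def test_z_scores_have_zero_mean(self):
        z = z_score([2.0, 4.0, 6.0, 8.0])
        self.assertAlmostEqual(mean(z), 0.0, places=7)

if __name__ == '__main__':
    unittest.main()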
Data debugging is the first step in debugging the model. After debugging your data, follow these steps to debug your model, outlined in the following sections:
  • Verify that the labels can be predicted from the features.
  • Establish a baseline (see the sketch after this list).
  • Write tests and run them.
  • Adjust your hyperparameter values.
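A minimal sketch of a baseline check, assuming hypothetical model predictions: the model's accuracy on a test set should beat a majority-class baseline before any further tuning is worthwhile.

import unittest
from collections import Counter

def majority_class_baseline(labels):
    """Predict the most frequent training label for every example."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

class TestModelBeatsBaseline(unittest.TestCase):
    def test_model_outperforms_majority_baseline(self):
        train_labels = [0, 0, 0, 1, 1]
        test_labels = [0, 1, 0, 1, 0, 0]
        # Hypothetical model predictions; in practice these would come
        # from something like model.predict(test_features).
        model_predictions = [0, 1, 0, 1, 0, 1]

        baseline_label = majority_class_baseline(train_labels)
        baseline_predictions = [baseline_label] * len(test_labels)

        self.assertGreater(accuracy(model_predictions, test_labels),
                           accuracy(baseline_predictions, test_labels))

if __name__ == '__main__':
    unittest.main()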

Finally, any software system that uses machine learning models has other modules that must be integrated correctly and tested. We must ensure that the model gets the proper input from the system and that its output is properly processed.