A generalizability measure for program synthesis with genetic programming

The generalizability of programs synthesized by genetic programming (GP) to unseen test cases is one of the main challenges of GP-based program synthesis. Recent work showed that increasing the amount of training data improves the generalizability of the programs synthesized by GP. However, generating training data is usually an expensive task as the output value for every training case must be calculated manually by the user. Therefore, this work suggests an approximation of the expected generalization ability of solution candidates found by GP. To obtain candidate solutions that all solve the training cases, but are structurally different, a GP run is not stopped after the first solution is found that solves all training instances but search continues for more generations. For all found candidate solutions (solving all training cases), we calculate the behavioral vector for a set of randomly generated additional inputs. The proportion of the number of different found candidate solutions generating the same behavioral vector with highest frequency compared to all other found candidate solutions with different behavior can serve as an approximation for the generalizability of the found solutions. The paper presents experimental results for a number of standard program synthesis problems confirming the high prediction accuracy.