DeepMind – Safe Artificial Intelligence – Victoria Krakovna

There are many types of interpretability, from identifying influential features and data points to learning disentangled representations. Which of these are the most relevant for building safe AI systems? We will examine how different safety problems benefit from different types of interpretability, and what questions…