Semantically meaningful features

Our default image of a machine learning project is:

  • a multi-year moonshot funded by a tech giant
  • either internet-scale, general-purpose media interpretation (image identification, language translation, etc.)
  • or sophisticated autonomous robotics (e.g. self-driving cars)
  • a large volume of labeled training data
  • confidence that the training set effectively represents the full distribution of observations we will see in production over an extended period
  • training time/cost is not a strong constraint (a big network of GPUs/TPUs is available as needed)
  • the goal is to provide fully automated decision making
  • supervised learning techniques can be applied
  • custom algorithms developed by a dedicated team of sophisticated data scientists

With this image in mind, it is easy to ignore concerns about whether features are semantically meaningful to a human.  One can take the attitude that the system will be tested heavily and widely, so the optimized features proven to perform best are the ones to use, even if they carry no semantic meaning.  (In this context, by “feature” I mean both the features that are fed in and any intermediate features/representations the system creates and uses internally.)

However, the majority of real-world machine learning projects will not fit the above profile.  They will be enterprise projects with a very different profile:

  • limited training data that may be a mix of labeled and unlabeled
  • not initially clear whether the training data represents the full distribution of what we will see in production
  • need to apply semi-supervised, unsupervised, reinforcement, or transfer learning techniques
  • meaningful cost and resource constraints
  • often a decision support system rather than fully automated decision making

In these circumstances it is often very valuable to have semantically meaningful features, which provide better:

  • explainability
  • ability to assess generalizability
  • troubleshooting support
  • ability to learn effectively with less labeled data
  • etc.

We recently came across three examples of work that may increase our use of semantically meaningful features.

SME concepts to seed the learning process

The Bonsai platform and its Inkling language can leverage subject matter experts’ concept definitions as a starting point for the learning process.  This reduces the training time and resources needed.  Perhaps more importantly, it produces much more explainable and manageable models, since the models are defined in terms of understood domain concepts.  Video here …
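
Inkling expresses this decomposition natively; the Python sketch below only illustrates the underlying idea and is not Bonsai’s API.  The concept names (is_rush_hour, load_ratio), the record fields, and the downstream linear model are all hypothetical assumptions of our own:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical SME-defined concepts: each maps a raw record to a value
    # with a clear domain meaning.
    def is_rush_hour(record):
        return 1.0 if record["hour"] in (7, 8, 9, 16, 17, 18) else 0.0

    def load_ratio(record):
        return record["demand"] / max(record["capacity"], 1e-9)

    CONCEPTS = [is_rush_hour, load_ratio]

    def to_concept_features(records):
        """Project raw records into the SME-defined concept space."""
        return np.array([[c(r) for c in CONCEPTS] for r in records])

    def train(records, labels):
        model = LogisticRegression().fit(to_concept_features(records), labels)
        # Each coefficient attaches to a named concept, so the fitted
        # model can be explained directly to the SME.
        weights = dict(zip([c.__name__ for c in CONCEPTS], model.coef_[0]))
        return model, weights

Because every input dimension is a named domain concept, troubleshooting reduces to inspecting values the SME already understands.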

Generative models to discover latent variables

Stefano Ermon’s Stanford team has demonstrated how to use unlabeled data to discover “latent features” that then enable effective machine learning from a very small labeled training set.  They have also done some work toward making those latent features “semantically meaningful,” with more to do.  Read more …
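
The general recipe does not depend on any particular model.  A minimal sketch, with PCA standing in for the deep generative models used in the actual work, and with purely synthetic data:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_unlabeled = rng.normal(size=(10_000, 50))   # plentiful unlabeled data
    X_labeled = rng.normal(size=(20, 50))         # tiny labeled set
    y_labeled = np.tile([0, 1], 10)               # 20 labels, two classes

    # Step 1: discover latent features from the unlabeled data alone.
    latent_model = PCA(n_components=5).fit(X_unlabeled)

    # Step 2: supervised learning in the 5-d latent space, where 20 labels
    # go much further than they would in the raw 50-d input space.
    clf = LogisticRegression().fit(latent_model.transform(X_labeled), y_labeled)

    def predict(X_new):
        return clf.predict(latent_model.transform(X_new))

If the latent dimensions are also semantically meaningful, the tiny labeled set becomes auditable: one can check whether the few labels actually cover the latent factors that matter.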

Interpretable Machine Learning through Teaching

A team from OpenAI and UC Berkeley has found that by encouraging AIs to teach each other via examples, they can identify the smallest set of examples required to teach a concept.  Those examples can then be treated as the defining examples for the concept and used as part of the explanation process.   Read more …
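
As a rough illustration of the selection idea only (the paper trains teacher and student networks jointly; this greedy loop is our own simplification), one could grow a teaching set one example at a time until a simple 1-nearest-neighbor “student” reproduces the concept:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def greedy_teaching_set(X, y, target_acc=0.95):
        """Greedily pick the examples that best let a 1-NN student
        reproduce the labels y over the full dataset X (numpy arrays)."""
        chosen = []
        while True:
            best_acc, best_i = -1.0, None
            for i in range(len(X)):
                if i in chosen:
                    continue
                trial = chosen + [i]
                student = KNeighborsClassifier(n_neighbors=1).fit(X[trial], y[trial])
                acc = student.score(X, y)
                if acc > best_acc:
                    best_acc, best_i = acc, i
            chosen.append(best_i)
            if best_acc >= target_acc or len(chosen) == len(X):
                # These indices are the concept's "defining examples":
                # showing them doubles as an explanation of the concept.
                return chosen

The returned examples serve both roles at once: they are the minimal curriculum for teaching the concept and the exhibit one would show a human to explain it.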
