What Cloudera is missing about automating decisions

This is a big opportunity for clients to automate decisions. In the ’90s and ’00s, the industry wrapped software around business processes. Over the next two decades we’re going to wrap software around decisions, automating bets people are going to make. … I do think this idea that we’re automating capabilities that used to be gut-driven is going to be a fundamental transition for our enterprise clients. Making decisions based on data and spotting patterns that are invisible to human eye will be the way successful companies execute.
— Cloudera founder Mike Olson: ‘We’re moving from automating processes to automating decisions’

Expanding enterprise system development from a focus on automating processes to also focus on making effective, data-driven decisions more efficiently is a sea change in how we operate. This article is correct in recognizing the importance of that change.

However, what is missing from this discussion is a recognition that, for complex and consequential decisions, we most often won't fully automate the decision. Rather, we will build human-AI hybrid systems in which the AI supports a human decision maker. We are starting to see such systems appear for a wide variety of knowledge workers, from journalists to lawyers to Airbnb hosts to nuclear sub captains.

This type of hybrid decision making inherently requires that the output of the AI models be human interpretable. If the AI assistant is a black box, then the human making the final decision won't know how to integrate and leverage that assistance.

AI Decision Support – Journalist Edition

Reuters is building an AI tool to help journalists analyse data, suggest story ideas, and even write some sentences, aiming not to replace reporters but instead augment them with a digital data scientist-cum-copywriting assistant.

… the aim is to divvy up editorial work into what machines do best (such as chew through data and spot patterns), and what human editorial staff excel at (such as asking questions, judging importance, understanding context and — presumably — drinking excessive amounts of coffee).

That differs from previous editorial tech efforts that sought to train AI to write entire stories …

The system will churn through massive datasets, looking for anything interesting: a fast moving stock price, intriguing changes in a market, or subtler patterns. Journalists are handed that information however they choose — in an email, messenger service, or via their data terminals when they sit down for a shift — alongside key context and background to help jumpstart their research if they think the story is worth pursuing. They can also enter a particular company into the system to get a quick overview, handy for background research and interview preparation.
— Reuters is taking a big gamble on AI-supported journalism

We will see more and more of this sort of human/AI hybrid approach to knowledge work as enterprises move to take full advantage of machine learning's potential, whether it is for journalists, lawyers, Airbnb hosts, or nuclear sub captains.

This approach inherently requires explainability as the human half of the hybrid needs to understand the work being done by her AI partner.

Airbnb pricing – decision support and explainability

Airbnb uses machine learning to help hosts optimize their pricing, which generates more revenue for both hosts and Airbnb itself.

This presentation by Amber Cartwright describes the design process Airbnb went through to deliver a successful system.  Two takeaways to note here:

Decision support rather than automated decision making

From the presentation it is clear that a thoughtful process went into designing the price optimization system. Notice that the system does not use machine learning to fully automate the decision making process; rather, it is a decision support system that encourages hosts to make better pricing decisions while decreasing the time and effort required to achieve those results.

Initially, Airbnb created a switch that allowed the algorithm to automatically set prices for hosts' units. They found that hosts were uncomfortable giving up full control, so the team modified the design to add guardrails: a minimum allowed price and a maximum allowed price. They also added a setting that let hosts choose the general frequency of rentals (essentially low, medium, or high, but in more host-friendly language).
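
To make the guardrail idea concrete, here is a minimal sketch of how a model's suggested price could be combined with host-set constraints. It is purely illustrative; the function, adjustment factors, and numbers are assumptions, not Airbnb's actual implementation.

```python
# Hypothetical sketch of the guardrail idea (not Airbnb's actual system):
# the model proposes a price, and host-set constraints shape the final suggestion.

def suggest_price(model_price: float,
                  min_price: float,
                  max_price: float,
                  frequency_preference: str = "medium") -> float:
    """Clamp a model-suggested nightly price to the host's guardrails.

    frequency_preference nudges the price: hosts who want frequent bookings
    accept a lower price; hosts who want fewer bookings hold out for a higher
    one. The adjustment factors here are made-up examples.
    """
    adjustment = {"high": 0.95, "medium": 1.00, "low": 1.05}[frequency_preference]
    adjusted = model_price * adjustment
    return max(min_price, min(adjusted, max_price))

# Example: the model suggests $142/night, the host allows $100-$130 and wants frequent bookings.
print(suggest_price(142.0, min_price=100.0, max_price=130.0, frequency_preference="high"))
# -> 130.0 (capped at the host's maximum)
```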

It is natural for data scientists to gravitate toward automated decision making: it reduces the number of variables and emphasizes their contribution. However, as machine learning spreads through our enterprises, we can expect a large percentage of these projects to be decision support systems, which require different methods and tools than pure automated decision making. This will be particularly true as we apply machine learning to more consequential decisions.

Explainability is key to delivering system utility

Note that the machine learning model is just one part of the overall system. The UX design, the traditional software that makes up the majority of the system, and the human judgments of the hosts are all equally important parts of the whole.

The machine learning algorithm the team applied is not the most sophisticated available, which means that, looked at in isolation, it might appear to underperform relative to alternatives. However, one benefit of the chosen algorithm is straightforward explainability, which enabled a successful UX design, which in turn drove adoption by hosts and let them fold in information that they have but the algorithm does not. A more powerful algorithm that leads to an inscrutable UX and low adoption would deliver model metrics but not enterprise utility.

AI Decision Support – Lawyer Edition

More progress in applying AI to complex and consequential tasks:

Competitors were given four hours to review five non-disclosure agreements (NDAs) and identify 30 legal issues, including arbitration, confidentiality of relationship, and indemnification. They were scored by how accurately they identified each issue.  Unfortunately for humanity, we lost the competition — badly.  The human lawyers achieved, on average, an 85 percent accuracy rate, while the AI achieved 95 percent accuracy. The AI also completed the task in 26 seconds, while the human lawyers took 92 minutes on average.
— An AI just beat top lawyers at their own game

Note that the application here is really a decision support system rather than one that takes fully automated decisions and actions. The AI will guide the lawyer to the parts of an agreement that need attention, but the lawyer will need to confirm whether those are real issues and navigate the negotiation required to reach an agreement.

These sorts of decision support systems generally require explainable results. In this case, if the AI highlights a contract clause as problematic but does not explain why it is problematic, then it is delivering only a fraction of its potential value.

“Performance > Interpretability”

Christoph Molnar has a good thread refuting the most common arguments against implementing interpretable machine learning. I recommend reading the whole thing.

In this post we will just focus on one aspect of his thread:

[Screenshot of Christoph Molnar's tweet listing circumstances where interpretability should not be sacrificed for performance]

Christoph suggests several circumstances where it is a mistake to sacrifice interpretability to improve model performance metrics. Of course, there is only so much you can fit in a tweet, so his list is incomplete. However, it is a good place to start. Let's consider each of his points:

Can’t capture 100% of your problem’s definition in a single loss function

Having a calculable loss function is important to making machine learning algorithms practical.  Minimizing the loss function is a common model performance metric.  Tuning to achieve this minimization is a standard part of developing a model.

However, in general it doesn't make sense to sacrifice interpretability to optimize the loss function, because the loss function does not fully capture the upside value creation and downside risks of applying the model. In most use cases, explainability can contribute to value creation and mitigate risks in ways that are not reflected in the loss function. This is particularly true when dealing with messy human systems, such as business and healthcare use cases.

Keep in mind that an ML model is typically only one sub-system in a bigger process, which might include other models, traditional software components, and human processes. Optimizing that one subsystem's loss function is not equivalent to optimizing the result for the entire system.

Also keep in mind that our loss function is typically only an approximation and simplification of the underlying reality we are modeling.
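
As a toy illustration of this gap (all numbers and costs below are hypothetical), a model that wins on a standard loss metric can still lose on a business-utility measure once asymmetric payoffs and a fixed decision policy are taken into account:

```python
import numpy as np

# Hypothetical example: two models scored on the same five cases.
y = np.array([1, 0, 1, 1, 0])
p_a = np.array([0.45, 0.05, 0.95, 0.95, 0.05])  # confident, but misses the first positive
p_b = np.array([0.60, 0.30, 0.70, 0.70, 0.30])  # less confident, catches all positives

def log_loss(y, p):
    """Standard binary cross-entropy loss."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def business_value(y, p, threshold=0.5, gain=100.0, cost=20.0):
    """Made-up economics: act when p >= threshold; each true positive acted on
    earns `gain`, each false positive acted on costs `cost`."""
    act = p >= threshold
    return float(np.sum(act & (y == 1)) * gain - np.sum(act & (y == 0)) * cost)

for name, p in [("model A", p_a), ("model B", p_b)]:
    print(name, "log loss:", round(log_loss(y, p), 3), "value:", business_value(y, p))
# Model A has the lower (better) log loss, but model B delivers more business value.
```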

We need to think beyond model metrics and remember that enterprise utility outweighs model performance.

The training data is imperfect

There are multiple ways your training data can be lacking. It might not accurately reflect the full range of variability we will see when applying the model. It may have leakage from the target variable. It may be missing the features needed to maximize generalizability. And so on.

In almost all of these cases, explainability helps us recognize and correct the limitations of the training data.
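
As a small, hypothetical illustration of catching one of these problems (target leakage), a feature importance check often makes a leaked feature stand out: a single feature dominating the importances alongside implausibly high accuracy is a red flag worth investigating. The data and feature names below are invented for the sketch:

```python
# Hypothetical sketch: a feature accidentally derived from the target ("leaky")
# dominates permutation importance, flagging a data problem for review.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
X_legit = rng.normal(size=(n, 3))                  # genuine predictors
y = (X_legit[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)
leaky = y + rng.normal(scale=0.05, size=n)         # near-copy of the target
X = np.column_stack([X_legit, leaky])
names = ["f0", "f1", "f2", "leaky"]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")   # 'leaky' should dwarf everything else
```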

Care to learn something about the problem

Explainability helps us make connections between the patterns found by machine learning and our causal insights into the real world systems we are modeling.  Making connections between models and insights is a powerful way to generate value from our projects.

The search for spy planes teaches us about AI explainability, generalizability and troubleshooting

Can you automatically recognize a surveillance plane by its flight path? With machine learning, yes you can. Understanding how BuzzFeed News accomplished this makes a fascinating case study and shows how explainability provides collateral benefits such as broader generalizability and easier troubleshooting.

Here is a summary of how they did it:

First we made a series of calculations to describe the flight characteristics of almost 20,000 planes in the four months of Flightradar24 data: their turning rates, speeds and altitudes flown, the areas of rectangles drawn around each flight path, and the flights’ durations. We also included information on the manufacturer and model of each aircraft, and the four-digit squawk codes emitted by the planes’ transponders.

Then we turned to an algorithm called the “random forest,” training it to distinguish between the characteristics of two groups of planes: almost 100 previously identified FBI and DHS planes, and 500 randomly selected aircraft.

… We then used its model to assess all of the planes, calculating a probability that each aircraft was a match for those flown by the FBI and DHS.

… The algorithm was not infallible: Among other candidates, it flagged several skydiving operations that circled in a relatively small area, much like a typical surveillance aircraft. But as an initial screen for candidate spy planes, it proved very effective.
—  BuzzFeed News Trained A Computer To Search For Hidden Spy Planes

They also shared their notes with more technical details including their explainability analysis:

[Variable importance charts from the BuzzFeed analysis: MeanDecreaseAccuracy and MeanDecreaseGini]

MeanDecreaseAccuracy measures the overall decrease in accuracy of the model if each variable is removed. MeanDecreaseGini measures the extent to which each variable plays a role in partitioning the data into the defined classes.

So these two charts show that the steer1 and steer2 variables, quantifying the frequency of turning hard to the left, and squawk_1, the most common squawk code broadcast by a plane’s transponder, were the most important to the model.
—  github notes for BuzzFeed News Trained A Computer To Search For Hidden Spy Planes
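
The MeanDecreaseAccuracy and MeanDecreaseGini measures quoted above come from R's randomForest package. For readers working in Python, here is a rough scikit-learn analogue offered as a sketch: impurity-based `feature_importances_` plays roughly the role of MeanDecreaseGini, while permutation importance is closer in spirit to MeanDecreaseAccuracy. The data-loading step, file name, and column names are placeholders, not BuzzFeed's actual pipeline.

```python
# Rough Python analogue of the variable importance analysis described above.
# The CSV, column names, and labels are placeholders for illustration only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

planes = pd.read_csv("planes_features.csv")   # hypothetical table of per-plane flight features
X = planes.drop(columns=["label"])            # label: 1 = known FBI/DHS plane, 0 = random plane
y = planes["label"]

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Impurity-based importances (roughly analogous to MeanDecreaseGini)
gini_importance = pd.Series(forest.feature_importances_, index=X.columns)

# Permutation importances (closer in spirit to MeanDecreaseAccuracy)
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
perm_importance = pd.Series(perm.importances_mean, index=X.columns)

print(gini_importance.sort_values(ascending=False).head(10))
print(perm_importance.sort_values(ascending=False).head(10))
```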

Thinking about this case study from an XAI perspective, some takeaways come to mind.

Sometimes we can correct categorization errors without explainability

In the initial analysis there was a consistent categorization error, noticed through inspection of the results: planes used for skydiving were being categorized as surveillance planes.

We suspect this error was found through spot checking of results and that explainability had no particular role in finding it. We also suspect the issue could be corrected without the assistance of explainability: for example, by taking a list of skydiving drop zones and generating a feature that indicates whether a plane repeatedly traveled near a drop zone in a single flight (a sketch of such a feature follows below). However, making just this one correction would miss an opportunity to improve the model more generally.
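
Here is a sketch of what such a drop-zone feature might look like. The distance threshold, pass count, and function names are invented for illustration; they are not part of BuzzFeed's analysis.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def near_dropzone_feature(track, dropzones, radius_km=2.0, min_passes=3):
    """1 if the flight path repeatedly comes within radius_km of any drop zone.

    track: list of (lat, lon) points for one flight.
    dropzones: list of (lat, lon) drop-zone locations.
    The radius and pass-count thresholds are illustrative, not tuned values.
    """
    for dz_lat, dz_lon in dropzones:
        passes = sum(1 for lat, lon in track
                     if haversine_km(lat, lon, dz_lat, dz_lon) <= radius_km)
        if passes >= min_passes:
            return 1
    return 0
```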

Explainability allows you to better project generalizability and improve feature engineering

Without explainability it is difficult to project additional insights about generalizability. In this particular case, by looking at the explanations of the incorrect skydiving categorizations, it becomes clear that the current features are insufficient to achieve good generalizability and that a fix specific to skydiving operations won't solve the whole issue.

By understanding via the explanation that circling is the key feature being used, we can also project that we might incorrectly categorize other planes that circle, such as tour planes or crop-dusting planes. These other types of potential categorization errors may not have been uncovered in our limited labeled training set (roughly 600 aircraft) or identified in our manual spot checks.

The fix for these categorization errors is most likely additional feature engineering. However, notice that the feature engineering fix for the error type that was caught (skydiving planes) may differ from the fix needed for the other potential error types: tour planes and crop dusters are not going to circle over skydiving drop zones.

This case study highlights how explainability can help us understand the generalizability of our models and how to improve the features we use to make those models more robust.

Explainability enables troubleshooting at production time

Above we discussed steps that might be taken at training and evaluation time. But what if a generalizability error is not caught up front and the system goes into production with the latent error present? At some point these failures might be noticed manually: a reporter checking on a suspected surveillance plane might be told by the owner that it is actually a tour plane. What is the reporter's next step? If no explanation comes with the result, it is very hard to know whether this was an outlier error to be expected in a correctly operating probabilistic system, an actual flaw in the model that should be corrected, or a case of the plane's owner lying. With an explanation there is a much better chance the reporter can determine the appropriate next steps to distinguish between these possibilities.

Explainability matters more for “decision support systems” 

The system BuzzFeed built is really a decision support system: its output is reviewed and evaluated by a human before any follow-up action is taken. Plane owners are not automatically sent emails questioning the use of their planes, and articles about government surveillance are not automatically written. Rather, the AI's output provides information that might prompt a human to take those steps, and the human still applies judgment before any tangible step is taken. In these circumstances an explanation is very high value: it helps the human understand and trust the AI's prediction and guides the human toward logical next steps.

XAI lesson from Google Clips

Josh Lovejoy shared an insightful post describing what the team learned while building Google Clips, a successful machine learning powered consumer product. Looking at this through an XAI lens, we see an example of providing a contrastive explanation that communicates boundary conditions.

Google Clips is an intelligent camera designed to capture candid moments of familiar people and pets. It uses completely on-device machine intelligence to learn to only focus on the people you spend time with, as well as to understand what makes for a beautiful and memorable photograph.
—  The UX of AI

The product was unconventional, and the fact that it was powered by adaptive, predictive logic was certainly going to be apparent to users.

Rather than making the system as much of a black box as possible to simplify the user experience, the team found that exposing more of its 'workings' than they originally assumed would be optimal helped users understand and accept the product.

We made sure that the user had the final say in curation; from the best still frame within a clip to its ideal duration. And we showed users more moments than what we necessarily thought was just right, because by allowing them to look a bit below the ‘water line’ and delete stuff they didn’t want, they actually developed a better understanding of what the camera was looking for, as well as what they could confidently expect it to capture in the future.
—  The UX of AI

By sharing with the user captured 'moments' they expect to be rejected and deleted, the team provides a contrastive explanation: examples from both sides of the predicted boundary condition illustrate what is happening inside the black box.

Serendipity, heatmap explanations and medical insights

By looking at the human eye, Google’s algorithms were able to predict whether someone had high blood pressure or was at risk of a heart attack or stroke
—  Washington Post

One of the most powerful things about machine learning is its ability to see patterns and make connections that are not at all obvious to humans. For example, it is not obvious that we should look at the retina to assess heart disease risk. Google found this connection as a side effect of a project aimed at predicting eye diseases. Now we have the potential for a lower-cost, less invasive technique for assessing heart disease risk.

Sometimes the surprising connections that pop out of machine learning models are misleading, coincidental correlations; sometimes they are genuine new insights. Interpretable ML techniques allow us to distinguish between the two and build value from the real insights. In this case, for example:

Google’s technique generated a “heatmap” or graphical representation of data which revealed which pixels in an image were the most important for a predicting a specific risk factor. For example, Google’s algorithm paid more attention to blood vessels for making predictions about blood pressure.
—  USA Today

Google then used these heatmaps to gather feedback from human domain experts.

It was good to see that the team made the investment to provide and validate explanations as part of their project. We expect the trend toward providing explanations to continue to accelerate.

For those interested in the details behind their technique, but who don't have a paid subscription to "Nature Biomedical Engineering", the study PDF is also available from this link:

To better understand how the neural network models arrived at the predictions, we used a deep learning technique called soft attention [30–32], a different neural network model with fewer parameters compared to Inception-v3. These small models are less powerful than Inception-v3, and were used only for generating attention heatmaps and not for the best performance results observed with Inception-v3. For each prediction shown in Figure 2, a separate model with identical architecture was trained. The models were trained on the same training data as the Inception-v3 network described above, and the same early stopping criteria were used.
—  Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning
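
The paper's approach trains separate, smaller soft-attention models just to produce the heatmaps. As a simpler (and different) illustration of the general idea of a pixel-level importance map, here is a gradient-saliency sketch in PyTorch; the model, preprocessing, and output shape are assumptions, and this is not the method used in the study:

```python
# A generic gradient-saliency sketch (not the paper's soft-attention method):
# it highlights which input pixels most influence a model's prediction.
import torch

def saliency_heatmap(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Return a per-pixel importance map for a single image.

    image: tensor of shape (channels, height, width), already preprocessed
    for whatever model is passed in. The model is assumed to output a single
    scalar prediction (e.g. a risk score) per image.
    """
    model.eval()
    x = image.unsqueeze(0).clone().requires_grad_(True)   # add batch dimension
    score = model(x).squeeze()
    score.backward()                                       # gradients of score w.r.t. pixels
    # Collapse channels: gradient magnitude approximates pixel importance.
    return x.grad.detach().abs().max(dim=1).values.squeeze(0)
```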

Semantically meaningful features

Our default image of a machine learning project is:

  • multi-year moonshot funded by tech giant
  • either internet scale general purpose media interpretation (image identification, language translation, etc.)
  • or sophisticated autonomous robotics (e.g., self-driving cars)
  • large volume of labeled training data
  • confident that the training set effectively represents the full distribution of observations we will see in production over an extended period
  • training time/cost is not a strong constraint (big network of GPUs/TPUs available as needed)
  • goal is to provide fully automated decision making
  • can apply supervised learning techniques
  • use custom algorithms developed by dedicated team of sophisticated data scientists

With this image in mind, it is easy to ignore concerns about whether features are semantically meaningful to a human. One can take the attitude that the system will be heavily and widely tested, and that the optimized features proven to perform best are the ones to use even if they have no semantic meaning. (In this context, by "feature" I mean both the features that are fed in and any intermediate features or representations that are created and used internally.)

However, the majority of real-world machine learning projects are not going to fit the above profile. They will be enterprise projects with a very different profile:

  • limited training data that may be a mix of labeled and unlabeled
  • not initially clear whether the training data represents the full distribution of what we will see in production
  • need to apply semi-supervised, unsupervised, reinforcement learning or transfer learning techniques
  • have meaningful cost and resource constraints
  • often a decision support system rather than fully automated decision making

In these circumstances it is often very valuable to have semantically meaningful features, which provide better:

  • explainability
  • ability to assess generalizability
  • troubleshooting support
  • ability to learn effectively with less labeled data
  • etc.

We recently came across three examples of work that potentially will increase our use of semantically meaningful features.

SME concepts to seed learning process

The Bonsai platform and Inkling language can leverage subject matter experts' concept definitions to provide a starting point for the learning process. This reduces the training time and resources needed. But perhaps more importantly, it produces much more explainable and manageable models, since the models are defined in terms of understood domain concepts. Video here …

Generative model to discover latent variable model

Stefano Ermon's Stanford team has demonstrated how to use unlabeled data to discover "latent features" that then enable effective machine learning from a very small labeled training set. They have done some work on generating "semantically meaningful" latent features, with more still to do. Read more …

Interpretable Machine Learning through Teaching

A team from OpenAI and UC Berkeley has found that by encouraging AIs to teach each other via examples, they can identify the smallest set of examples required to teach a concept. Those examples can then be treated as the defining examples for a given concept and used as part of the explanation process. Read more …

Risk of “Machine Learning Overkill”

We are all justifiably excited about the potential of applied machine learning. However, Venkat Raman points out the danger of letting that excitement lead us into "machine learning overkill," where we apply machine learning for its own sake.

‘This is the problem we are facing, tell us what Machine Learning Algorithms can be applied?’ … The newly minted Data Scientists quickly blurt out 2–3 ML algorithms and the enamored company hires him/her . In due course of time the algorithms are implemented. The Data Scientist impresses the company with good accuracy % of the models. The models are put in production. But lo and behold, the model does not net the company the ROI it hoped for … what happened was the Data Scientist did not have business acumen and thought his/her KPI was just building ‘good’ ML models.
— Venkat Raman: So, How Many ML Models You Have NOT Built?

Rama Ramakrishnan highlights a best practice for avoiding this risk: Create a Common-Sense Baseline First.

Experienced practitioners do this routinely.  They first think about the data and the problem a bit, develop some intuition about what makes a solution good, and think about what to avoid. They talk to business end-users who may have been solving the problem manually. They will tell you that common-sense baselines are not just simple to implement, but often hard to beat. And even when data science models beat these baselines, they do so by slim margins.
— Rama Ramakrishnan: Create a Common-Sense Baseline First.

Rama goes on to provide three real-world examples (direct marketing, product recommendations, retail price optimization) where thinking first about a baseline solution will better inform your decision making and define a benchmark for judging any potential machine learning based solution.
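
As a minimal illustration of the practice (the data file, columns, and metric below are all hypothetical), a common-sense popularity baseline for a recommendation problem takes only a few lines and yields the number any machine learning model must then beat on the same held-out data:

```python
# Minimal sketch of "baseline first" for a product-recommendation-style problem:
# recommend last period's most popular items to everyone, then require any ML
# model to beat this number on the same held-out data. All names are illustrative.
import pandas as pd

purchases = pd.read_csv("purchases.csv")        # hypothetical columns: user_id, item_id, week

train = purchases[purchases["week"] < 52]
test = purchases[purchases["week"] == 52]

top_items = set(train["item_id"].value_counts().head(10).index)   # common-sense baseline

# Hit rate: fraction of test-week users who bought at least one recommended item.
test_baskets = test.groupby("user_id")["item_id"].apply(set)
hit_rate = test_baskets.apply(lambda basket: len(basket & top_items) > 0).mean()
print(f"Popularity-baseline hit rate: {hit_rate:.1%}")
```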

In choosing between the baseline implementation and a black box AI alternative we should consider not just the relative predictive power of each as measured in the lab but also our relative ability to understand, trust and manage each of them in production.

Clearly the goal shouldn't be "find the most interesting machine learning technique to solve my problem"; it should be "find the best way to solve my problem and generate real business value." As Peadar Coyle points out, we don't want to end up as the trophy data scientist who is just the "smart nerd in the corner" and doesn't end up "adding value to organisations".

XAI is focused on improving our ability to understand and manage our models and, through those improvements, to create exactly this sort of connection between model building and value creation.

Venkat also makes a good point: in judging the applicability of machine learning, we should consider not just the accuracy percentage as measured in the lab but also the likely delta between our model and reality. This requires judging the quality of the data inputs, the causal relevance of the features, the generalizability of the training set, and so on.

‘All Models are wrong, some are useful.’ In most Machine Learning Algorithms we try to minimize the loss function.  Models are an abstraction of the reality. The word here is abstraction. It is not actual. If you think about it, the process of building Machine Learning Algorithms itself has a larger ‘Loss Function”. That is we differ from the reality.
— Venkat Raman: So, How Many ML Models You Have NOT Built?

Implementing "explainable" machine learning techniques puts us in a much better position to judge how large the delta between our model and reality is. This is part of the "insight benefit of explainability".