In the paper “Towards a Human-like Open-Domain Chatbot,” published by Google Research, Brain Team, the authors introduce Meena, a 2.6-billion-parameter neural conversational model trained end-to-end. The research shows that Meena can conduct conversations that are more sensible and specific than those of previous state-of-the-art chatbots, as measured by a new human evaluation metric called Sensibleness and Specificity Average (SSA), which captures basic but crucial characteristics of human conversation. The study also shows a strong link between the SSA metric and perplexity, a commonly used automatic metric for conversational models.
Meena is an end-to-end neural conversational model that learns to respond sensibly in a wide variety of conversational contexts. The model is trained on 40B words mined and filtered from public-domain social media conversations, and the main training objective is to minimize perplexity, a measure of how hard it is for the model to predict the next token (typically a word or subword) in a conversation.
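Concretely, perplexity is the exponential of the model's average per-token negative log-likelihood. A minimal sketch (the log-probabilities here are made up for illustration):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to
    each ground-truth token (toy values, not real model outputs).
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token has perplexity 2:
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```

Lower perplexity means the model spreads less probability mass over wrong continuations, which is exactly the "easier to guess the next token" intuition above.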
The authors of the paper have made three main contributions:
They proposed a new human evaluation metric, Sensibleness and Specificity Average (SSA), for multi-turn open-domain chatbots that captures basic but important attributes of human conversation.
They showed that perplexity, an automatic metric, correlates well with human judgment, which is a contrast to recent findings on other automatic metrics.
They demonstrated that an end-to-end neural model with sufficiently low perplexity can surpass the sensibleness and specificity of existing chatbots that rely on complex, handcrafted frameworks developed over many years.
The dataset used to train Meena is mined from public-domain social media conversations. The data is in the form of message trees involving multiple speakers, with the first message being the root and replies being child nodes. Each path along the tree is treated as a conversation, with each message being a conversation turn. By using each turn as a response and all previous turns as context, training examples in the form of (context, response) pairs are created. The final Meena dataset contains 341 GB of text, significantly larger than the 40 GB of internet text GPT-2 was trained on.
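The tree-to-pairs extraction can be sketched as follows (the toy dictionary representation of the reply tree is my own; messages are plain strings):

```python
def training_pairs(tree, root):
    """Flatten a reply tree into (context, response) training pairs.

    tree: dict mapping each message to its list of replies (a
    hypothetical toy representation). Every root-to-node path is one
    conversation: each reply becomes a response, with all preceding
    turns as its context.
    """
    pairs = []

    def walk(node, context):
        context = context + [node]
        for child in tree.get(node, []):
            pairs.append((list(context), child))
            walk(child, context)

    walk(root, [])
    return pairs

# Root "Hi!" with one reply, which itself has two replies.
tree = {
    "Hi!": ["Hey, how are you?"],
    "Hey, how are you?": ["Great, you?", "Not bad."],
}
for context, response in training_pairs(tree, "Hi!"):
    print(context, "->", response)
```

Note that the two leaf replies share the same two-turn context, so one branching thread yields several training examples.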
The model is based on the Evolved Transformer seq2seq architecture, a Transformer variant discovered through evolutionary neural architecture search, with perplexity as the objective to improve.
Meena’s architecture includes a single Evolved Transformer encoder block and thirteen Evolved Transformer decoder blocks, as illustrated in the paper. The encoder is responsible for processing the context of the conversation and helping Meena understand what has already been said. The decoder then uses this information to formulate a response. Through adjustments of the hyper-parameters, the research team discovered that a more powerful decoder was key to achieving higher conversational quality.
Generating generic and bland responses has been a major challenge in existing neural conversational models. Common approaches to solving this problem include using more sophisticated decoding algorithms or new frameworks such as adversarial learning or variational autoencoding, but these methods come with added complexity and less scalability. The authors propose a simpler solution, showing that a model with sufficiently low perplexity can achieve diverse and high-quality responses using a simple sample-and-rank decoding strategy. This strategy involves sampling N independent candidate responses using random sampling with a temperature T, and selecting the candidate response with the highest probability as the final output. The temperature T is a hyper-parameter that regulates the probability distribution of the next token during decoding.
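The decoding strategy can be sketched over a toy model (the `logits_fn` interface and the default values of N and T here are illustrative, not the paper's implementation):

```python
import math
import random

def sample_and_rank(logits_fn, n=20, temperature=0.88, seed=0):
    """Sample-and-rank decoding sketch over a toy one-shot model.

    logits_fn() returns {response: logit} for the current context
    (a hypothetical stand-in for a real seq2seq model). We sample n
    independent candidates with temperature T, then return the one
    the model assigns the highest likelihood.
    """
    rng = random.Random(seed)
    logits = logits_fn()
    responses = list(logits)
    # Temperature-scaled softmax weights for sampling.
    scaled = [logits[r] / temperature for r in responses]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    candidates = rng.choices(responses, weights=weights, k=n)
    # Rank sampled candidates by the model's unscaled likelihood.
    return max(candidates, key=lambda r: logits[r])
```

The point of the two-stage scheme is that temperature sampling keeps candidates diverse, while the final ranking step filters out the low-likelihood (often incoherent) samples, avoiding both blandness and nonsense.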
A new human evaluation metric called Sensibleness and Specificity Average (SSA) is proposed to measure the quality of chatbots like Meena. The metric combines two fundamental aspects of a human-like chatbot: making sense and being specific. Human judges are asked to label each model response on these two criteria. The first, sensibleness, is a basic requirement for proper conversation with a human. But making sense alone is not enough. The SSA metric therefore adds a second dimension, specificity, which evaluates whether a response is specific to the given context, preventing bots from scoring well simply by hiding behind vague, generic replies.
The authors compared Meena, humans, and other open-domain chatbots using the SSA metric with two types of human evaluation: static and interactive. For static evaluation, they used a fixed collection of 1,477 multi-turn conversation contexts. For interactive evaluation, human judges chatted freely with the model about anything they wanted.
The results showed that the SSA metric had a strong correlation with Meena’s perplexity, both in static and interactive evaluation. This means that the better Meena fit its training data, the more sensible and specific its chat responses became. This correlation was surprising to the authors, as recent research found a poor correlation between human evaluation scores and automatic metrics such as BLEU.
They compared Meena to well-known open-domain chatbots: Mitsuku, Cleverbot, XiaoIce, and DialoGPT. Evaluation used crowd-sourced free-form conversations with the chatbot under test, with every conversation starting from the same greeting, “Hi!”. For each chatbot utterance, crowd workers answered two questions: “Does it make sense?” and “Is it specific?”, using common sense to judge whether a response was completely reasonable in context. For each chatbot, they collected between 1,600 and 2,400 labeled conversation turns across roughly 100 conversations. A chatbot's sensibleness is the fraction of its responses labeled "sensible," its specificity is the fraction labeled "specific," and the SSA score is the average of the two. The results show that Meena outperforms existing state-of-the-art chatbots by large margins in SSA and closes much of the gap to human performance.
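The scoring itself is just two label averages (the labels below are made-up toy data, not the paper's annotations):

```python
def ssa(labels):
    """Sensibleness and Specificity Average from crowd labels.

    labels: one (sensible, specific) boolean pair per model response,
    as judged by crowd workers (toy data here, not real annotations).
    """
    n = len(labels)
    sensibleness = sum(s for s, _ in labels) / n
    specificity = sum(sp for _, sp in labels) / n
    return (sensibleness + specificity) / 2

# 3 of 4 responses sensible, 2 of 4 specific:
labels = [(True, True), (True, True), (True, False), (False, False)]
print(ssa(labels))  # → 0.625
```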
Also, as mentioned before, the authors discovered that perplexity, an automatic metric readily available to any neural seq2seq model, exhibits a strong correlation with human evaluation via the Sensibleness and Specificity Average (SSA). Perplexity measures the uncertainty of a language model: the lower the perplexity, the more confident the model is when generating the next token. During development, they benchmarked eight model versions with varying hyperparameters and architectures, such as the number of layers, attention heads, total training steps, whether the Evolved Transformer or a regular Transformer was used, and whether the model was trained with hard labels or distillation. The results show that the lower the perplexity, the better the model's SSA score, with a coefficient of determination of R² = 0.93.
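The reported R² is the coefficient of determination of a least-squares fit; a quick sketch of how it is computed (the data points below are invented for illustration, not the paper's measurements):

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented (perplexity, interactive SSA) points: lower perplexity, higher SSA.
ppl = [10.2, 12.0, 14.5, 16.0, 17.5]
ssa_vals = [0.72, 0.66, 0.60, 0.55, 0.52]
print(round(r_squared(ppl, ssa_vals), 2))
```

An R² near 1 means almost all of the variation in SSA across model versions is explained by a linear function of perplexity, which is what makes perplexity useful as a cheap proxy for human evaluation.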
The authors trained an end-to-end Meena model (referred to as Meena (base)) that achieved a perplexity of 10.2 and an SSA score of 72%, approaching the SSA score of 86% achieved by the average person. They also trained a full version of Meena with a filtering mechanism and tuned decoding, which further raises the SSA score to 79%.
Figure: Interactive SSA vs. perplexity. Each blue dot is a different version of the Meena model, and the regression line demonstrates the strong correlation between SSA and perplexity. Dotted lines mark the SSA performance of humans, other bots, Meena (base) — the end-to-end trained model — and the full Meena with filtering mechanism and tuned decoding.
Read more at: https://medium.com/aiguys/meena-towards-a-human-like-open-domain-chatbot-2cdef3e0f892