## Final Result
https://learn.deeplearning.ai/chatgpt-building-system/lesson/8/evaluation

## Process
- Tune prompts on a few examples
- Add edge cases opportunistically
- Develop metrics (e.g. average accuracy) to measure performance on examples
- Collect a randomly sampled set of examples to tune on
- Collect a hold-out test set

## Differentiating Instructions from Input
To differentiate instructions from input text, wrap the input in delimiters and name them in the prompt, for example:

```
What is the sentiment of the following product review, which is delimited with triple quotes?

Review text: '''{lamp_review}'''
```

Delimiters also help guard against prompt injection.

## Classification
### Identifying intent

```python
# get_completion_from_messages is a helper defined elsewhere (not shown in these notes).
delimiter = "####"
system_message = f"""
You will be provided with customer service queries. \
The customer service query will be delimited with \
{delimiter} characters.
Classify each query into a primary category \
and a secondary category.
Provide your output in json format with the \
keys: primary and secondary.

Primary categories: Billing, Technical Support, \
Account Management, or General Inquiry.

Billing secondary categories:
Unsubscribe or upgrade
Add a payment method
Explanation for charge
Dispute a charge

Technical Support secondary categories:
General troubleshooting
Device compatibility
Software updates

Account Management secondary categories:
Password reset
Update personal information
Close account
Account security

General Inquiry secondary categories:
Product information
Pricing
Feedback
Speak to a human
"""
user_message = f"""\
I want you to delete my profile and all of my user data"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': f"{delimiter}{user_message}{delimiter}"},
]
response = get_completion_from_messages(messages)
print(response)
```

### Extracting important specifics from user queries
- Can provide examples for one-shot or few-shot prompting
- Follow [[Iterative Prompt Development]]

```python
delimiter = "####"
system_message = f"""
You will be provided with customer service queries. \
The customer service query will be delimited with {delimiter} characters.
Output a python list of json objects, where each object has the following format:
    'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
AND
    'products': <a list of products that must be found in the allowed products below>
Do not output any additional text that is not in JSON format.
Do not write any explanatory text after outputting the requested JSON.

Where the categories and products must be found in the customer service query.
If a product is mentioned, it must be associated with the correct category in the allowed products list below.
If no products or categories are found, output an empty list.

List out all products that are relevant to the customer service query based on how closely it relates
to the product name and product category.
Do not assume, from the name of the product, any features or attributes such as relative quality or price.

The allowed products are provided in JSON format.
The keys of each item represent the category.
The values of each item is a list of products that are within that category.
Allowed products: {products_and_category}
"""

few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
few_shot_assistant_1 = """
[{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
"""

few_shot_user_2 = """I want the most cheapest computer. \
What do you recommend?"""
few_shot_assistant_2 = """
[{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
"""

# user_input is assumed to be defined by the caller.
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},
    {'role': 'assistant', 'content': few_shot_assistant_1},
    {'role': 'user', 'content': f"{delimiter}{few_shot_user_2}{delimiter}"},
    {'role': 'assistant', 'content': few_shot_assistant_2},
    {'role': 'user', 'content': f"{delimiter}{user_input}{delimiter}"},
]
# `return` is only valid inside a function, so store the result instead.
response = get_completion_from_messages(messages)
```

## Moderation
### Human decency
OpenAI provides a Moderation endpoint that checks whether text complies with OpenAI's usage policies.

```python
import openai  # legacy openai<1.0 SDK interface

response = openai.Moderation.create(
    input="""
Here's the plan. We get the warhead,
and we hold the world ransom...
...FOR ONE MILLION DOLLARS!
"""
)
moderation_output = response["results"][0]
print(moderation_output)
```

The result contains per-category true/false values (`categories`), per-category scores between 0 and 1 (`category_scores`), and an overall `flagged` boolean.

### Preventing Injection
Strip any delimiter characters from the user's message before wrapping it in delimiters:

```python
# remove possible delimiters in the user's message
input_user_message = input_user_message.replace(delimiter, "")
```

[[GPT-4]] is better at following system messages.

### [[Chain of thought prompting]]

## Checking Outputs
- Can use the Moderation endpoint on outputs as well

### General check for final response
Run this check prior to sending the response to the user:

````python
system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
The product information and user and customer \
service agent messages will be delimited by \
3 backticks, i.e. ```.
Respond with a Y or N character, with no punctuation:
Y - if the output sufficiently answers the question \
AND the response correctly uses product information
N - otherwise

Output a single letter only.
"""
````

### Check with input and middle steps included
Not recommended because of the extra cost; use it only if you need high accuracy rates.

````python
# customer_message, product_information, and final_response_to_customer
# are assumed to be defined earlier in the pipeline.
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{final_response_to_customer}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
````

## Testing
- Test examples are usually collected incrementally during usage, not up front

### Specific expected results
#### Test set
- Have 10 ideal examples
- Make sure the model keeps returning the expected JSON data for them

### Abstract generated text
- Create a rubric

#### Based on context
- Context can be something like extracted product information

```python
# cust_msg, context, and completion are assumed to be defined by the caller.
user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? \
(output a number)
"""
```

#### Based on ideal or human expert answer
##### Example Ideal Answer

```python
test_set_ideal = {
    'customer_msg': """\
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?""",
    'ideal_answer': """\
Of course! The SmartX ProPhone is a powerful \
smartphone with advanced camera features. \
For instance, it has a 12MP dual camera. \
Other features include 5G wireless and 128GB storage. \
It also has a 6.1-inch display. The price is $899.99.

The FotoSnap DSLR Camera is great for \
capturing stunning photos and videos. \
Some features include 1080p video, \
3-inch LCD, a 24.2MP sensor, \
and interchangeable lenses. \
The price is 599.99.

For TVs and TV related products, we offer 3 TVs. \
All TVs offer HDR and Smart TV.

The CineView 4K TV has vibrant colors and smart features. \
Some of these features include a 55-inch display and \
4K resolution. It's priced at 599.

The CineView 8K TV is a stunning 8K TV. \
Some features include a 65-inch display and \
8K resolution. It's priced at 2999.99

The CineView OLED TV lets you experience vibrant colors. \
Some features include a 55-inch display and 4K resolution. \
It's priced at 1499.99.

We also offer 2 home theater products, both of which include Bluetooth. \
The SoundMax Home Theater is a powerful home theater system for \
an immersive audio experience.
Its features include 5.1 channel, 1000W output, and wireless subwoofer.
It's priced at 399.99.

The SoundMax Soundbar is a sleek and powerful soundbar.
Its features include 2.1 channel, 300W output, and wireless subwoofer.
It's priced at 199.99

Are there any additional questions you may have about these products \
that you mentioned here?
Or do you have other questions I can help you with?
""" } ``` #### Compare with rubric ``` def eval_vs_ideal(test_set, assistant_answer): cust_msg = test_set['customer_msg'] ideal = test_set['ideal_answer'] completion = assistant_answer system_message = """\ You are an assistant that evaluates how well the customer service agent \ answers a user question by comparing the response to the ideal (expert) response Output a single letter and nothing else. """ user_message = f"""\ You are comparing a submitted answer to an expert answer on a given question. Here is the data: [BEGIN DATA] ************ [Question]: {cust_msg} ************ [Expert]: {ideal} ************ [Submission]: {completion} ************ [END DATA] Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation. The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options: (A) The submitted answer is a subset of the expert answer and is fully consistent with it. (B) The submitted answer is a superset of the expert answer and is fully consistent with it. (C) The submitted answer contains all the same details as the expert answer. (D) There is a disagreement between the submitted answer and the expert answer. (E) The answers differ, but these differences don't matter from the perspective of factuality. choice_strings: ABCDE """ messages = [ {'role': 'system', 'content': system_message}, {'role': 'user', 'content': user_message} ] response = get_completion_from_messages(messages) return response ```