
Response Correctness Evaluation Voice and Text (Condensed Version)
Claim Splitting and Fact Checking













Version: 1.0

Prepared by: KCAE

Date: November 7, 2025

Approved By: 

Version History


Version | Date | Description of changes | Initials

v1.0 | Nov. 7, 2025 | Condensed Version for Onboarding New DAs | KCAE


  • Step 1: Response evaluation

    • Determine if the response is a deflection.

    • Determine if the response is relevant, specific and timely.

    • Split the response into claims.

    • Determine whether the response is correct.

      • Fact-check each claim in the response to identify hallucinations.

      • Label each incorrect claim as a minor issue, a major issue, or inconclusive.

      • Identify the core claims/answers.

      • Identify the reason for the incorrectness of the claims/response: core fact incorrect, additional facts incorrect, not relevant, not specific, not timely

Note: We use the terms question/query/user request interchangeably in this guideline and they all refer to the input from the user.

Step 1: Response Evaluation


The tool may provide multiple candidate answers for each utterance. A candidate response can be a short sentence, a long extended paragraph, or a request for clarification, creative writing, instructions on how to complete a task, a list of products etc. For each candidate response, evaluate the following:


  • Is the response a deflection?

  • Is the response relevant?

  • Is the response correct?

Step 1.1 Is the response a DEFLECTION?


A response is a deflection when the system did not provide an answer to the query, which is usually because of system errors or limitations. 


Examples of deflection phrases:


  "I'm sorry",

  "I apologize",

  "I am sorry",

  "FAILED TO CAPTURE RESPONSE",

  "Sorry but",

  "I don't have",

  "I am unable",

  "I'm unable",

  "Alexa+ is experiencing an interruption in service",

  "Alexa+ system is temporarily unavailable",

  "system is temporarily unavailable"


Do not label this step if the query is unintelligible, ambiguous, has harmful intent, or is not seeking information.
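As a rough aid, the phrase list above can be turned into a simple substring check. This is a minimal sketch, not part of the official tooling: the helper name `find_deflection_phrases` is hypothetical, and a match is only a hint that the response may be a deflection, since the final label still depends on the annotator's reading of the full response.

```python
# Phrases copied from the example list in this guideline.
DEFLECTION_PHRASES = [
    "I'm sorry",
    "I apologize",
    "I am sorry",
    "FAILED TO CAPTURE RESPONSE",
    "Sorry but",
    "I don't have",
    "I am unable",
    "I'm unable",
    "Alexa+ is experiencing an interruption in service",
    "Alexa+ system is temporarily unavailable",
    "system is temporarily unavailable",
]


def find_deflection_phrases(response: str) -> list[str]:
    """Return the listed phrases found in the response (hypothetical helper)."""
    # Normalize curly apostrophes and case so "I’m sorry" matches "I'm sorry".
    text = response.replace("\u2019", "'").lower()
    return [phrase for phrase in DEFLECTION_PHRASES if phrase.lower() in text]


hits = find_deflection_phrases(
    "I apologize, but I'm currently unable to access the information about Cheerios."
)
print(hits)  # ['I apologize']
```

An empty result does not prove the response is a real answer, and a non-empty result does not prove it is a deflection; the check only narrows down which responses to read more closely.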

Step 1.2 Is the response RELEVANT?


A relevant answer should provide information to the user that directly addresses the question being asked. At this step, there’s no need to fact-check the response. 


Consider the following factors in assessing the relevance of the response:


  • Relevance: A relevant answer should provide information that is directly related to the topic or subject matter of the question.

  • Specificity: A relevant answer should not be too general or vague, but should provide specific information to the user.

  • Timeliness: The relevance of an answer may be affected by the current time, location or other contextual factors.


Yes, if the system provided a relevant and specific response. 


No, if the response failed to address the user’s request. Select the reason from the labels below. More than one label is allowed.


  • Not relevant, if the response is not related to the question.


Prompt: why does my cat attack me out of the blue

Response: 

A dog may "attack out of the blue" due to a medical issue, fear, stress, or resource guarding, often triggered by subtle signs of discomfort that were missed.


  • Not specific, if the response is relevant to the topic, but did not directly answer the query.

Prompt: why does my cat attack me out of the blue

Response: 

Cats can sometimes exhibit sudden aggressive behavior for various reasons.


  • Not timely, if the query asked about a specific date but the system provided outdated information or information for a different date. However, if the query does not indicate a date and the system provided a relevant, specific, but outdated response, consider the response relevant. It must then be fact-checked, and the outdated claim should be negated during fact-checking.


Example of Not timely response (no fact-checking)

Prompt: what day is Mother’s Day celebrated on 2026

Answer Date: November 5, 2025

Answer: Mother’s Day was celebrated on May 11, 2025. 


Example of Relevant response (to be fact-checked and negated with timely information)

Prompt: when is Mother’s Day

Answer Date: November 5, 2025

Answer: Mother’s Day will be celebrated on May 11, 2025.
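When negating an outdated answer like the one above, the timely date can be worked out from the answer date. The sketch below assumes the US convention that Mother's Day falls on the second Sunday of May; the function names are hypothetical and are shown only to illustrate the reasoning.

```python
from datetime import date, timedelta


def mothers_day(year: int) -> date:
    """Second Sunday of May for the given year (US convention, assumed here)."""
    may_first = date(year, 5, 1)
    # date.weekday(): Monday is 0 and Sunday is 6.
    days_to_first_sunday = (6 - may_first.weekday()) % 7
    return may_first + timedelta(days=days_to_first_sunday + 7)


def next_mothers_day(answer_date: date) -> date:
    """Next occurrence on or after the answer date."""
    this_year = mothers_day(answer_date.year)
    if this_year >= answer_date:
        return this_year
    return mothers_day(answer_date.year + 1)


print(mothers_day(2025))                      # 2025-05-11
print(next_mothers_day(date(2025, 11, 5)))    # 2026-05-10
```

For an answer date of November 5, 2025, the May 11, 2025 date has already passed, so the timely answer to "when is Mother's Day" is the 2026 date.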


Do not label this step if the query is unintelligible, ambiguous, has harmful intent, or is not seeking information.


  • If the response is a deflection due to system limitations or errors (examples in Step 1.1), mark it as not relevant and not specific.


  • If the response says “I don’t have information/ I can’t find…”, consider it relevant, and assess during fact-checking whether the deflection is correct.

If the evaluation “Is the response relevant” was answered No, the process stops here.

Step 1.3 Is the response CORRECT?


A correct answer should be informative and provide valid information to the user that directly addresses the question being asked. 

Step 1.2.1 Enhanced Fact-checking
Step 1.2.1.1 Claims Identification


  • Step 1: We need to identify all claims within the response.  A claim is a text segment containing a statement of fact that can be proved or disproved with evidence. A response may consist of zero to many claims. 

    There can be multiple claims within a sentence. A claim may also span multiple sentences. For example: “All birds fly” is a claim. A piece of text that makes a topic introduction or an overall conclusion is not considered a claim, for example, “Here is a list of birds that cannot fly”. Common knowledge does not need to be identified as a claim (e.g. “Sleep is important to our overall health”) unless it concerns something that has been widely discredited or invalidated (e.g. “Everyone needs 8 hrs of sleep a day”).

 Additional Guidance:  Understanding Recommendations, Opinions, and Non-Claims 

  • Recommendations and opinions reflect personal views or judgments and are not factual claims.

  • These statements depend on subjective factors like reviews, ratings, or personal preferences, which can vary widely.

  •  When annotating, focus only on factual claims about products, places, or people. Do not treat subjective terms like “best,” “worst,” or “most popular” as claims because they depend on personal or source-based criteria.

 Checklist for Identifying Claims vs. Non-Claims

You should ask:

  • Does it depend on objective facts, not personal opinions or preferences?

  • Does it avoid subjective terms like “best,” “worst,” or “most popular”?

If all answers are yes, treat it as a factual claim.
If any answer is no, it is likely a recommendation or opinion and should not be considered as a claim.
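The second checklist question can be approximated with a keyword check. This is a minimal sketch under the assumption that only the three terms named in the checklist are screened; the helper name `has_subjective_term` is hypothetical, and a flagged segment still needs the annotator's judgment.

```python
import re

# Subjective terms named in the checklist above.
SUBJECTIVE_TERMS = ("best", "worst", "most popular")

# Match whole words only, so "best" does not match inside "bestseller".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in SUBJECTIVE_TERMS) + r")\b",
    flags=re.IGNORECASE,
)


def has_subjective_term(segment: str) -> bool:
    """True if the segment contains one of the listed subjective terms."""
    return _PATTERN.search(segment) is not None


print(has_subjective_term("This hotel is the best in the city."))       # True
print(has_subjective_term("It has a rooftop pool and free breakfast."))  # False
```

The check covers only the wording question. Whether a segment depends on objective facts (the first checklist question) cannot be decided by keywords.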

Claim-Splitting (Voice-Text)

EXAMPLES


CLAIM-SPLITTING

Prompt: what lake does chicago border


Response: 

Chicago is located on the southwestern shore of Lake Michigan, one of the five Great Lakes of North America. This massive freshwater lake spans all 28 miles of Chicago's shoreline, offering beautiful views and recreational opportunities for residents and visitors alike. Lake Michigan plays a vital role in Chicago's economic, social, and recreational life, serving as a scenic backdrop for the city's iconic skyline.


- Chicago is located on the southwestern shore of Lake Michigan

- one of the five Great Lakes of North America.

- This massive freshwater lake spans all 28 miles of Chicago's shoreline,

- offering beautiful views and recreational opportunities for residents and visitors alike. 

- Lake Michigan plays a vital role in Chicago's economic, social, and recreational life,

- serving as a scenic backdrop for the city's iconic skyline.

Prompt: what does the word susceptible mean


Response: 

The word 'susceptible' means easily influenced, affected, or harmed by something. It describes a vulnerability or sensitivity to certain influences, like illnesses or environmental factors.


- The word 'susceptible' means easily influenced, affected, or harmed by something. 

- It describes a vulnerability or sensitivity to certain influences, 

- like illnesses or environmental factors.

Prompt: what vitamin does apple have


Response: 

Apples are a nutritious fruit that contains several important vitamins. According to the most recent factual information:\n\n * Vitamin C: 14% of the daily recommended intake\n * Vitamin A: 2% of the daily recommended intake\n* Vitamin B6: 4% of the daily recommended intake\n* Vitamin K: 5% of the daily recommended intake\n\n Did you know that apples are also a good source of dietary fiber and potassium? They're a well-rounded fruit that can contribute to a healthy diet.


- Vitamin C: 14% of the daily recommended intake

- Vitamin A: 2% of the daily recommended intake

- Vitamin B6: 4% of the daily recommended intake

- Vitamin K: 5% of the daily recommended intake

- apples are also a good source of dietary fiber and potassium


Response: This hotel is the best in the city. It has a rooftop pool and free breakfast.

Claim: It has a rooftop pool and free breakfast.
Non-claim: This hotel is the best in the city.

Prompt: the best player on the packers team

Response date: 07-14-2025

Response: Rashan Gary is currently considered the best player on the Green Bay Packers team.  He's a standout defensive end who recently made NFL.com's annual ranking of top 100 players.

Prompt: who is the best center fielder of all time

Response date: 07-19-2025

Response:  Willie Mays is widely considered the greatest center fielder of all time.  He played 2,829 games in center field, more than any other player in history.


Factual Claim: If multiple credible sources (e.g., official records, verified sports statistics, recognized experts) consistently agree on the information, mark it as a factual claim. Example: "Steph Curry holds the record for the most 3-pointers made in a season" is verifiable through official NBA statistics.

Opinion/Recommendation: If the statement reflects a personal view or cannot be objectively measured and could vary from person to person (e.g., “Steph Curry is the best shooter ever”), it’s an opinion.

Step 1.2.1.2 Claims Verification


  • Step 2a: Verify the factual accuracy of each claim. Refer to the query and the input context (if available), and check each claim against trustworthy online sources. Some claims are time-sensitive and should be judged as of the date and time of the input context. For each claim identified in step 1, select:


  • Correct if you are able to verify that the claim is accurate with information online

  • Incorrect if the claim is disproved with evidence you found


Reminder: Always set your Google default location to US. 


Go to the Google page >> click Settings >> Search settings >> Other settings >> Language & region >> Results region >> choose United States.


Cases requiring flexible judgement


Some claims may not have a single “correct” value but still fall within an acceptable range. In these cases, a claim should be marked as Correct if it:

  • Accurately reflects the state of events on the same day, or

  • Falls within a reasonable range according to credible sources.


This applies to the following types of claims:

  • Dynamic Financial Indicators: These values can fluctuate rapidly and are often time-specific. Examples include:  

    • Stock prices 

    • Individual net worth

    • Exchange rates and interest rates

    • Real-time rates for services (e.g., flights, Ubers, hotels)

    • Market capitalization of companies

  • Live Events: These should be considered correct if the response matches the event status at any point on the referenced day. Examples include: 

    • Sports scores

    • Election results

    • Poll ratings (e.g., presidential approval ratings, election polling, TV show viewer ratings, etc)

  • Measurements Based on Estimates: Such values vary by size, quality, preparation, or source and should be judged with reasonable flexibility. Examples include:

    • Nutrient content in food

    • Price of rare items (e.g., minerals, rare coins, exotic cars, fine art)

    • Construction costs

    • Time to complete a task (e.g., “It takes 2 hours to hike this trail”)

  • Business Attributes: For details such as phone numbers, opening hours, accessibility, menu items/pricing, or accepted payment methods, fact-check using the sources below in the following order. Mark the response as correct if:

    • The attribute matches the business’s official website, or

    • If no official website is available, the attribute matches any one of the next three sources

      • Google Maps

      • Yelp (if available)

      • TripAdvisor (if available)

      • TravelMath

      • Trippy

  • Relative Geographic Location: Do not penalize minor directional inaccuracies in responses (e.g., “north” instead of “northeast”) if the response is still broadly accurate.
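The source order for business attributes can be sketched as a small function. This is an illustration only: the function name is hypothetical, and the fallback tuple uses the first three sources listed above, which is an assumed reading of "the next three sources". The official website may list more than one acceptable value (for example, separate dine-in and takeaway closing times), so it is modeled as a set.

```python
from typing import Dict, Optional, Set

# Assumption: the first three fallback sources named in this guideline.
FALLBACK_SOURCES = ("Google Maps", "Yelp", "TripAdvisor")


def business_attribute_correct(
    claimed: str,
    official_values: Optional[Set[str]],
    fallback_values: Dict[str, str],
) -> bool:
    """Apply the source order: official website first, fallbacks only if none."""
    if official_values:
        # An official website exists, so only its values count.
        return claimed in official_values
    # No official website: any one matching fallback source is enough.
    return any(fallback_values.get(name) == claimed for name in FALLBACK_SOURCES)


# Phone number example: the website value wins over a conflicting fallback.
print(business_attribute_correct(
    "805-882-0137", {"805-882-0137"}, {"TripAdvisor": "805-882-0100"}
))  # True
print(business_attribute_correct(
    "805-882-0100", {"805-882-0137"}, {"TripAdvisor": "805-882-0100"}
))  # False
```

Exact string comparison is a simplification. In practice the annotator normalizes formats (for example "5PM" versus "5:00 PM") before comparing.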



Table 8: Cases requiring flexible judgement

Type of claims: Dynamic financial indicators

Example query: what is the value of one bitcoin (on May 23 2025)
Example response: One Bitcoin is currently worth $108,800, which is down 2.4% over the last 24 hours.
What sources say: $106,841.30 - $111,798.91
Rationale: Claims are between the high and low for the day and should therefore be graded as correct.

Example query: price of solana crypto (on May 23 2025)
Example response: The current price of Solana is $180.90, up less than 1% over the last 24 hours.
What sources say: $173.57 - $187.28

Example query: who is the wealthiest man on earth (as of June 8 2025)
Example response: Elon Musk is currently the wealthiest person in the world, with a net worth of $342 billion. He's the CEO of Tesla and SpaceX.
What sources say: Musk remains the richest person on the planet with an estimated total net worth of $342 billion as of Sunday, June 8.

Example query: what is the price of microsoft (as of June 10 2025)
Example response: Microsoft's stock is currently trading at $468.47 on the NASDAQ, down 0.9% since the previous close.
What sources say: $466.96 - $472.80

Type of claims: Measurements based on estimates

Example query: does butter have cholesterol
Example response: Yes, butter does contain cholesterol. According to recent information, one pat of butter has about 10.8 milligrams of cholesterol.
What sources say:
  • 10.75mg for 1 pat of butter
  • 11mg for 1 pat of butter
Rationale: Claims are within the range of answers on the internet and should therefore be graded as correct.

Example query: how much was a pack of cigarettes in 1984
Example response: In 1984, the average price of a pack of cigarettes was about 80 cents. This was significantly lower than today's prices, which average around $6.16 per pack.
What sources say:
  • 91 cents a pack in 1984
  • 72 cents a pack in 1984
  • $1.19 a pack in 1984

Example query: what is a healthy weight for a 50 year old woman 5'8"
Example response: Based on the information available, a healthy weight range for a 50-year-old woman who is 5'8" tall is typically between 125 and 164 pounds. This range is based on a moderate body mass index (BMI) of 18.5 to 24.9, which is generally considered healthy for most adults.
What sources say:
  • 126 - 154 lbs
  • 125–164 lbs

Example query: how much does it cost to fly air force one
Example response: Based on recent reports, the cost to fly Air Force One ranges from about $177,000 to $200,000 per hour. This includes fuel, crew, and maintenance expenses.
What sources say:
  • about $177,843 per flight hour in 2022
  • about $200,000 per hour in 2025

Example query: how many calories does a red apple have
Example response: A medium red apple typically contains about 115 calories. This can vary slightly depending on the specific type and size of the apple.
What sources say: A medium red apple ranges from 70 to 125 calories according to various sources.

Type of claims: Live events

Example query: What’s Trump’s approval rating (As of June 14 2025)
Example response: President Trump's approval rating is 46.7% as of yesterday, according to RealClearPolitics. This represents a slight increase of 0.8% over the last month.
What sources say: The claim is accurate as per the source mentioned. Other sources give slightly different numbers around the same figure.
Rationale: If the state in question was true at any point during the particular day, the response should be graded as correct.

Example query: Who won the US opens (June 12 2025)
Example response: The 2025 U.S. Open is currently underway. J.J. Spaun is leading at 4 under par, with Si Woo Kim, Gary Woodland, and Thomas Detry tied for second at 2 under par.
What sources say: The question was potentially asked while the game was ongoing, and the response reflected the score at that time.

Type of claims: Local attributes for businesses

Example query: northern tier library opens when
Example response: The Northern Tier Library opens at 9:00 AM on Tuesdays.
What sources say:
  • Google Maps gives 9AM-7PM on Tuesdays
  • Library website also gives 9AM-7PM on Tuesday
  • An outdated (2020) news site gives 10AM-6PM
Rationale: Should be graded as correct as it aligns with the library website.

Example query: What time does Goat Tree close?
Example response: Goat tree in Santa Barbara is open until 5PM on Thursdays
What sources say:
  • Website gives dine-in closes at 5PM but Grab and Go closes at 8PM
  • Google Maps gives 5PM
  • Yelp gives 5PM
  • TripAdvisor gives 7.30PM
  • OpenTable gives 7.30PM
Rationale: Should be graded as correct (any answer that gives 5PM or 8PM as a response should be correct).

Example query: Contact number for Goat Tree restaurant
Example response: 805-882-0137
What sources say:
  • Website, Google and Yelp give 805-882-0137
  • TripAdvisor gives 805-882-0100
Rationale: Should be graded as correct (the TripAdvisor number should be considered incorrect).

Type of claims: Relative geographic location

Example query: Where in the world is Ukraine?
Example response: Ukraine is located in Eastern Europe. It's bordered by Russia to the east, Belarus to the north, Poland, Slovakia, and Hungary to the west, and Romania, Moldova, and the Black Sea to the south
What sources say: Romania and Moldova are mostly to the southwest and south-southwest of Ukraine, not strictly directly south.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.

Example query: where is the dead sea
Example response: The Dead Sea is located in southwestern Asia, bordered by Jordan to the east and Israel and the West Bank to the west. It's a landlocked salt lake, known for being the lowest point on Earth's surface, at 439.78 meters below sea level.
What sources say: Israel is more to the south-west than the west, but the statement should not be considered factually wrong on that basis.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.

Example query: Where is the Himalayas mountain range?
Example response: The Himalayas stretch across southern Asia, running along the northern border of India and separating it from China and Nepal.
What sources say: Nepal is mostly within the Himalayas rather than just separated by them.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.

Example query: Where is the Philippines located?
Example response: The Philippines is in Southeast Asia, east of Vietnam and south of Taiwan.
What sources say: It's more southeast of Taiwan than directly south.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.

Additional Fact-checking Examples and Reminders

Feedback

Category

Examples

If the response mentions a specific time with corresponding values, points, and percentages, use the Yahoo Finance day's range for the stock indices.

Business / Finance

Prompt: what is the stock price of tesla

Date of Query: 2025-07-02

Response: As of 8:48 AM today, Tesla's stock is trading at $313.68 on the NASDAQ, up 4.3% since the previous close.  This price was observed at 9:07 AM Pacific Daylight Time on Wednesday, July 2, 2025.


Claim: As of 8:48 AM today, Tesla's stock is trading at $313.68 on the NASDAQ


Evidence on Yahoo Finance: https://ca.finance.yahoo.com/quote/TSLA/history/ 


NasdaqGS - Nasdaq Real Time Price

USD

Tesla, Inc. (TSLA)


Jul 2, 2025

High  316.83

Low   303.82

Close 315.65


Jul 1, 2025

Close 300.71


Comment: Factually correct. $313.68 is within the day range of $303.82 - $316.83. 


Claim: up 4.3% since the previous close


Computation: 

[(Claim - Prev. Closing Price) / Prev. Closing Price] x 100%


(313.68 - 300.71) ÷ 300.71 × 100% = 12.97 ÷ 300.71 × 100% ≈ 4.31%, or up by 4.3%


Comment: The percentage change is factually correct.


Note: If the points difference or percentage change is incorrect, include the correct computed value in the evidence section. Do not include the computation in the evidence.
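The percentage-change formula above can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name, using the Tesla figures from the example.

```python
def percent_change(claimed_price: float, previous_close: float) -> float:
    """[(Claim - Prev. Closing Price) / Prev. Closing Price] x 100."""
    return (claimed_price - previous_close) / previous_close * 100


change = percent_change(313.68, 300.71)
print(f"{change:.1f}%")  # 4.3%
```

Rounding to one decimal place matches the precision used in the response ("up 4.3%"), which is the value to compare against.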


Mark the claims as not timely if the tenses do not align with the query date.


Structure of the Response

Prompt: how many more seasons of the simpsons will there be

Response date: 2025-03-31


Response: 

The Simpsons has been renewed for seasons 35 and 36, which will air through the 2024-25 season. While there's no official word on further renewals, the show remains popular and could continue beyond that.


Note: At the query date, Season 35 had already concluded while Season 36 was ongoing, so the claim should be marked as not timely.


As long as the response is relevant, fact-check it.

  • If the deflection is right, mark it as correct.

  • If the deflection is wrong, mark it as not relevant and not specific.

  • The goal is to assess the factuality of the responses. Limitations of the model (e.g., privacy) should not negatively impact the response.


Deflections

a. Prompt: are cheerios good

Response date: 2025-07-12


Response: I apologize, but I'm currently unable to access the information about Cheerios.  Is there anything else I can help you with? 


b. Prompt: news on president trump

Response date: 2025-07-14


Response: I'm sorry, I can't look up real-time news right now. Is there something specific about President Trump you'd like to know?


a-b. Relevant - No

Not relevant and not specific - No fact check needed


c. Prompt: is there a dodgers game today

Response date: 2025-07-11


Response: I don't have information about the Dodger’s game today..


d. Prompt: has p diddy been sentenced

Response date: 2025-07-11


Response: I don't have current information about P Diddy's legal status..


Relevant - Yes

  • If information is available, provide the answer to the prompt and make sure the evidence falls within the date of the query. Mark the response as incorrect, major issue, core fact incorrect.

  • If no information can be found on the web, treat the deflection as correct. 
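The deflection handling described in examples a through d can be summarized as a small decision function. This is a sketch of the rules as stated in this guideline, not an official tool. The function name, the two boolean inputs, and the dictionary keys are hypothetical, and deciding whether a deflection is due to system limitations remains the annotator's call.

```python
from typing import Any, Dict


def grade_deflection(is_system_limitation: bool, answer_found_online: bool) -> Dict[str, Any]:
    """Map the two annotator judgments to the labels described above."""
    if is_system_limitation:
        # Examples a-b: "unable to access", "can't look up real-time news".
        return {
            "relevant": "No",
            "reasons": ["Not relevant", "Not specific"],
            "fact_check": False,
        }
    if answer_found_online:
        # Examples c-d when the information existed at the query date.
        return {
            "relevant": "Yes",
            "verdict": "Incorrect",
            "severity": "Major",
            "reasons": ["Core fact incorrect"],
            "fact_check": True,
        }
    # Examples c-d when nothing can be found: the deflection is correct.
    return {
        "relevant": "Yes",
        "verdict": "Correct",
        "severity": None,
        "reasons": [],
        "fact_check": True,
    }


print(grade_deflection(is_system_limitation=False, answer_found_online=True)["reasons"])
# ['Core fact incorrect']
```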


e. Prompt: how much does a cane corso puppy cost


Response: I apologize, but I'm currently unable to retrieve the most up-to-date information on Cane Corso puppy prices. Prices can vary widely depending on factors like breeder reputation, location, and the puppy's pedigree. Generally, Cane Corso puppies from reputable breeders can be quite expensive due to their popularity and the costs associated with responsible breeding practices.\n\n Here's some general information about Cane Corso puppies:\n\n 1. **Price Range**: Typically, Cane Corso puppies can cost anywhere from $1,000 to $4,000 or more.\n\n 2. **Factors Affecting Price**:\n - Bloodline and pedigree\n - Breeder reputation\n - Geographic location\n - Color and physical characteristics\n - Health certifications and genetic testing\n\n 3. **Additional Costs**: Remember to factor in ongoing expenses such as:\n - Food (Cane Corsos are large dogs and eat a lot)\n - Veterinary care\n - Training\n - Supplies (bed, crate, toys, etc.)\n\n 4. **Adoption Alternative**: Consider adopting from a Cane Corso rescue organization, which can be less expensive and gives a home to a dog in need.\n\n 5. **Research Thoroughly**: Due to their size and temperament, Cane Corsos require experienced owners. Ensure you're prepared for the commitment before purchasing.\n\n Would you like more information about Cane Corsos in general, or are you interested in specific aspects of owning this breed?


Relevant - Yes

Exclude the deflection part of the response and split the remaining valid content into claims.

Fact-check the valid claims as they are.


Provide the latest information to negate outdated claims.

Others

Prompt: where is st lawrence university

Answer Date: 3/27/25


Response: 

St. Lawrence University is located in Canton, New York. It's a private liberal arts college founded in 1856, situated on a 1,100-acre campus in a rural setting. The university offers over 40 majors across various disciplines and has an enrollment of around 2,250 students.


Claim: situated on a 1,100-acre campus in a rural setting.

Evidence used: It has a total undergraduate enrollment of 2,060 (fall 2023), and the campus size is 1,100 acres.


Available information: https://www.stlawu.edu/offices/institutional-research/slu-quick-facts 


Statistics for Fall 2024 as of October 1

Student Enrollments        Headcount

Overall Enrollment                   1,991



Step 2b: Skip this step and mark it as null or None if your response to the claim in step 2a was “Correct”. Otherwise, assess the magnitude of the claim inaccuracy if the claim was labeled as incorrect, the magnitude of the fact omission if the claim was labeled as partially correct, and the significance of the claim if the claim was labeled as inconclusive.


  • Minor if most readers would not notice the error, find it jarring or deem it significant. If printed in a newspaper, the newspaper may not need to print a correction.

  • Major if most readers knowledgeable in the space would likely recognize the error. If printed in a newspaper, the newspaper would have to print a correction or retraction to maintain its reputation.

  • Inconclusive if you found conflicting information online that both supports and contradicts the claim, leaving you uncertain whether the claim is true or false.

Step 1.2.1.3 Core Answer Identification


  • Step 3a: In step 3, we need to identify all the core answers among the claims in the response. A core answer is the main idea or the defining aspect that addresses the query. There can be multiple claims labeled as core answers. For each claim, select:


  • TRUE if the claim is a core answer

  • FALSE if the claim is additional information

Example:

Prompt: how old is elton john

Response Date: 11-05-2025


Elton John is 78 years old. He was born on March 25, 1947.


> The core claim is - Elton John is 78 years old

Step 1.2.1.4 Reason for the Incorrectness


  • Step 4a: Label the grounds on which the response is incorrect, based on all the false claims found. More than one label can be used. After fact-checking the whole response, select from the following reasons if it was found incorrect:


  • Core fact incorrect - if the main answer to the query is false

  • Additional facts incorrect - if the additional information to the query or topic is false

  • Not timely - if the response contains outdated information

  • Not relevant - if the response has claims with information not related to the main topic

  • Not specific - if the response has provided relevant information but does not specifically answer the query. 
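The per-claim labels from Steps 2 through 5 can be held in one record per claim, and the two claim-based reasons can then be derived from them. This is a minimal sketch: the class and function names are hypothetical, and the remaining reasons (not timely, not relevant, not specific) depend on the annotator's reading of the response and are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimAnnotation:
    text: str
    verdict: str                    # "Correct" or "Incorrect" (Step 2a)
    is_core: bool                   # TRUE/FALSE (Step 3a)
    severity: Optional[str] = None  # "Minor", "Major", "Inconclusive"; None when Correct (Step 2b)
    source: str = ""                # main source link (Step 5a)


def claim_based_reasons(claims: List[ClaimAnnotation]) -> List[str]:
    """Derive the two reasons that follow directly from per-claim labels."""
    incorrect = [c for c in claims if c.verdict == "Incorrect"]
    reasons = []
    if any(c.is_core for c in incorrect):
        reasons.append("Core fact incorrect")
    if any(not c.is_core for c in incorrect):
        reasons.append("Additional facts incorrect")
    return reasons


# Hypothetical annotation: the birth year is deliberately altered
# to show an incorrect additional fact.
claims = [
    ClaimAnnotation("Elton John is 78 years old", "Correct", is_core=True),
    ClaimAnnotation("He was born on March 25, 1957", "Incorrect",
                    is_core=False, severity="Major"),
]
print(claim_based_reasons(claims))  # ['Additional facts incorrect']
```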

Step 1.2.1.5 Link to source of information

Step 5a: For each claim marked as Correct or Incorrect, include the main source link used to support your decision.


Reminder: Wikipedia can be used as evidence, but do not use information from unreliable sources such as Medium, Reddit, Quora, blogs, social media sites, AI tools / LLMs, etc. Please refer to the main guidelines for the complete list: Response Correctness Evaluation: Voice and Text.docx

Correctness Evaluation Voice and Text (Condensed Version)
Claim Splitting and Fact Checking













Version: 1.0

Prepared by: KCAE

Date: November 7, 2025

Approved By: 

Version History


Version

Date

Description of changes

Initials

v1.0

Nov.7, 2025

Condensed Version for Onboarding New DAs

KCAE


  • Step 1: Response evaluation

    • Determine if the response is a deflection.

    • Determine if the response is relevant, specific and timely.

    • Split the response into claims.

    • Determine whether the response is correct.

      • Fact-check each claim response to identify hallucinations.

      • Label the incorrect claims if they fall under: minor issues, major issues, inconclusive

      • Identify the core claims/answers.

      • Identify the reason for the incorrectness of the claims/response: core fact incorrect, additional facts incorrect, not relevant, not specific, not timely

Note: We use the terms question/query/user request interchangeably in this guideline and they all refer to the input from the user.

Step 1: Response Evaluation


The tool may provide multiple candidate answers for each utterance. A candidate response can be a short sentence, a long extended paragraph, or a request for clarification, creative writing, instructions on how to complete a task, a list of products etc. For each candidate response, evaluate the following:


  • Is the response a deflection?

  • Is the response relevant?

  • Is the response correct?

Step 1.1 Is the response a DEFLECTION?


A response is a deflection when the system did not provide an answer to the query, which is usually because of system errors or limitations. 


Example of deflection phrases: 


  "I'm sorry",

  "I apologize",

  "I am sorry",

  "FAILED TO CAPTURE RESPONSE",

  "Sorry but",

  "I don't have",

  "I am unable",

  "I'm unable",

  "Alexa+ is experiencing an interruption in service",

  "Alexa+ system is temporarily unavailable",

  "system is temporarily unavailable


Do not  label this step if the query was found unintelligible, ambiguous, with harmful intent, or it is not seeking for information.

Step 1.2 Is the response RELEVANT?


A relevant answer should provide information to the user that directly addresses the question being asked. At this step, there’s no need to fact-check the response. 


Consider the following factors in assessing the relevance of the response:


  • Relevance: A relevant answer should provide information that is directly related to the topic or subject matter of the question.

  • Specificity: A relevant answer should not be too general or vague, but should provide specific information to the user.

  • Timeliness: The relevance of an answer may be affected by the current time, location or other contextual factors.


Yes, if the system provided a relevant and specific response. 


No, if the response failed to address the user’s request. Choose the following for the reason of irrelevancy. More than one label is allowed.


  • Not relevant, if the response is not related to the question.


Prompt: why does my cat attack me out of the blue

Response: 

A dog may "attack out of the blue" due to a medical issue, fear, stress, or resource guarding, often triggered by subtle signs of discomfort that were missed.


  • Not specific, if the response is relevant to the topic, but did not directly answer the query.

Prompt: why does my cat attack me out of the blue

Response: 

Cats can sometimes exhibit sudden aggressive behavior for various reasons.


  • Not timely, if the query asked for a specific date but the system provided an outdated or different information. However, if there’s no indicated date in the query, and the system provided a relevant, specific, but outdated response, consider it relevant and must be fact-checked. It should be negated in the fact-checking process.


Example of Not timely response (no fact-checking)

Prompt: what day is Mother’s Day celebrated on 2026

Answer Date: November 5, 2025

Answer: Mother’s Day was celebrated on May 11, 2025. 


Example of Relevant response (to be fact-checked and negated with timely information)

Prompt: when is Mother’s Day

Answer Date: November 5, 2025

Answer: Mother’s Day will be celebrated on May 11, 2025.


Do not label this step if the query was found unintelligible, ambiguous, with harmful intent, or it is not seeking for information.


  • If the response is a deflection due to system limitations or errors (examples in Step 3.1), mark it as not relevant and not specific.


  • If the response says “I don’t have information/ I can’t find…” consider it as relevant, and assess if the deflection is correct in the fact-checking. 

If the evaluation “Is the response relevant” was answered No, the process stops here.

Step 1.4 Is the response CORRECT?


A correct answer should be informative and provide valid information to the user that directly addresses the question being asked. 

Step 1.2.1 Enhanced Fact-checking
Step 1.2.1.1 Claims Identification


  • Step 1: We need to identify all claims within the response.  A claim is a text segment containing a statement of fact that can be proved or disproved with evidence. A response may consist of zero to many claims. 

    There can be multiple claims within a sentence. A claim may also span multiple sentences. For example: “All birds fly” is a claim. A piece of text that makes a topic introduction or an overall conclusion is not considered a claim, for example, “Here is a list of birds that cannot fly”. Common knowledge does not need to be identified as a claim (e.g. “Sleep is important to our overall health”) unless it is about something that has been widely discredited or invalidated (e.g. “Everyone needs 8 hrs of sleep a day”).

 Additional Guidance:  Understanding Recommendations, Opinions, and Non-Claims 

  • Recommendations and opinions reflect personal views or judgments and are not factual claims.

  • These statements depend on subjective factors like reviews, ratings, or personal preferences, which can vary widely.

  • When annotating, focus only on factual claims about products, places, or people. Do not treat subjective terms like “best,” “worst,” or “most popular” as claims because they depend on personal or source-based criteria.

 Checklist for Identifying Claims vs. Non-Claims

You should ask:

  • Does it depend on objective facts, not personal opinions or preferences?

  • Does it avoid subjective terms like “best,” “worst,” or “most popular”?

If all answers are yes, treat it as a factual claim.
If any answer is no, it is likely a recommendation or opinion and should not be considered as a claim.
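
The checklist above can be sketched as a small helper. This is only a rough illustration under stated assumptions: the term list contains just the three subjective terms named in this guideline, the first checklist question (objective facts) is answered by the annotator and passed in as a flag, and the function names are hypothetical.

```python
import re

# Assumption: only the three subjective terms named in this guideline.
SUBJECTIVE_TERMS = ("best", "worst", "most popular")


def has_subjective_term(sentence: str) -> bool:
    """Return True if the sentence contains one of the listed subjective terms."""
    lowered = sentence.lower()
    return any(
        re.search(rf"\b{re.escape(term)}\b", lowered) for term in SUBJECTIVE_TERMS
    )


def likely_factual_claim(sentence: str, depends_on_objective_facts: bool) -> bool:
    """Both checklist answers must be yes for the text to be a factual claim."""
    return depends_on_objective_facts and not has_subjective_term(sentence)


# Non-claim: subjective term "best"
print(likely_factual_claim("This hotel is the best in the city.", False))
# Claim: verifiable amenities, no subjective term
print(likely_factual_claim("It has a rooftop pool and free breakfast.", True))
```

A keyword match like this will miss other subjective wording (for example “greatest”), so it cannot replace the annotator’s reading of the sentence.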

Claim-Splitting (Voice-Text)

EXAMPLES


CLAIM-SPLITTING

Prompt: what lake does chicago border


Response: 

Chicago is located on the southwestern shore of Lake Michigan, one of the five Great Lakes of North America. This massive freshwater lake spans all 28 miles of Chicago's shoreline, offering beautiful views and recreational opportunities for residents and visitors alike. Lake Michigan plays a vital role in Chicago's economic, social, and recreational life, serving as a scenic backdrop for the city's iconic skyline.


- Chicago is located on the southwestern shore of Lake Michigan

- one of the five Great Lakes of North America.

- This massive freshwater lake spans all 28 miles of Chicago's shoreline,

- offering beautiful views and recreational opportunities for residents and visitors alike. 

- Lake Michigan plays a vital role in Chicago's economic, social, and recreational life,

- serving as a scenic backdrop for the city's iconic skyline.

Prompt: what does the word susceptible mean


Response: 

The word 'susceptible' means easily influenced, affected, or harmed by something. It describes a vulnerability or sensitivity to certain influences, like illnesses or environmental factors.


- The word 'susceptible' means easily influenced, affected, or harmed by something. 

- It describes a vulnerability or sensitivity to certain influences, 

- like illnesses or environmental factors.

Prompt: what vitamin does apple have


Response: 

Apples are a nutritious fruit that contains several important vitamins. According to the most recent factual information:\n\n * Vitamin C: 14% of the daily recommended intake\n * Vitamin A: 2% of the daily recommended intake\n* Vitamin B6: 4% of the daily recommended intake\n* Vitamin K: 5% of the daily recommended intake\n\n Did you know that apples are also a good source of dietary fiber and potassium? They're a well-rounded fruit that can contribute to a healthy diet.


- Vitamin C: 14% of the daily recommended intake

- Vitamin A: 2% of the daily recommended intake

- Vitamin B6: 4% of the daily recommended intake

- Vitamin K: 5% of the daily recommended intake

- apples are also a good source of dietary fiber and potassium


This hotel is the best in the city. It has a rooftop pool and free breakfast.

Claim: It has a rooftop pool and free breakfast.
Non-claim: This hotel is the best in the city

Prompt: the best player on the packers team

Response date: 07-14-2025

Response: Rashan Gary is currently considered the best player on the Green Bay Packers team.  He's a standout defensive end who recently made NFL.com's annual ranking of top 100 players.

Prompt: who is the best center fielder of all time

Response date: 07-19-2025

Response:  Willie Mays is widely considered the greatest center fielder of all time.  He played 2,829 games in center field, more than any other player in history.


Factual Claim: If multiple credible sources (e.g., official records, verified sports statistics, recognized experts) consistently agree on the information, mark it as a factual claim. Example: "Steph Curry holds the record for the most 3-pointers made in a season" is verifiable through official NBA statistics.

Opinion/Recommendation: If the statement reflects a personal view or cannot be objectively measured and could vary from person to person (e.g., “Steph Curry is the best shooter ever”), it’s an opinion.

Step 1.2.1.2 Claims Verification


  • Step 2a: In step 2, we need to verify the factual accuracy of each claim. Refer to the input context (if available) and the query, and verify each claim against trustworthy online sources. Some claims are time-sensitive and should be judged as of the date and time of the input context. For each claim identified in step 1, select:


  • Correct if you are able to verify that the claim is accurate with information online

  • Incorrect if the claim is disproved with evidence you found


Reminder: Always set your Google default location to US. 


Go to the Google page >> click Settings >> Search settings >> Other settings >> Language & region >> Results region >> choose United States.


Cases requiring flexible judgement


Some claims may not have a single “correct” value but still fall within an acceptable range. In these cases, a claim should be marked as Correct if it:

  • Accurately reflects the state of events on the same day, or

  • Falls within a reasonable range according to credible sources.


This applies to the following types of claims:

  • Dynamic Financial Indicators: These values can fluctuate rapidly and are often time-specific. Examples include:  

    • Stock prices 

    • Individual net worth

    • Exchange rates and interest rates

    • Real-time rates for services (e.g., flights, Ubers, hotels)

    • Market capitalization of companies

  • Live Events: These should be considered correct if the response matches the event status at any point on the referenced day. Examples include: 

    • Sports scores

    • Election results

    • Poll ratings (e.g., presidential approval ratings, election polling, TV show viewer ratings, etc)

  • Measurements Based on Estimates: Such values vary by size, quality, preparation, or source and should be judged with reasonable flexibility. Examples include:

    • Nutrient content in food

    • Price of rare items (e.g., minerals, rare coins, exotic cars, fine art)

    • Construction costs

    • Time to complete a task (e.g., “It takes 2 hours to hike this trail”)

  • Business Attributes: For details such as phone numbers, opening hours, accessibility, menu items/pricing, or accepted payment methods, fact-check using the sources below in the following order. Mark the response as correct if:

    • The attribute matches the business’s official website, or

    • If no official website is available, the attribute matches any one of the following sources

      • Google Maps

      • Yelp (if available)

      • TripAdvisor (if available)

      • TravelMath

      • Trippy

  • Relative Geographic Location: Do not penalize minor directional inaccuracies in responses (e.g., “north” instead of “northeast”) if the response is still broadly accurate.
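
The source order described under Business Attributes can be sketched as follows. This is a minimal illustration, not a tool: the function name and parameters are hypothetical, and the values are assumed to have been collected by hand from each source. The official website may list more than one acceptable value (for example separate dine-in and grab-and-go closing times), so it is modelled as a set.

```python
from typing import Dict, Optional, Set

# Fallback sources as listed in this guideline.
FALLBACK_SOURCES = ("Google Maps", "Yelp", "TripAdvisor", "TravelMath", "Trippy")


def business_attribute_correct(
    claimed: str,
    official_values: Optional[Set[str]],
    fallback_values: Dict[str, str],
) -> bool:
    """Hypothetical helper applying the source order for business attributes."""
    # 1. The official website takes priority whenever it is available.
    if official_values:
        return claimed in official_values
    # 2. Otherwise, a match with any one listed fallback source is enough.
    return any(
        fallback_values.get(source) == claimed for source in FALLBACK_SOURCES
    )


# Contact number example: the website value wins over a conflicting TripAdvisor value.
print(business_attribute_correct(
    "805-882-0137",
    {"805-882-0137"},
    {"TripAdvisor": "805-882-0100"},
))
```

Exact string comparison is a simplification. In practice the annotator normalizes formats (for example “5PM” versus “5:00 PM”) before comparing.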



Table 8: Cases requiring flexible judgement

Type of claims: Dynamic financial indicators

Example query: what is the value of one bitcoin (on May 23 2025)
Example response: One Bitcoin is currently worth $108,800, which is down 2.4% over the last 24 hours.
What sources say: $106,841.30 - $111,798.91
Rationale: Claims are between the high and low for the day, and therefore should be graded as correct.

Example query: price of solana crypto (on May 23 2025)
Example response: The current price of Solana is $180.90, up less than 1% over the last 24 hours.
What sources say: $173.57 - $187.28

Example query: who is the wealthiest man on earth (as of June 8 2025)
Example response: Elon Musk is currently the wealthiest person in the world, with a net worth of $342 billion. He's the CEO of Tesla and SpaceX.
What sources say: Musk remains the richest person on the planet with an estimated total net worth of $342 billion as of Sunday, June 8.

Example query: what is the price of microsoft (as of June 10 2025)
Example response: Microsoft's stock is currently trading at $468.47 on the NASDAQ, down 0.9% since the previous close.
What sources say: $466.96 - $472.80


Type of claims: Measurements based on estimates

Example query: does butter have cholesterol
Example response: Yes, butter does contain cholesterol. According to recent information, one pat of butter has about 10.8 milligrams of cholesterol.
What sources say: 10.75mg for 1 pat of butter; 11mg for 1 pat of butter
Rationale: Claims are within the range of answers on the internet, and therefore should be graded as correct.

Example query: how much was a pack of cigarettes in 1984
Example response: In 1984, the average price of a pack of cigarettes was about 80 cents. This was significantly lower than today's prices, which average around $6.16 per pack.
What sources say: 91 cents a pack in 1984; 72 cents a pack in 1984; $1.19 a pack in 1984

Example query: what is a healthy weight for a 50 year old woman 5'8"
Example response: Based on the information available, a healthy weight range for a 50-year-old woman who is 5'8" tall is typically between 125 and 164 pounds. This range is based on a moderate body mass index (BMI) of 18.5 to 24.9, which is generally considered healthy for most adults.
What sources say: 126 - 154 lbs; 125–164 lbs

Example query: how much does it cost to fly air force one
Example response: Based on recent reports, the cost to fly Air Force One ranges from about $177,000 to $200,000 per hour. This includes fuel, crew, and maintenance expenses.
What sources say: about $177,843 per flight hour in 2022; about $200,000 per hour in 2025

Example query: how many calories does a red apple have
Example response: A medium red apple typically contains about 115 calories. This can vary slightly depending on the specific type and size of the apple.
What sources say: A medium red apple ranges from 70 to 125 calories according to various sources.


Type of claims: Live events

Example query: What’s Trump’s approval rating (as of June 14 2025)
Example response: President Trump's approval rating is 46.7% as of yesterday, according to RealClearPolitics. This represents a slight increase of 0.8% over the last month.
What sources say: The claim is accurate as per the source mentioned. Other sources give slightly different numbers around the same figure.
Rationale: If the state in question was true during the particular day, the response should be graded as correct.

Example query: Who won the US opens (June 12 2025)
Example response: The 2025 U.S. Open is currently underway. J.J. Spaun is leading at 4 under par, with Si Woo Kim, Gary Woodland, and Thomas Detry tied for second at 2 under par.
What sources say: The question was potentially asked while the game was ongoing, and the response reflected the score.


Type of claims: Local attributes for businesses

Example query: northern tier library opens when
Example response: The Northern Tier Library opens at 9:00 AM on Tuesdays.
What sources say: Google Maps gives 9AM-7PM on Tuesdays; the library website also gives 9AM-7PM on Tuesday; this (outdated, 2020) news site gives 10AM-6PM
Rationale: Should be graded as correct as it aligns with the website.

Example query: What time does Goat Tree close?
Example response: Goat tree in Santa Barbara is open until 5PM on Thursdays
What sources say: Website gives dine-in closes at 5PM but Grab and Go closes at 8PM; Google Maps gives 5PM; Yelp gives 5PM; TripAdvisor gives 7:30PM; OpenTable gives 7:30PM
Rationale: Should be graded as correct (any answer that gives 5PM or 8PM as a response should be correct).

Example query: Contact number for Goat Tree restaurant
Example response: 805-882-0137
What sources say: Website, Google and Yelp give 805-882-0137; TripAdvisor gives 805-882-0100
Rationale: Should be graded as correct (the TripAdvisor number should be considered incorrect).


Type of claims: Relative geographic location

Example query: Where in the world is Ukraine?
Example response: Ukraine is located in Eastern Europe. It's bordered by Russia to the east, Belarus to the north, Poland, Slovakia, and Hungary to the west, and Romania, Moldova, and the Black Sea to the south
What sources say: Romania and Moldova are mostly to the southwest and south-southwest of Ukraine, not strictly directly south.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.

Example query: where is the dead sea
Example response: The Dead Sea is located in southwestern Asia, bordered by Jordan to the east and Israel and the West Bank to the west. It's a landlocked salt lake, known for being the lowest point on Earth's surface, at 439.78 meters below sea level.
What sources say: Israel is more to the southwest than directly west, but the statement would not be considered factually wrong.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.

Example query: Where is the Himalayas mountain range?
Example response: The Himalayas stretch across southern Asia, running along the northern border of India and separating it from China and Nepal.
What sources say: Nepal is mostly within the Himalayas rather than just separated by them.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.

Example query: Where is the Philippines located?
Example response: The Philippines is in Southeast Asia, east of Vietnam and south of Taiwan.
What sources say: It's more southeast of Taiwan than directly south.
Rationale: Should be graded as correct. The directional framing still gives a useful and roughly accurate geographic orientation.


Additional Fact-checking Examples and Reminders

Feedback

Category

Examples

If the response mentions a specific time with corresponding values, points, and percentages, use the Yahoo Finance day's range for the stock indices.

Business / Finance

Prompt: what is the stock price of tesla

Date of Query: 2025-07-02

Response: As of 8:48 AM today, Tesla's stock is trading at $313.68 on the NASDAQ, up 4.3% since the previous close.  This price was observed at 9:07 AM Pacific Daylight Time on Wednesday, July 2, 2025.


Claim: As of 8:48 AM today, Tesla's stock is trading at $313.68 on the NASDAQ


Evidence on Yahoo Finance: https://ca.finance.yahoo.com/quote/TSLA/history/ 


NasdaqGS - Nasdaq Real Time Price

USD

Tesla, Inc. (TSLA)


Jul 2, 2025

High  316.83

Low   303.82

Close 315.65


Jul 1, 2025

Close 300.71


Comment: Factually correct. $313.68 is within the day range of $303.82 - $316.83. 


Claim: up 4.3% since the previous close


Computation: 

[(Claimed price - previous closing price) / previous closing price] × 100%


(313.68 - 300.71) / 300.71 × 100% = 12.97 / 300.71 × 100% ≈ 4.31%, or up by 4.3%


Comment: The percentage change is factually correct.


Note: If the points difference or percentage change is incorrect, include the correct computed value in the evidence section. Do not include the computation in the evidence.
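
The two checks in the Tesla example (day's range and percentage change) can be reproduced with a short sketch. The function names are hypothetical, and the prices are the ones quoted from Yahoo Finance in the example above.

```python
def within_day_range(claimed: float, low: float, high: float) -> bool:
    """True if the claimed price lies between the day's low and high (inclusive)."""
    return low <= claimed <= high


def percent_change(claimed: float, previous_close: float) -> float:
    """Percentage change of the claimed price relative to the previous close."""
    return (claimed - previous_close) / previous_close * 100


claimed_price = 313.68  # price stated in the response
print(within_day_range(claimed_price, low=303.82, high=316.83))  # True
print(round(percent_change(claimed_price, previous_close=300.71), 1))  # 4.3
```

Rounding to one decimal place matches the precision used in the response ("up 4.3%"), which is the figure being compared.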


Mark the claims as not timely if the tenses do not align with the query date.


Structure of the Response

Prompt: how many more seasons of the simpsons will there be

Response date: 2025-03-31


Response: 

The Simpsons has been renewed for seasons 35 and 36, which will air through the 2024-25 season. While there's no official word on further renewals, the show remains popular and could continue beyond that.


Note: At the query date, Season 35 had already concluded while Season 36 was ongoing, so the claim should be marked as not timely.


As long as the response is relevant, fact check. 

  • Right deflection - mark it as correct

  • If wrong - not relevant and not specific

  • Goal - factuality of the responses - limitations of the model (privacy) should not negatively impact the response


Deflections

a. Prompt: are cheerios good

Response date: 2025-07-12


Response: I apologize, but I'm currently unable to access the information about Cheerios.  Is there anything else I can help you with? 


b. Prompt: news on president trump

Response date: 2025-07-14


Response: I'm sorry, I can't look up real-time news right now. Is there something specific about President Trump you'd like to know?


a-b. Relevant - No

Not relevant and not specific - No fact check needed


c. Prompt: is there a dodgers game today

Response date: 2025-07-11


Response: I don't have information about the Dodger’s game today..


d. Prompt: has p diddy been sentenced

Response date: 2025-07-11


Response: I don't have current information about P Diddy's legal status..


Relevant - Yes

  • If information is available, provide the answer to the prompt and make sure the evidence falls within the date of the query. Mark the response as incorrect, major issue, core fact incorrect.

  • If no information can be found on the web, treat the deflection as correct. 


e. Prompt: how much does a cane corso puppy cost


Response: I apologize, but I'm currently unable to retrieve the most up-to-date information on Cane Corso puppy prices. Prices can vary widely depending on factors like breeder reputation, location, and the puppy's pedigree. Generally, Cane Corso puppies from reputable breeders can be quite expensive due to their popularity and the costs associated with responsible breeding practices.\n\n Here's some general information about Cane Corso puppies:\n\n 1. **Price Range**: Typically, Cane Corso puppies can cost anywhere from $1,000 to $4,000 or more.\n\n 2. **Factors Affecting Price**:\n - Bloodline and pedigree\n - Breeder reputation\n - Geographic location\n - Color and physical characteristics\n - Health certifications and genetic testing\n\n 3. **Additional Costs**: Remember to factor in ongoing expenses such as:\n - Food (Cane Corsos are large dogs and eat a lot)\n - Veterinary care\n - Training\n - Supplies (bed, crate, toys, etc.)\n\n 4. **Adoption Alternative**: Consider adopting from a Cane Corso rescue organization, which can be less expensive and gives a home to a dog in need.\n\n 5. **Research Thoroughly**: Due to their size and temperament, Cane Corsos require experienced owners. Ensure you're prepared for the commitment before purchasing.\n\n Would you like more information about Cane Corsos in general, or are you interested in specific aspects of owning this breed?


Relevant - Yes

Exclude the deflection part of the response and claim-split the valid claims.

Fact-check the valid claims as they are.


Provide the latest information to negate outdated claims.

Others

Prompt: where is st lawrence university

Answer Date: 3/27/25


Response: 

St. Lawrence University is located in Canton, New York. It's a private liberal arts college founded in 1856, situated on a 1,100-acre campus in a rural setting. The university offers over 40 majors across various disciplines and has an enrollment of around 2,250 students.


Claim: situated on a 1,100-acre campus in a rural setting.

Evidence used: It has a total undergraduate enrollment of 2,060 (fall 2023), and the campus size is 1,100 acres.


Available information: https://www.stlawu.edu/offices/institutional-research/slu-quick-facts 


Statistics for Fall 2024 as of October 1

Student Enrollments        Headcount

Overall Enrollment                   1,991



Step 2b: (skip this step and mark it as null or None if your response to the claim in step 2a was “Correct”). Assess the magnitude of the claim inaccuracy if the claim was labeled as incorrect. Assess the magnitude of the fact omission if the claim was labeled as partially correct. Assess the significance of the claim if the claim was labeled as inconclusive.


  • Minor if most readers would not notice the error, find it jarring or deem it significant. If printed in a newspaper, the newspaper may not need to print a correction.

  • Major if most readers knowledgeable in the space would likely recognize the error. If printed in a newspaper, the newspaper would have to print a correction or retraction to maintain its reputation.

  • Inconclusive if you found conflicting information online that both supports and contradicts the claim, leaving you uncertain whether the claim is true or false.

Step 1.2.1.3 Core Answer Identification


  • Step 3a: In step 3, we need to identify all the core answers among the claims in the response. A core answer is the main idea or the defining aspect that addresses the query. There can be multiple claims labeled as core answers. For each claim, select:


  • TRUE if the claim is a core answer

  • FALSE if the claim is additional information

Example:

Prompt: how old is elton john

Response Date: 11-05-2025


Elton John is 78 years old. He was born on March 25, 1947.


> The core claim is - Elton John is 78 years old
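
Because the core claim here is time-sensitive, it has to be checked against the response date. A minimal sketch of that check, assuming the response date 11-05-2025 is read as November 5, 2025 and using the birth date stated in the response (the function name `age_on` is hypothetical):

```python
from datetime import date


def age_on(birth_date: date, as_of: date) -> int:
    """Age in completed years on the given date."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


print(age_on(date(1947, 3, 25), date(2025, 11, 5)))  # 78
```

The birth date itself is a separate claim and still needs to be verified against a trustworthy source.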

Step 1.2.1.4 Reason for the Incorrectness


  • Step 4a: In step 4, we need to label the grounds on which the response is incorrect, based on all the false claims found. More than one label can be used. After fact-checking the whole response, select from the following reasons if it was found incorrect:


  • Core fact incorrect - if the main answer to the query is false

  • Additional facts incorrect - if the additional information to the query or topic is false

  • Not timely - if the response contains outdated information

  • Not relevant - if the response has claims with information not related to the main topic

  • Not specific - if the response has provided relevant information but does not specifically answer the query. 
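
The first two reasons follow mechanically from the labels assigned in steps 2a and 3a, which the sketch below illustrates. The `Claim` record and the function name are hypothetical, and the claim texts are placeholders. The remaining reasons (not timely, not relevant, not specific) depend on the annotator's judgement and are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    verdict: str   # "Correct" or "Incorrect" (step 2a)
    is_core: bool  # TRUE/FALSE core-answer label (step 3a)


def incorrectness_reasons(claims: List[Claim]) -> List[str]:
    """Derive the core/additional reasons from the incorrect claims."""
    reasons: List[str] = []
    for claim in claims:
        if claim.verdict != "Incorrect":
            continue
        reason = (
            "Core fact incorrect" if claim.is_core else "Additional facts incorrect"
        )
        if reason not in reasons:
            reasons.append(reason)
    return reasons


# Placeholder claims: a correct core answer and an incorrect additional fact.
claims = [
    Claim("core answer (placeholder)", "Correct", True),
    Claim("additional fact (placeholder)", "Incorrect", False),
]
print(incorrectness_reasons(claims))
```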

Step 1.2.1.5 Link to source of information

Step 5a: For each claim marked as Correct or Incorrect, include the main source link used to support your decision.


Reminder: Wikipedia can be used as evidence, but do not provide information from unreliable sources such as Medium, Reddit, Quora, blogs, social media sites, AI tools / LLMs, etc. Please refer to the main guidelines (Response Correctness Evaluation: Voice and Text.docx) for the complete list.

