
AI and Academic Research

A guide exploring the intersection between AI and academic research.

Assessing Generative AI Outputs

Using AI tools for simple, day-to-day information seeking may not require much critical evaluation when the accuracy of the answers is not terribly important. When doing academic research, however, it is necessary to ensure that the information contained in a generated response is valid and based on trustworthy sources.

Things to Consider

There are many kinds of generative AI tools, and how they work varies.  Some rely only on their training data to generate responses, some search the web and incorporate search results into responses, and some are limited to referencing a specific knowledge base when generating responses. While it is usually impossible to know specific details about how they function "under the hood," learning what you can about how the tool you are using works can be very important when evaluating outputs.  For example:

Training Data Only

Older and/or less resource-intensive versions of LLMs, like ChatGPT 3.5, do not connect to the internet and rely only on their training data to generate responses; these models have become well known for high rates of hallucination when responding to research prompts. In one 2023 study[1] that assessed the accuracy of references generated by ChatGPT 3.5 in its responses to medical questions, the authors found that of the references ChatGPT cited in its answers:

  • 47% were completely fabricated
  • 46% were to real papers but contained inaccurate dates, authors, DOIs, etc.
  • 7% were both authentic and accurate

While newer, internet-connected versions of chatbots have greatly reduced the frequency of this type of hallucination, inaccurate citation information can still be a problem.  

1. Bhattacharyya, M., Miller, V. M., Bhattacharyya, D., & Miller, L. E. (2023). High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus, 15(5), e39238. https://doi.org/10.7759/cureus.39238
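One practical way to catch the second category above, real papers cited with the wrong dates, authors, or DOIs, is to look the DOI up in a registry such as Crossref and compare the returned metadata against the citation. The short Python sketch below is only an illustration of that check: it assumes the public Crossref REST API (api.crossref.org) and the third-party requests package, and it uses the DOI from reference [1] above as its example.

```python
# Sketch: look up a DOI from an AI-generated citation in Crossref and print the
# registered metadata so it can be compared, by hand, against the citation.
# Assumes the public Crossref REST API and the third-party "requests" package.
import requests

def check_doi(doi: str) -> None:
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        print(f"{doi}: no Crossref record found -- the DOI may be fabricated")
        return
    resp.raise_for_status()
    record = resp.json()["message"]
    title = (record.get("title") or ["<no title>"])[0]
    journal = (record.get("container-title") or ["<no journal>"])[0]
    year = record.get("issued", {}).get("date-parts", [[None]])[0][0]
    print(f"{doi}\n  title:   {title}\n  journal: {journal}\n  year:    {year}")

check_doi("10.7759/cureus.39238")  # the DOI from reference [1] above
```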

Incorporates Web Searches

Most current iterations of LLMs, and of tools that leverage them, connect to the internet and incorporate web search results when generating responses. This reduces hallucinations and gives the reader a way to trace some of the information found in an AI-generated response. For example, current versions of ChatGPT (as of 2025) include information within their responses about where they searched and what sources they used.

Note that while this does reduce hallucinations, LLMs do still struggle with things like:

  • differentiating source types
  • correctly summarizing studies with complex or nuanced findings
  • ensuring that sources consulted include the most current, up-to-date information

For example, when prompted with research questions where the user specifies that scholarly sources should be used to generate the response, LLMs may confuse scholarly sources (e.g., peer-reviewed articles, scholarly monographs, etc.) with sources that quote from or use similar language to scholarly sources but are not necessarily scholarly sources themselves (e.g., study sites, student essays, faculty blog posts, predatory journal articles, etc.).  
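Because of this, it is worth glancing at where each cited link actually points before accepting that a response is built on scholarly sources. The Python sketch below is only a rough, hypothetical triage heuristic (the domain lists and example URLs are illustrative, not authoritative): it sorts cited URLs into broad buckets by domain, and every link still needs a human evaluation.

```python
# Rough triage of URLs cited in an AI-generated response. A heuristic
# illustration only: a domain cannot prove a source is peer reviewed, nor rule
# out a predatory journal, so each link still needs a human look.
from urllib.parse import urlparse

LIKELY_SCHOLARLY = ("doi.org", "pubmed.ncbi.nlm.nih.gov", "jstor.org")       # illustrative list
POSSIBLY_SCHOLARLY = (".edu", ".gov", "sciencedirect.com", "springer.com")   # illustrative list

def triage(urls: list[str]) -> None:
    for url in urls:
        host = urlparse(url).netloc.lower()
        if any(host.endswith(d) for d in LIKELY_SCHOLARLY):
            label = "likely scholarly (still verify)"
        elif any(host.endswith(d) for d in POSSIBLY_SCHOLARLY):
            label = "possibly scholarly"
        else:
            label = "check carefully (blog, forum, news, unknown)"
        print(f"{label:35} {url}")

# Hypothetical example links, for illustration only.
triage([
    "https://doi.org/10.7759/cureus.39238",
    "https://www.reddit.com/r/literature/comments/example",
    "https://english.example.edu/~student/gatsby-essay.html",
])
```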

Limited to a Specific Set of Data/Documents

Tools like Consensus, Elicit, and many organizations' in-house chatbot tools are all designed to base their responses on a specific set of data or documents.  This reduces hallucination and often results in more detailed, accurate answers to research questions.  Many of the generative AI tools developed specifically for research tasks use a corpus of academic documents pulled together for the Semantic Scholar research tool.  You can read more about what is included in this corpus here: Semantic Scholar - List of publisher partners.
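To make the idea of "basing responses on a specific corpus" concrete, the sketch below shows the kind of retrieval step such a tool might perform first: querying the public Semantic Scholar Graph API for papers on a topic before any text is generated. This is a simplified illustration under assumptions (the third-party requests package; no API key for light use), not how Consensus, Elicit, or any particular tool is actually implemented.

```python
# Sketch of the retrieval step a corpus-grounded research tool might perform:
# search the Semantic Scholar corpus for papers on a topic. In a real tool, the
# retrieved records would then be handed to an LLM as the only allowed evidence.
# Assumes the public Semantic Scholar Graph API and the "requests" package.
import requests

def search_corpus(topic: str, limit: int = 5) -> list[dict]:
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": topic, "limit": limit, "fields": "title,year,externalIds"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

for paper in search_corpus("green light symbolism The Great Gatsby"):
    doi = (paper.get("externalIds") or {}).get("DOI", "no DOI")
    print(f"{paper.get('year')}  {paper.get('title')}  ({doi})")
```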

Examples of Where to Find Info About Tools

Most AI tools, especially those most useful for research, have some sort of "about" or "how it works" section of their website that will give at least a high-level overview of how their tool functions.  While this will almost never give a complete summary of their training data or other technical details, it will often help you figure out what sort of tool it is, where it pulls its information from, and how best to form prompts.  Here are some examples of these types of pages:

Definition (a reminder)

In the context of generative AI, hallucination refers to false or misleading information presented as fact within an AI-generated response. Hallucinations can range from simple misstatements of fact to completely fabricated sources of information.

Examples

Here are a few prominent examples of AI hallucinations that made national headlines:

  • Two New York lawyers were sanctioned for submitting a legal brief that cited six nonexistent cases ChatGPT had fabricated when they used it to help research the case.
    (read the story here)
  • In 2024, Google's AI Overview tool told users they could "use glue to stick cheese to pizza" or "eat one rock per day."
    (read the story here)
  • In 2023, both Google and Microsoft's AI tools told users that a ceasefire had been reached in the Israel-Hamas conflict when, in fact, no such agreement had been made.
    (read the story here)

In Academic Research

Though recent improvements to generative AI tools make it less common to encounter responses as obviously problematic as those listed above, hallucination still occurs and can be highly problematic in academic research, particularly when novice researchers use LLM tools to generate lists of citations without knowing how to double-check the results. The following is a citation that appeared in a list of sources ChatGPT generated in response to a prompt asking for academic, peer-reviewed sources that discuss the symbolism of the green light in F. Scott Fitzgerald's The Great Gatsby. The breakdown that follows shows the citation's confusing combination of accurate and fabricated information:

Seed, David. "The Great Gatsby's Lost Decade." Twentieth Century Literature, vol. 58, no. 4, 2012, pp. 621-638.

  • Author (partially accurate): David Seed is a real professor of English who has published many research papers in literary studies, and he even published an article on Fitzgerald in 2015 titled "Party-Going in Fitzgerald and His Contemporaries." However, "The Great Gatsby's Lost Decade" does not appear on any list of publications associated with him, either on his faculty profile page or in any research database the librarians who created this guide consulted.
  • Title (fabricated): The article title, "The Great Gatsby's Lost Decade," is entirely fabricated. No record of an article with this title exists in the cited journal or anywhere else.
  • Journal title (partially accurate): Twentieth Century Literature is a real scholarly journal published by Duke University Press, and its scope would include an article like the one in this citation, lending the citation plausibility at first glance.
  • Vol., no., date & pages (partially accurate): The volume and issue number cited correctly match the year, and the pages cited are plausible: Vol. 58, No. 4 of Twentieth Century Literature was, in fact, published in 2012, and its page range includes pp. 555-728. However, no article in that issue exactly matches pp. 621-638; rather, a different article, on Robert Frost, runs from pp. 606-639.

 

Generative AI tools are trained on data that inevitably contains biases. Those biases shape the tools' outputs, meaning any patterns, stereotypes, or omissions present in the data are likely to be reflected in the responses. Companies do take steps during training to mitigate the most obvious biases, but it is impossible to eliminate them entirely. The following are a few examples of the types of bias that might affect responses (NOTE: this is not an exhaustive list):

Image Sets

When given the prompt "produce an image of an American soldier" (along with subsequent requests using the same prompt in additional sessions), ChatGPT produced the following:

[Three AI-generated images of an American soldier]

The consistent result, a white male soldier in a combat context despite the far wider demographic diversity of the U.S. military, likely reflects bias in the set of images the tool was trained on.

Languages

English is the predominant language used in training datasets for most AI models.  This not only means that LLM responses tend to be more accurate in English than in other languages [1], but that patterns, perspectives, and assumptions found in English sources have a disproportionate influence on the outputs of LLMs trained on mainly English datasets.   In the words of one recent paper studying the influence of English training data on LLM tools, "Their language notions are built in an English-centric system, and they inevitably bring traces of English habits into other languages when transferring their notions." [2]

1. Guo, Y., Conia, S., Zhou, Z., Li, M., Potdar, S., & Xiao, H. (2024). Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs. arXiv preprint arXiv:2410.15956. https://doi.org/10.48550/arXiv.2410.15956
2. Ibid.

Places

The image below comes from a 2023 study analyzing (among other things) the places named in training data sets used by Meta's Llama-2-70b LLM:

Map of place distribution in a training data set

Image from Gurnee, W., & Tegmark, M. (2023). Language Models Represent Space and Time (arXiv:2310.02207). arXiv. https://doi.org/10.48550/arXiv.2310.02207

Notice the much higher concentration of locations in Europe, the U.S., and parts of Asia vs. South America or Africa.  Having some places represented more than others in training data sets may result in a model's responses reflecting the views, attitudes, assumptions, etc. associated with the more represented places.

When the accuracy of information matters, fact-checking responses generated by AI models is essential.  It is often the case that you will get slightly (or even substantially) different responses from the same model using the same prompt when asked at different times, so confirming what an AI model gives you through other sources of information helps ensure the validity of the response.  For academic research, think of an AI-generated response as a starting place rather than a final answer.

Use the steps below as example guidelines for how to fact-check an AI-generated response. For simple, low-stakes research tasks using AI tools designed specifically to help with research, these steps may require little more than a brief scan of the list of sources cited in a response. For higher-stakes assignments and/or responses generated by a general-purpose LLM, significant time may be required to verify the validity of a response.

1. Identify the Sources of Information Used

Some AI-powered research tools, like Consensus or Elicit, base their responses entirely on a pre-defined set of research documents. Others, like Perplexity AI or ChatGPT with web search enabled, are designed to show you the web sources behind the information they provide. Note where and how Perplexity lists its sources of information in the screenshot of one of its responses below:

[Screenshot of a Perplexity response showing its list of cited sources]

If the AI tool you are using does not provide the sources used in its responses by default, use follow-up prompts to ask it to provide sources for its information. It can be helpful to ask specifically for links to the sources or, if the tool cannot provide them, for full citations for each source in APA (or MLA, or whatever style you prefer) so you can look them up yourself.

2. Find the Sources of Information Yourself

Once you have identified the sources used in the response, verify that they are real and that the tool summarized them accurately by finding them yourself. Sometimes this is as simple as clicking a provided link to get to the original source; other times it may involve a deeper search using a library database, Google Scholar, or another research tool. This is especially critical if you plan to use a quotation, statistic, or conclusion from a source. Generated summaries are great for helping you quickly sift through the main ideas of studies when you are working through a large volume of literature on your subject, but once you identify sources you plan to use in your own work, it is important to read and cite the originals yourself rather than rely on a generated summary.

Tracing sources can be challenging if the citations provided are partially inaccurate or refer to nonexistent articles, but librarians are happy to help if you are unable to verify a source on your own.
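If a citation has no working link and you want a quick first pass before searching library databases, one option is to run the citation text against a registry such as Crossref and see whether anything close actually exists. The sketch below assumes the public Crossref REST API and the requests package, and it uses the fabricated Gatsby citation from earlier in this guide, so no exact match should appear in the results.

```python
# Sketch: search a citation string against Crossref to see whether a matching
# article actually exists. Assumes the public Crossref REST API and "requests".
import requests

def find_citation(citation: str, rows: int = 5) -> None:
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        title = (item.get("title") or ["<no title>"])[0]
        journal = (item.get("container-title") or [""])[0]
        year = item.get("issued", {}).get("date-parts", [[None]])[0][0]
        print(f"{year}  {title}  --  {journal}")

# The fabricated citation from the Gatsby example above: expect no exact match.
find_citation('Seed, "The Great Gatsby\'s Lost Decade," Twentieth Century Literature, 2012')
```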

3. Evaluate the Quality of the Sources

By this we do not mean that you have to be a subject-matter expert who can give a full, in-depth critique of every website or research paper out there. We simply mean that you should take a minute to think critically about which sources an AI tool consulted. If an AI tool is using the top 3-4 results of a web search, the information it gives you will only be as accurate as those sources. Are they research studies? Student papers? News stories? Wikipedia articles? Reddit threads? Thinking through the following factors can be helpful:

  • Purpose: Why was the source created? To report news? To report the findings of a research study? To sell a product?
  • Authorship: Who created the source? A researcher? A student? An organization? A group of fans?
  • Currency: How old is it? For older sources, is it so old that it might no longer be accurate?
  • Evidence: Does the source cite other sources or its own original data/research? Does it assert facts or statistics without specifying where they came from?

4. Confirm Information in Additional Sources

If the information you are using is critical to a point you are making in a presentation or paper and/or you do not know enough about the sources it used to be fully confident in the validity of the generated response, it is crucial to confirm the information in additional sources.