LLM-Comparison

LLM comparison - Groq Llama 3, GPT-3.5 Turbo, GPT-4o


This article briefly compares the three models in terms of average processing time, failure rate, and how processing time grows with query length.

Parameter settings

The three models compared are

models = ["llama3-70b-8192", "gpt-3.5-turbo-0125", "gpt-4o-2024-05-13"]

Among them, llama3-70b-8192 is accessed through the Groq API, and the GPT models through the OpenAI API. Both are used via their official Python client libraries, as follows:

import os
from groq import Groq
from openai import OpenAI

# Both clients read their API keys from environment variables.
groq_client = Groq(
    api_key=os.environ["GROQ_API_KEY"],
)
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

The test set consists of the first 50 entries of a file named logs.json, converted into a messages list that the models can consume. The logs.json file is not public.
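As a rough illustration of this preparation step (a sketch only: since logs.json is not public, the entry field name "text" used below is an assumption):

import json

# Load the private logs.json and keep only the first 50 entries.
with open("logs.json", "r", encoding="utf-8") as f:
    logs = json.load(f)[:50]

# Convert each entry into a chat-style messages list the models can consume.
# NOTE: "text" is an assumed field name; the real schema is not public.
test_set = [[{"role": "user", "content": entry["text"]}] for entry in logs]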

Test process

The test loops through the test set one query at a time, recording each query's length, processing time, and whether it failed. After every 10 queries, the script pauses for 60 seconds to avoid exceeding the request rate limits. The test set contains no duplicate texts, so caching should not affect the results in theory.
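The loop below is a minimal sketch of this procedure, building on the clients, models list, and test_set defined above; the run_query helper and the exact record fields are assumptions rather than the script actually used in the test.

import time

results = []

def run_query(model, messages):
    # Route llama models to Groq and GPT models to OpenAI; both Python
    # clients expose the same chat.completions.create interface.
    client = groq_client if model.startswith("llama") else openai_client
    return client.chat.completions.create(model=model, messages=messages)

for model in models:
    for i, messages in enumerate(test_set, start=1):
        start = time.time()
        try:
            response = run_query(model, messages)
            text = response.choices[0].message.content or ""
            success = len(text) > 1  # responses longer than one character count as valid
        except Exception:
            success = False
        results.append({
            "model": model,
            "query_length": len(messages[0]["content"]),
            "time": time.time() - start,
            "success": success,
        })
        if i % 10 == 0:
            time.sleep(60)  # pause every 10 queries to respect rate limits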

Results

Data summary:


| Model | Average time (s) | Failure rate | Total query length |
| --- | --- | --- | --- |
| llama3-70b-8192 | 1.4226796483993531 | 0.0 | 358776 |
| gpt-3.5-turbo-0125 | 3.4280353021621703 | 0.0 | 358776 |
| gpt-4o-2024-05-13 | 4.238370051383972 | 0.0 | 358776 |

Detailed per-query data are available in results.json.
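For reference, a summary like the table above could be derived from the per-query records in a few lines of Python; this assumes the record fields sketched in the test loop above, not the exact format of results.json.

# Aggregate per-model statistics from the per-query records.
for model in models:
    rows = [r for r in results if r["model"] == model]
    avg_time = sum(r["time"] for r in rows) / len(rows)
    fail_rate = sum(not r["success"] for r in rows) / len(rows)
    total_query_length = sum(r["query_length"] for r in rows)
    print(f"{model}: {avg_time:.2f} s avg, {fail_rate:.0%} failed, {total_query_length} chars queried")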

Final conclusion evaluation

  1. Average response time: llama3-70b-8192 was the fastest (about 1.42 s on average), followed by gpt-3.5-turbo-0125 (about 3.43 s) and gpt-4o-2024-05-13 (about 4.24 s).
  2. Failure rate: all three models completed every query, with a failure rate of 0.
  3. Relationship between query length and response time: see the detailed per-query data in results.json.

Overall evaluation

Overall, llama3-70b-8192 is the best choice in this test, with the fastest response times and stable, failure-free performance.

Limitations

The limitations of this test include but are not limited to the following:

  1. Limited test set size: the test uses only the first 50 queries from the file logs.json, which may not fully represent queries in real scenarios.
  2. Text quality: the test counts any response whose returned text is longer than one character as valid and does not evaluate the quality of the generated text.
  3. Client: the test used the Groq Python client and the OpenAI Python client; other access methods (such as raw POST requests) may produce different results.

Please note that these limitations apply only to the specific conditions and settings of this test; real-world applications may involve other limitations that require further consideration and evaluation.


LLM-Comparison by Haozhe Li is licensed under CC BY-NC 4.0