> For example, if I ask it to first look for people with experience in Hadoop and then calculate for each of them how many years of experience they have in the field (I have a tool designed for basic info search and one designed for calculating periods of time), it will output a list with an answer to the first question, but will forget the second one.
>
> agents' behavior is often unpredictable; current tooling doesn't help too much
The only models that kinda pull this off are Sonnet 3.7+, Gemini 2.5 Pro/Flash, Grok 4+, and GPT-5. Of these, GPT-5 is a step up and actually does quite a good job. The others fail often and need to fall back on each other. GPT-4.1, GPT-4o, o4-mini, and GPT-5-mini fail more often than they work. Qwen 3 almost doesn't work at all, let alone any other Chinese models.
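
For what it's worth, the "fall back on each other" part can be sketched roughly like this. It is only an illustration under my own assumptions: `run_with_fallback`, `is_complete`, and the stub agents are hypothetical names, not any framework's API, and each agent is assumed to return a dict mapping a person to their years of experience (or `None` if the calculation step was skipped).

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical result shape: person -> years of experience, None if never calculated.
Result = Dict[str, Optional[float]]
Agent = Callable[[str], Result]


def is_complete(result: Result) -> bool:
    """True only if people were found AND every one of them has a computed duration."""
    return bool(result) and all(years is not None for years in result.values())


def run_with_fallback(prompt: str, agents: List[Tuple[str, Agent]]) -> Tuple[str, Result]:
    """Try each (name, agent) pair in order and return the first complete result."""
    for name, agent in agents:
        try:
            result = agent(prompt)
        except Exception:
            continue  # a crashed run is treated the same as a failed one
        if is_complete(result):
            return name, result
    raise RuntimeError("no model completed both steps")


# Stub agents standing in for real model calls (assumptions for the demo only).
def forgetful_agent(prompt: str) -> Result:
    # Answers the search step but "forgets" the calculation step.
    return {"alice": None, "bob": None}


def thorough_agent(prompt: str) -> Result:
    return {"alice": 4.5, "bob": 2.0}


winner, answer = run_with_fallback(
    "Find people with Hadoop experience, then compute years of experience for each.",
    [("model-a", forgetful_agent), ("model-b", thorough_agent)],
)
```

The point of `is_complete` is that the check for "did it do the second step" lives in plain code rather than in the prompt, so a model that stops after the list gets skipped instead of trusted.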