#llm-heuristics | Explore Tumblr posts and blogs

canmom · 1 year ago

Text

i was going around thinking neural networks are basically stateless pure functions of their inputs, and this was a major difference between how humans think (i.e., that we can 'spend time thinking about stuff' and get closer to an answer without receiving any new inputs) and artificial neural networks. so I thought that for a large language model to be able to maintain consistency while spitting out a long enough piece of text, it would have to have as many inputs as there are tokens.

apparently i'm completely wrong about this! for a good while the state of the art has been using recurrent neural networks which allow the neuron state to change, with techniques including things like 'long short-term memory units' and 'gated recurrent units'. they look like a little electric circuit, and they combine the input with the state of the node in the previous step, and the way that the neural network combines these things and how quickly it forgets stuff is all something that gets trained at the same time as everything else. (edit: this is apparently no longer the state of the art, the state of the art has gone back to being stateless pure functions? so shows what i know. leaving the rest up because it doesn't necessarily depend too much on these particulars)

which means they can presumably create a compressed representation of 'stuff they've seen before' without having to treat the whole thing as an input. and it also implies they might develop something you could sort of call an 'emotional state', in the very abstract sense of a transient state that affects its behaviour.

I'm not an AI person, I like knowing how and why stuff works and AI tends to obfuscate that. but this whole process of 'can we build cognition from scratch' is kind of fascinating to see. in part because it shows what humans are really good at.

I watched this video of an AI learning to play pokémon...

youtube

over thousands of simulated game hours the relatively simple AI, driven by a few simple objectives (see new screens, level its pokémon, don't lose) learned to beat Brock before getting stuck inside the following cave. it's got a really adorable visualisation of thousands of AI characters on different runs spreading out all over the map. but anyway there's a place where the AI would easily fall off an edge and get stuck, unable to work out that it could walk a screen to the right and find out a one-tile path upwards.

for a human this is trivial: we learn pretty quickly to identify a symbolic representation to order the game world (this sprite is a ledge, ledges are one-way, this is what a gap you can climb looks like) and we can reason about it (if there is no exit visible on the screen, there might be one on the next screen). we can also formulate this in terms of language. maybe if you took a LLM and gave it some kind of chain of thought prompt, it could figure out how to walk out of that as well. but as we all know, LLMs are prone to propagating errors and hallucinating, and really bad at catching subtle logical errors.

other types of computer system like computer algebra systems and traditional style chess engines like stockfish (as opposed to the newer deep learning engines) are much better at humans at this kind of long chain of abstract logical inference. but they don't have access to the sort of heuristic, approximate guesswork approach that the large language models do.

it turns out that you kind of need both these things to function as a human does, and integrating them is not trivial. a human might think like 'oh I have the seed of an idea, now let me work out the details and see if it checks out' - I don't know if we've made AI that is capable of that kind of approach yet.

AIs are also... way slower at learning than humans are, in a qualified sense. that small squishy blob of proteins can learn things like walking, vision and language from vastly sparser input with far less energy than a neural network. but of course the neural networks have the cheat of running in parallel or on a faster processor, so as long as the rest of the problem can be sped up compared to what a human can handle (e.g. running a videogame or simulation faster), it's possible to train the AI for so much virtual time that it can surpass a human. but this approach only works in certain domains.

I have no way to know whether the current 'AI spring' is going to keep getting rapid results. we're running up against limits of data and compute already, and that's only gonna get more severe once we start running into mineral and energy scarcity later in this century. but man I would totally not have predicted the simultaneous rise of LLMs and GANs a couple years ago so, fuck knows where this is all going.

#ai #Youtube

12 notes · View notes

kirabug-tumbles · 3 days ago

Text

I’m a UX designer - one of the jobs that is “being replaced” by chatGPT. My job is to talk to the product owners to get the business goals, talk to the end users to get the user goals, measure usage and feedback about current and future designs, make sure that every element on the page works the way users expect it to, make sure it’s accessible for people with disabilities, make sure it looks attractive and enticing and comfortable to use, etc. etc.

I’ve been told that LLMs can replace my job. They can tell you what roughly 80% of the people would say (but not for our specific people, just “people”) and they can design an interface based on what most of the interfaces they’ve been trained on would do.

Can they build an address form? Yeah, probably, for a standard US address that accepts any type of address.

Can they build an address form that works for retirees that have both international and US addresses because they “snowbird” to a different country every year, accept the dates for when they leave one address to go to the other and vice versa, reject PO Boxes, suggest address adjustments based on USPS standards, and compare international addresses against countries we’re not allowed to mail to? Hell no.

Can they handle error messaging? For one error yes but for the entire form validation process my developers are way better than the LLM and English is their second language.

The thing is, jobs are hard. They’re supposed to be hard. You wouldn’t be getting paid if they were easy.

And yes, there were a bunch of developers (and designers) in some FAANG* companies sitting on their hands getting paid for not doing a lot of work during COVID. That doesn’t mean their work was easy. It means they were under-utilized. And usually that’s more about bad management than the tech people.

To get where I am as a six-digit-salary UX designer I got a bachelor’s degree in English, a master’s degree in software engineering, seven years in customer support, and close to 20 years in UX design. I’m currently studying for my IAACP certification in accessibility. And yeah, there are a lot of things I don’t need to research anymore because I’ve done the work (or worked with amazing researchers) so many times that I’m probably going to guess right the first time.

But if someone called me tomorrow and asked me to do a user interview or a usability test, audit a content library, test a site for accessibility, do a heuristic evaluation of an interface, pick some colors for a brand, or debug vanilla JavaScript, I’ve done the work, I know both the concepts and the theory, and at worst I need to dust off my notes.

An LLM can tell you what Fitt’s Law is. It can even tell you when to use it. It can’t turn around and design an interface where it intentionally uses Fitt’s Law. It can’t look at a design and say “the problem here is these two controls are too far apart.”

That’s what humans are for.

As for what happens if you come in to our line of work and can’t do the work, well, either a company will see something in the business arrangement that makes it worth their while to train you up (unlikely) or they will let you go. And you will become one of thousands of people bitching in social media that you can’t find a way to break in to UX and that we’re gatekeeping.

Yeah, we’re gatekeeping on skill. You need to put in the work.

Why are you using chatgpt to get through college. Why are you spending so much time and money on something just to be functionally illiterate and have zero new skills at the end of it all. Literally shooting yourself in the foot. If you want to waste thirty grand you can always just buy a sportscar.

#you guys know that the purpose of college is to learn how to actually do something right #like to build a specific skill #what do you think will happen if you enter the workforce in a skilled job and dont have that skill

21K notes · View notes

govindhtech · 1 month ago

Text

AlphaEvolve Coding Agent using LLM Algorithmic Innovation

AlphaEvolve

Large language models drive AlphaEvolve, a powerful coding agent that discovers and optimises difficult algorithms. It solves both complex and simple mathematical and computational issues.

AlphaEvolve combines automated assessors' rigour with LLMs' creativity. This combination lets it validate solutions and impartially assess their quality and correctness. AlphaEvolve uses evolution to refine its best ideas. It coordinates an autonomous pipeline that queries LLMs and calculates to develop algorithms for user-specified goals. An evolutionary method improves automated evaluation metrics scores by building programs.

Human users define the goal, set assessment requirements, and provide an initial solution or code skeleton. The user must provide a way, usually a function, to automatically evaluate produced solutions by mapping them to scalar metrics to be maximised. AlphaEvolve lets users annotate code blocks in a codebase that the system will build. As a skeleton, the remaining code lets you evaluate the developed parts. Though simple, the initial program must be complete.

AlphaEvolve can evolve a search algorithm, the solution, or a function that creates the solution. These methods may help depending on the situation.

AlphaEvolve's key components are:

AlphaEvolve uses cutting-edge LLMs like Gemini 2.0 Flash and Gemini 2.0 Pro. Gemini Pro offers deep and insightful suggestions, while Gemini Flash's efficiency maximises the exploration of many topics. This ensemble technique balances throughput and solution quality. The major job of LLMs is to assess present solutions and recommend improvements. AlphaEvolve's performance is improved with powerful LLMs despite being model-agnostic. LLMs either generate whole code blocks for brief or completely changed code or diff-style code adjustments for focused updates.

Prompt Sample:

This section pulls programs from the Program database to build LLM prompts. Equations, code samples, relevant literature, human-written directions, stochastic formatting, and displayed evaluation results can enhance prompts. Another method is meta-prompt evolution, where the LLM suggests prompts.

Pool of Evaluators

This runs and evaluates proposed programs using user-provided automatic evaluation metrics. These measures assess solution quality objectively. AlphaEvolve may evaluate answers on progressively complicated scenarios in cascades to quickly eliminate less promising examples. It also provides LLM-generated feedback on desirable features that measurements cannot measure. Parallel evaluation speeds up the process. AlphaEvolve optimises multiple metrics. AlphaEvolve can only solve problems with machine-grade solutions, but its automated assessment prevents LLM hallucinations.

The program database stores created solutions and examination results. It uses an evolutionary algorithm inspired by island models and MAP-elites to manage the pool of solutions and choose models for future generations to balance exploration and exploitation.

Distributed Pipeline:

AlphaEvolve is an asynchronous computing pipeline developed in Python using asyncio. This pipeline with a controller, LLM samplers, and assessment nodes is tailored for throughput to produce and evaluate more ideas within a budget.

AlphaEvolve has excelled in several fields:

It improved hardware, data centres, and AI training across Google's computing ecosystem.

AlphaEvolve recovers 0.7% of Google's worldwide computer resources using its Borg cluster management system heuristic. This in-production solution's performance and human-readable code improve interpretability, debuggability, predictability, and deployment.

It suggested recreating a critical arithmetic circuit in Google's Tensor Processing Units (TPUs) in Verilog, removing unnecessary bits, and putting it into a future TPU. AlphaEvolve can aid with hardware design by suggesting improvements to popular hardware languages.

It sped up a fundamental kernel in Gemini's architecture by 23% and reduced training time by 1% by finding better ways to partition massive matrix multiplication operations, increasing AI performance and research. Thus, kernel optimisation engineering time was considerably reduced. This is the first time Gemini optimised its training technique with AlphaEvolve.

AlphaEvolve optimises low-level GPU operations to speed up Transformer FlashAttention kernel implementation by 32.5%. It can optimise compiler Intermediate Representations (IRs), indicating promise for incorporating AlphaEvolve into the compiler workflow or adding these optimisations to current compilers.

AlphaEvolve developed breakthrough gradient-based optimisation processes that led to novel matrix multiplication algorithms in mathematics and algorithm discovery. It enhanced Strassen's 1969 approach by multiplying 4x4 complex-valued matrices with 48 scalar multiplications. AlphaEvolve matched or outperformed best solutions for many matrix multiplication methods.

When applied to over 50 open mathematics problems, AlphaEvolve enhanced best-known solutions in 20% and rediscovered state-of-the-art solutions in 75%. It improved the kissing number problem by finding a configuration that set a new lower bound in 11 dimensions. Additionally, it improved bounds on packing difficulties, Erdős's minimum overlap problem, uncertainty principles, and autocorrelation inequalities. These results were often achieved by AlphaEvolve using problem-specific heuristic search strategies.

AlphaEvolve outperforms FunSearch due to its capacity to evolve across codebases, support for many metrics, and use of frontier LLMs with rich context. It differs from evolutionary programming by automating evolution operator creation via LLMs. It improves artificial intelligence mathematics and science by superoptimizing code.

One limitation of AlphaEvolve is that it requires automated evaluation problems. Manual experimentation is not among its capabilities. LLM evaluation is possible but not the major focus.

AlphaEvolve should improve as LLMs code better. Google is exploring a wider access program and an Early Access Program for academics. AlphaEvolve's broad scope suggests game-changing uses in business, sustainability, medical development, and material research. Future phases include reducing AlphaEvolve's performance to base LLMs and maybe integrating natural-language feedback approaches.

#AlphaEvolve #googleAlphaEvolve #codingagent #AlphaEvolveCodingAgent #googleCodingAgent #largelanguagemodels #technology #technologynews #technews #news #govindhtech

0 notes

realbadatpoker · 2 months ago

Text

Imma be honest with you, the example this rando uses to "prove" the AI is faking it is exactly how I would do that math, and that is exactly the lie I would tell if you asked me to explain it. This is 100% me-level agi.

This is the process the LLM used to solve: 36+59 = 95:

We now reproduce the attribution graph for calc: 36+59=. Low-precision features for “add something near 57” feed into a lookup table feature for “add something near 36 to something near 60”, which in turn feeds into a “the sum is near 92” feature. This low-precision pathway complements the high precision modular features on the right (“left operand ends in a 9” feeds into “add something ending exactly with 9” feeds into “add something ending with 6 to something ending with 9” feeds into “the sum ends in 5”). These combine to give the correct sum of 95.

This process represents a series of heuristics and lookup tables of memorized patterns. So, when the LLM is asked to describe the method it used to solve the calculation, it responds with this:

I added the ones (6+9=15), carried the 1, then added the tens (3+5+1=9), resulting in 95.

#ai #agi #artificial intelligence #artificial general intelligence #robot apocalypse #mindprison #Anthropic #claude #math #lies

0 notes

cybereliasacademy · 1 year ago

Text

Navigation with Large Language Models: Discussion and References

Subscribe .t1f01df44-f2da-4be0-b7c8-94943f1b14e8 { color: #fff; background: #222; border: 1px solid transparent; border-radius: undefinedpx; padding: 8px 21px; } .t1f01df44-f2da-4be0-b7c8-94943f1b14e8.place-top { margin-top: -10px; } .t1f01df44-f2da-4be0-b7c8-94943f1b14e8.place-top::before { content: “”; background-color: inherit; position: absolute; z-index: 2; width: 20px; height: 12px; }…

View On WordPress

#goal-directed-exploration #language-frontier-guide #large-language-models #llm-heuristics #navigation-with-llm #polling-llms #scoring-subgoals #semantic-scene-understanding

0 notes

urpriest · 3 months ago

Text

"Intelligence is computable"

You know, there are a lot of variants of that statement that I would agree with, but I don't think I can agree with that one.

Certainly, some of what we do with intelligence is computable. When we say a person is intelligent, we're talking about a bunch of different capabilities that tend to be correlated in humans, and some of those capabilities correspond pretty directly to computation: recall, valid reasoning, following instructions, etc.

But most of the things we attribute to human intelligence, and most of the things AI research tries to create, are not actually computable. Computability is a matter of whether an algorithm can solve a problem. But LLMs aren't algorithms, and for the most part humans aren't either. We, and they, are heuristics. And heuristics don't solve problems, they don't compute. They happen to get the right answer sometimes, enough to be useful. But they don't come with the kinds of guarantees that computability in the technical sense demands. The thing they are trying to do is, itself, not a computable function, often because it is not a function to begin with. Language use is a great example: sentences are ambiguous! We interpret them in an assumed context, and each of us assumes something subtly different. There isn't a "right way" to interpret a text, just ways a given human might be more or less likely to interpret it.

(There's also a border of questions where you can characterize a right answer, but neither humans nor any technology we have access to has the computational power to solve it, like throwing something to hit a target under realistic conditions. Humans can't actually do this in the computational sense, they just have heuristics that are ok enough at it most of the time.)

I certainly would agree that humans are simulable, and even that one could have a machine reproduce the valuable parts of human behavior without directly simulating a human being. But I think it's a mistake to think of that in terms of computability. The behavior of any given human is computable. But intelligence goes beyond that, it's an assumed goal of the human heuristics overall, and it's a goal that we only know how to approach with heuristics, human or artificial, not with algorithmic computation.

I keep chewing on this topic but people really seem to struggle to accept that intelligence is computable (in the technical sense of the term), presumably for the same reason they struggle to accept that humans are made of atoms (or "humans are complex assemblages of molecules").

but the consequence of this worldview is that anything that can be computed is not intelligence, so we end up with this god of the gaps situation where the definition of intelligence is constantly shrinking to exclude things that computers can now do.

a lot of religious people just say intelligence = soul and end the discussion there, which is obviously crazy but actually more defensible than the muddled middle who theoretically accept how the universe works but in practice retreat into vague platitudes that make no sense at all.

205 notes · View notes

ai-news · 8 months ago

Link

A key question about LLMs is whether they solve reasoning tasks by learning transferable algorithms or simply memorizing training data. This distinction matters: while memorization might handle familiar tasks, true algorithmic understanding allows f #AI #ML #Automation

#Blockchain #Crypto

0 notes

tutorsindia152 · 1 year ago

Text

A comparison of a literature review by ChatGPT and expert writers

ChatGPT- an innovative content generation tool

ChatGPT, an innovative content generation tool developed by OpenAI, employs machine learning algorithms to create text (Ariyaratne, 2023). Early users of ChatGPT have commended its ability to generate high-quality content, thanks to its utilization of Large Language Models (LLMs) trained on extensive data to predict the most suitable text, resulting in natural-sounding output. University students are increasingly turning to ChatGPT for assistance with assignments, literature reviews, and dissertations.

Nevertheless, several articles have raised concerns about the potential obsolescence of traditional assignments due to ChatGPT's ability to produce high-scoring papers and encourage critical thinking. The ethical and permissible boundaries of using ChatGPT in scientific writing remain uncertain (Gao, 2022).

Studies on the reliability of ChatGPT content

Research has been undertaken to rigorously assess the capabilities of chatbots in academic contexts. Yeadon et al. (2023) conducted a study evaluating ten physics essays for academic content, topic comprehension, and writing style. Their findings indicated that essays authored by AI exhibited high quality, albeit with variations similar to those found in human-written essays.

Herbold et al. (2023) compared essays produced by humans with those generated by ChatGPT, revealing that the latter yielded high-quality content, particularly in crafting argumentative essays. The AI-generated content displayed a structured format, often featuring similar introductory and concluding phrases like "Overall," "In summary," and "In conclusion." In contrast, human-written essays tended to be less structured, adhering less strictly to preset guidelines.

However, researchers have highlighted limitations of ChatGPT for academic research, notably its inability to generate original ideas, relying instead on preexisting training data. This reliance increases the risk of plagiarism significantly. Ariyaratne et al.'s 2023 study found that four out of five papers generated by ChatGPT contained errors, with one even including false references, raising concerns about the tool's authenticity in academic research. Kleebayon (2023) observed a tendency for ChatGPT to produce biased content.

ChatGPT vs Expert Writers

Liao and Sundar (2023) conducted an experiment describing the MAIN model for communication. The following is an excerpt from the study, serving as evidence that ChatGPT cannot be utilized for conducting a literature review:

“Prompt: Explain the MAIN Model by Shyam Sundar

ChatGPT: The MAIN Model, created by S. Shyam Sundar, is a theoretical framework that elucidates how individuals process and comprehend information they come across online and how the structure of digital media platforms influences this process. The abbreviation MAIN represents Media Affordances and Heuristics-based Information Processing. The model proposes that the structure of digital media platforms, particularly the technological capabilities of the platform, shapes the cognitive shortcuts that users utilize when processing information. Technological capabilities are the features and opportunities that a platform offers for users to engage with information, such as browsing, searching, and filtering. Cognitive shortcuts are mental strategies that people employ to simplify decision-making and information processing, such as the availability heuristic, the representativeness heuristic, and the affect heuristic. The MAIN model indicates that the structure of digital media platforms impacts how individuals process information, as different capabilities and shortcuts can result in varying outcomes in information processing. For instance, a platform with robust search capabilities may prompt users to use the availability heuristic and depend on the initial results they encounter, while a platform with strong social features may prompt users to use the affect heuristic and rely on others' opinions.”

To the uninitiated, the response appears correct. However, specialists in the field can discern its accuracy. Similarly, if a student solely references ChatGPT without consulting academic literature for their literature review, they may misinterpret concepts, potentially negatively impacting their review.

Furthermore, scholars argue that it is unethical to utilize ChatGPT for literature reviews and other academic research endeavors when it cannot serve as a co-author (Haman, 2023).

Here is the content for the MAIN model, written by a human:

The MAIN model, conceptualized by Shyam Sundar, is an acronym representing the four primary functions of media that can have psychological impacts on the audience: Modality (M), Agency (A), Interactivity (I), and Navigability (N). While the source and content of digital media are vital in establishing perceived credibility, the MAIN model emphasizes the technological attributes of digital media that can shape judgments of credibility.

Modality pertains to the format of the media and is the most visible aspect. It closely aligns with the notion of medium because traditionally, media exhibit different modalities, with print primarily textual, radio auditory, and television audiovisual. Agency, or the agent, serves as the origin, particularly on a psychological level, especially in the absence of another attributed source for specific information. Interactivity encompasses both interaction and activity. It also suggests that the medium is responsive to user input and capable of adapting to variations in user engagement. Through the provision of navigation tools, such as steering wheels, the interface can foster a sense of self-direction in users. Contemporary media offer users numerous choices for how to access and navigate through a mediated environment (Sundar, 2015).

Conclusion

ChatGPT is an Artificial Intelligence tool based on Large Language Models, trained to generate content resembling human writing. It has garnered considerable attention for its ability to produce content. Studies have demonstrated ChatGPT's capability to generate organized essays. However, there have been concerns regarding the reliability of sources, potential plagiarism, and content quality. A comparison between ChatGPT-generated content and human-authored content reveals inaccuracies in ChatGPT's output. Thus, students should approach its use for assignments and dissertations with caution. Nonetheless, ChatGPT can still be beneficial for structuring content.

About Tutors India

At Tutors India, we are a team of academic researchers and writers who assist students with writing a literature review. Our team is diverse, enabling us to guide students specialising in various disciplines. We ensure selecting articles from reputed journals and abide by the university guidelines for assignments, essays and dissertations. Moreover, we proofread and edit the content, ensuring the review is error- and plagiarism-free.

To know more about how a literature review is written in various fields, check out our literature review examples.

#tutorsindia #dissertationwritinghelp #researchproposal #chatgpt #chatgptvshuman #ai vs artists #literaturereview #dissertation #dissertation writing help #essaywriting #research #full dissertation help #research methodology

1 note · View note

jhavelikes · 2 years ago

Quote

Large Language Models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations) which can result in them making plausible but incorrect statements [1,2]. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pre-trained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best known results in important problems, pushing the boundary of existing LLM-based approaches [3]. Applying FunSearch to a central problem in extremal combinatorics — the cap set problem — we discover new constructions of large cap sets going beyond the best known ones, both in finite dimensional and asymptotic cases. This represents the first discoveries made for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve upon widely used baselines. In contrast to most computer search approaches, FunSearch searches for programs that describe how to solve a problem, rather than what the solution is. Beyond being an effective and scalable strategy, discovered programs tend to be more interpretable than raw solutions, enabling feedback loops between domain experts and FunSearch, and the deployment of such programs in real-world applications.

Mathematical discoveries from program search with large language models | Nature

0 notes

jcmarchi · 26 days ago

Text

New Research Papers Question ‘Token’ Pricing for AI Chats

New Post has been published on https://thedigitalinsider.com/new-research-papers-question-token-pricing-for-ai-chats/

New Research Papers Question ‘Token’ Pricing for AI Chats

New research shows that the way AI services bill by tokens hides the real cost from users. Providers can quietly inflate charges by fudging token counts or slipping in hidden steps. Some systems run extra processes that don’t affect the output but still show up on the bill. Auditing tools have been proposed, but without real oversight, users are left paying for more than they realize.

In nearly all cases, what we as consumers pay for AI-powered chat interfaces, such as ChatGPT-4o, is currently measured in tokens: invisible units of text that go unnoticed during use, yet are counted with exact precision for billing purposes; and though each exchange is priced by the number of tokens processed, the user has no direct way to confirm the count.

Despite our (at best) imperfect understanding of what we get for our purchased ‘token’ unit, token-based billing has become the standard approach across providers, resting on what may prove to be a precarious assumption of trust.

Token Words

A token is not quite the same as a word, though it often plays a similar role, and most providers use the term ‘token’ to describe small units of text such as words, punctuation marks, or word-fragments. The word ‘unbelievable’, for example, might be counted as a single token by one system, while another might split it into un, believ and able, with each piece increasing the cost.

This system applies to both the text a user inputs and the model’s reply, with the price based on the total number of these units.

The difficulty lies in the fact that users do not get to see this process. Most interfaces do not show token counts while a conversation is happening, and the way tokens are calculated is hard to reproduce. Even if a count is shown after a reply, it is too late to tell whether it was fair, creating a mismatch between what the user sees and what they are paying for.

Recent research points to deeper problems: one study shows how providers can overcharge without ever breaking the rules, simply by inflating token counts in ways that the user cannot see; another reveals the mismatch between what interfaces display and what is actually billed, leaving users with the illusion of efficiency where there may be none; and a third exposes how models routinely generate internal reasoning steps that are never shown to the user, yet still appear on the invoice.

The findings depict a system that seems precise, with exact numbers implying clarity, yet whose underlying logic remains hidden. Whether this is by design, or a structural flaw, the result is the same: users pay for more than they can see, and often more than they expect.

Cheaper by the Dozen?

In the first of these papers – titled Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives, from four researchers at the Max Planck Institute for Software Systems – the authors argue that the risks of token-based billing extend beyond opacity, pointing to a built-in incentive for providers to inflate token counts:

‘The core of the problem lies in the fact that the tokenization of a string is not unique. For example, consider that the user submits the prompt “Where does the next NeurIPS take place?” to the provider, the provider feeds it into an LLM, and the model generates the output “|San| Diego|” consisting of two tokens.

‘Since the user is oblivious to the generative process, a self-serving provider has the capacity to misreport the tokenization of the output to the user without even changing the underlying string. For instance, the provider could simply share the tokenization “|S|a|n| |D|i|e|g|o|” and overcharge the user for nine tokens instead of two!’

The paper presents a heuristic capable of performing this kind of disingenuous calculation without altering visible output, and without violating plausibility under typical decoding settings. Tested on models from the LLaMA, Mistral and Gemma series, using real prompts, the method achieves measurable overcharges without appearing anomalous:

Token inflation using ‘plausible misreporting’. Each panel shows the percentage of overcharged tokens resulting from a provider applying Algorithm 1 to outputs from 400 LMSYS prompts, under varying sampling parameters (m and p). All outputs were generated at temperature 1.3, with five repetitions per setting to calculate 90% confidence intervals. Source: https://arxiv.org/pdf/2505.21627

To address the problem, the researchers call for billing based on character count rather than tokens, arguing that this is the only approach that gives providers a reason to report usage honestly, and contending that if the goal is fair pricing, then tying cost to visible characters, not hidden processes, is the only option that stands up to scrutiny. Character-based pricing, they argue, would remove the motive to misreport while also rewarding shorter, more efficient outputs.

Here there are a number of extra considerations, however (in most cases conceded by the authors). Firstly, the character-based scheme proposed introduces additional business logic that may favor the vendor over the consumer:

‘[A] provider that never misreports has a clear incentive to generate the shortest possible output token sequence, and improve current tokenization algorithms such as BPE, so that they compress the output token sequence as much as possible’

The optimistic motif here is that the vendor is thus encouraged to produce concise and more meaningful and valuable output. In practice, there are obviously less virtuous ways for a provider to reduce text-count.

Secondly, it is reasonable to assume, the authors state, that companies would likely require legislation in order to transit from the arcane token system to a clearer, text-based billing method. Down the line, an insurgent startup may decide to differentiate their product by launching it with this kind of pricing model; but anyone with a truly competitive product (and operating at a lower scale than EEE category) is disincentivized to do this.

Finally, larcenous algorithms such as the authors propose would come with their own computational cost; if the expense of calculating an ‘upcharge’ exceeded the potential profit benefit, the scheme would clearly have no merit. However the researchers emphasize that their proposed algorithm is effective and economical.

The authors provide the code for their theories at GitHub.

The Switch

The second paper – titled Invisible Tokens, Visible Bills: The Urgent Need to Audit Hidden Operations in Opaque LLM Services, from researchers at the University of Maryland and Berkeley – argues that misaligned incentives in commercial language model APIs are not limited to token splitting, but extend to entire classes of hidden operations.

These include internal model calls, speculative reasoning, tool usage, and multi-agent interactions – all of which may be billed to the user without visibility or recourse.

Pricing and transparency of reasoning LLM APIs across major providers. All listed services charge users for hidden internal reasoning tokens, and none make these tokens visible at runtime. Costs vary significantly, with OpenAI’s o1-pro model charging ten times more per million tokens than Claude Opus 4 or Gemini 2.5 Pro, despite equal opacity. Source: https://www.arxiv.org/pdf/2505.18471

Unlike conventional billing, where the quantity and quality of services are verifiable, the authors contend that today’s LLM platforms operate under structural opacity: users are charged based on reported token and API usage, but have no means to confirm that these metrics reflect real or necessary work.

The paper identifies two key forms of manipulation: quantity inflation, where the number of tokens or calls is increased without user benefit; and quality downgrade, where lower-performing models or tools are silently used in place of premium components:

‘In reasoning LLM APIs, providers often maintain multiple variants of the same model family, differing in capacity, training data, or optimization strategy (e.g., ChatGPT o1, o3). Model downgrade refers to the silent substitution of lower-cost models, which may introduce misalignment between expected and actual service quality.

‘For example, a prompt may be processed by a smaller-sized model, while billing remains unchanged. This practice is difficult for users to detect, as the final answer may still appear plausible for many tasks.’

The paper documents instances where more than ninety percent of billed tokens were never shown to users, with internal reasoning inflating token usage by a factor greater than twenty. Justified or not, the opacity of these steps denies users any basis for evaluating their relevance or legitimacy.

In agentic systems, the opacity increases, as internal exchanges between AI agents can each incur charges without meaningfully affecting the final output:

‘Beyond internal reasoning, agents communicate by exchanging prompts, summaries, and planning instructions. Each agent both interprets inputs from others and generates outputs to guide the workflow. These inter-agent messages may consume substantial tokens, which are often not directly visible to end users.

‘All tokens consumed during agent coordination, including generated prompts, responses, and tool-related instructions, are typically not surfaced to the user. When the agents themselves use reasoning models, billing becomes even more opaque’

To confront these issues, the authors propose a layered auditing framework involving cryptographic proofs of internal activity, verifiable markers of model or tool identity, and independent oversight. The underlying concern, however, is structural: current LLM billing schemes depend on a persistent asymmetry of information, leaving users exposed to costs that they cannot verify or break down.

Counting the Invisible

The final paper, from researchers at the University of Maryland, re-frames the billing problem not as a question of misuse or misreporting, but of structure. The paper – titled CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs, and from ten researchers at the University of Maryland – observes that most commercial LLM services now hide the intermediate reasoning that contributes to a model’s final answer, yet still charge for those tokens.

The paper asserts that this creates an unobservable billing surface where entire sequences can be fabricated, injected, or inflated without detection*:

‘[This] invisibility allows providers to misreport token counts or inject low-cost, fabricated reasoning tokens to artificially inflate token counts. We refer to this practice as token count inflation.

‘For instance, a single high-efficiency ARC-AGI run by OpenAI’s o3 model consumed 111 million tokens, costing $66,772.3 Given this scale, even small manipulations can lead to substantial financial impact.

‘Such information asymmetry allows AI companies to significantly overcharge users, thereby undermining their interests.’

To counter this asymmetry, the authors propose CoIn, a third-party auditing system designed to verify hidden tokens without revealing their contents, and which uses hashed fingerprints and semantic checks to spot signs of inflation.

Overview of the CoIn auditing system for opaque commercial LLMs. Panel A shows how reasoning token embeddings are hashed into a Merkle tree for token count verification without revealing token contents. Panel B illustrates semantic validity checks, where lightweight neural networks compare reasoning blocks to the final answer. Together, these components allow third-party auditors to detect hidden token inflation while preserving the confidentiality of proprietary model behavior. Source: https://arxiv.org/pdf/2505.13778

One component verifies token counts cryptographically using a Merkle tree; the other assesses the relevance of the hidden content by comparing it to the answer embedding. This allows auditors to detect padding or irrelevance – signs that tokens are being inserted simply to hike up the bill.

When deployed in tests, CoIn achieved a detection success rate of nearly 95% for some forms of inflation, with minimal exposure of the underlying data. Though the system still depends on voluntary cooperation from providers, and has limited resolution in edge cases, its broader point is unmistakable: the very architecture of current LLM billing assumes an honesty that cannot be verified.

Conclusion

Besides the advantage of gaining pre-payment from users, a scrip-based currency (such as the ‘buzz’ system at CivitAI) helps to abstract users away from the true value of the currency they are spending, or the commodity they are buying. Likewise, giving a vendor leeway to define their own units of measurement further leaves the consumer in the dark about what they are actually spending, in terms of real money.

Like the lack of clocks in Las Vegas, measures of this kind are often aimed at making the consumer reckless or indifferent to cost.

The scarcely-understood token, which can be consumed and defined in so many ways, is perhaps not a suitable unit of measurement for LLM consumption – not least because it can cost many times more tokens to calculate a poorer LLM result in a non-English language, compared to an English-based session.

However, character-based output, as suggested by the Max Planck researchers, would likely favor more concise languages and penalize naturally verbose languages. Since visual indications such as a depreciating token counter would probably make us a little more spendthrift in our LLM sessions, it seems unlikely that such useful GUI additions are coming anytime soon – at least without legislative action.

* Authors’ emphases. My conversion of the authors’ inline citations to hyperlinks.

First published Thursday, May 29, 2025

0 notes

govindhtech · 1 year ago

Text

NVIDIA Energies Meta’s HyperLlama 3: Faster AI for All

Today, NVIDIA revealed platform-wide optimisations aimed at speeding up Meta Llama 3, the most recent iteration of the large language model (LLM).

When paired with NVIDIA accelerated computing, the open approach empowers developers, researchers, and companies to responsibly innovate across a broad range of applications.

Educated using NVIDIA AI Using a computer cluster with 24,576 NVIDIA H100 Tensor Core GPUs connected by an NVIDIA Quantum-2 InfiniBand network, meta engineers trained Llama 3. Meta fine-tuned its network, software, and model designs for its flagship LLM with assistance from NVIDIA.

In an effort to push the boundaries of generative AI even farther, Meta recently revealed its intentions to expand its infrastructure to 350,000 H100 GPUs.

Aims Meta for Llama 3 Meta’s goal with Llama 3 was to create the greatest open models that could compete with the finest proprietary models on the market right now. In order to make Llama 3 more beneficial overall, Meta sought to address developer comments. They are doing this while keeping up their leadership position in the responsible use and deployment of LLMs.

In order to give the community access to these models while they are still under development, they are adopting the open source philosophy of publishing frequently and early. The Llama 3 model collection begins with the text-based models that are being released today. In the near future, the meta objective is to extend the context, enable multilingual and multimodal Llama 3, and keep enhancing overall performance in key LLM functions like coding and reasoning.

Exemplar Architecture For Llama 3, they went with a somewhat conventional decoder-only transformer architecture in keeping with the Meta design concept. They improved upon Llama 2 in a number of significant ways. With a vocabulary of 128K tokens, Llama 3’s tokenizer encodes language far more effectively, significantly enhancing model performance. In order to enhance the inference performance of Llama 3 models, grouped query attention (GQA) has been implemented for both the 8B and 70B sizes. They used a mask to make sure self-attention does not transcend document borders when training the models on sequences of 8,192 tokens.

Training Information Curating a sizable, excellent training dataset is essential to developing the best language model. They made a significant investment in pretraining data, adhering to the principles of Meta design. More than 15 trillion tokens, all gathered from publically accessible sources, are used to pretrained Llama 3. The meta training dataset has four times more code and is seven times larger than the one used for Llama 2. More over 5 percent of the Llama 3 pretraining dataset is composed of high-quality non-English data covering more than 30 languages, in anticipation of future multilingual use cases. They do not, however, anticipate the same calibre of performance in these languages as they do in English.

They created a number of data-filtering procedures to guarantee that Llama 3 is trained on the best possible data. To anticipate data quality, these pipelines use text classifiers, NSFW filters, heuristic filters, and semantic deduplication techniques. They discovered that earlier iterations of Llama are remarkably adept at spotting high-quality data, so they trained the text-quality classifiers that underpin Llama 3 using data from Llama 2.

In-depth tests were also conducted to determine the optimal methods for combining data from various sources in the Meta final pretraining dataset. Through these tests, we were able to determine the right combination of data that will guarantee Llama 3’s performance in a variety of use scenarios, such as trivia, STEM, coding, historical knowledge, etc.

Next for Llama 3: What? The first models they intend to produce for Llama 3 are the 8B and 70B variants. And there will be a great deal more.

The meta team is thrilled with how these models are trending, even though the largest models have over 400B parameters and are still in the training phase. They plan to release several models with more features in the upcoming months, such as multimodality, multilingual communication, extended context windows, and enhanced overall capabilities. When they have finished training Llama 3, they will also release an extensive research article.

They thought they could offer some pictures of how the Meta biggest LLM model is trending to give you an idea of where these models are at this point in their training. Please be aware that the models released today do not have these capabilities, and that the data is based on an early checkpoint of Llama 3 that is still undergoing training.

Utilising Llama 3 for Tasks Versions of Llama 3, optimised for NVIDIA GPUs, are currently accessible for cloud, data centre, edge, and PC applications.

Developers can test it via a browser at ai.nvidia.com. It comes deployed as an NVIDIA NIM microservice that can be used anywhere and has a standard application programming interface.

Using NVIDIA NeMo, an open-source LLM framework that is a component of the safe and supported NVIDIA AI Enterprise platform, businesses may fine-tune Llama 3 based on their data. NVIDIA TensorRT-LLM can be used to optimise custom models for inference, and NVIDIA Triton Inference Server can be used to deploy them.

Bringing Llama 3 to Computers and Devices Moreover, it utilizes NVIDIA Jetson Orin for edge computing and robotics applications, generating interactive agents similar to those seen in the Jetson AI Lab.

Furthermore, workstation and PC GPUs from NVIDIA and GeForce RTX accelerate Llama 3 inference. Developers can aim for over 100 million NVIDIA-accelerated systems globally using these systems.

Llama 3 Offers Optimal Performance The best techniques for implementing a chatbot’s LLM balance low latency, fast reading speed, and economical GPU utilisation.

Tokens, or roughly the equivalent of words, must be delivered to an LLM by such a service at a rate of around double the user’s reading speed, or 10 tokens per second.

Using these measurements, an initial test using the version of Llama 3 with 70 billion parameters showed that a single NVIDIA H200 Tensor Core GPU generated roughly 3,000 tokens/second, adequate to serve about 300 simultaneous users.

Thus, by serving over 2,400 users concurrently, a single NVIDIA HGX server equipped with eight H200 GPUs may deliver 24,000 tokens/second and further optimise expenses.

With eight billion parameters, the Llama 3 version for edge devices produced up to 40 tokens/second on the Jetson AGX Orin and 15 tokens/second on the Jetson Orin Nano.

Progression of Community Models As a frequent contributor to open-source software, NVIDIA is dedicated to enhancing community software that supports users in overcoming the most difficult obstacles. Additionally, open-source models encourage AI openness and enable widespread user sharing of research on AI resilience and safety.