Structured Object Language Model (SoLM): A Lightweight Language Model for Structured Outputs

Amazon's lightweight NLP model, the Structured Object Language Model (SoLM), generates structured objects that conform to a given schema. Discover how self-supervised denoising and CABS decoding reduce hallucinations.
One of the most important capabilities of today's generative models is converting unstructured, partially structured, or poorly structured inputs into structured objects that follow a schema: the fixed schemas of relational databases, the flexible schemas of document stores, function signatures, API specifications, and so on.
Large language models (LLMs) can do this work when given schema requirements and processing instructions, and most LLMs today offer a JSON mode or structured-outputs mode that spares users some of this prompt engineering.
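The general pattern can be sketched as follows. This is a minimal illustration, not Amazon's code: the schema, the field names, and the simulated model response are all made up, and a real system would call an LLM API in JSON mode rather than use a hard-coded string.

```python
import json

# Hypothetical product schema for illustration: field name -> expected type.
PRODUCT_SCHEMA = {"brand": str, "title": str, "price": float}

def validate_against_schema(raw_output: str, schema: dict) -> dict:
    """Parse a model's JSON-mode output and check it against a fixed schema."""
    obj = json.loads(raw_output)  # raises ValueError if not valid JSON
    for field, expected_type in schema.items():
        if field not in obj:
            raise KeyError(f"missing field: {field}")
        if not isinstance(obj[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")
    return obj

# Simulated JSON-mode response from an LLM (no real API call here).
response = '{"brand": "Acme", "title": "Steel water bottle", "price": 19.99}'
record = validate_against_schema(response, PRODUCT_SCHEMA)
```

The validation step matters because JSON mode only guarantees syntactically valid JSON, not conformance to a particular schema.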
This approach has several limitations. First, LLMs are expensive to scale to databases with millions or billions of entries or requests. Second, the prompt engineering involved can be complicated. Third, the built-in structured-outputs and JSON modes support only a limited range of schemas.
In a paper presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP) and posted to arXiv, Amazon introduced the lightweight structured object language model (SoLM) to address this problem. Unlike general-purpose LLMs, SoLM is trained to produce objects only within a particular schema. Its innovations include self-supervised denoising, a novel training method, and confidence-aware substructure beam search (CABS), an inference-time decoding method that lowers hallucination rates.
In tests, SoLM matched the output accuracy of state-of-the-art LLMs while being an order of magnitude more cost-efficient. The researchers also found that on product attribute generation, CABS decoding improved recall over conventional beam search decoding by 16.7% at 90% accuracy.
Applications
The structured-output paradigm unifies several seemingly unrelated AI/ML problems. One challenge arises when the structured object has multiple facets or contains redundant, interdependent information: the object may include short, type-constrained structured fields alongside long, natural-language description text.
Multifaceted objects of this kind, combining descriptive text with listings of key attributes, are commonly used to list products, homes, jobs, and so on. SoLM lets you generate an object whose facts are consistent with world knowledge (absolute consistency) and whose facets are consistent with one another (relative consistency).
Typically, a structured-output model is fed unstructured data and produces a structured object. In the paper, Amazon instead proposes using SoLM as a self-regenerating machine: the model is simply given a schema-structured object and asked to regenerate it.
Rather than structuring the input, the model cleans, normalises, corrects, and/or completes it while making it self-consistent. The input can include additional unstructured content, a structured record, or a record that follows a different schema. Whatever the input, SoLM always produces a clean record that follows the target schema.
The self-regenerating machine can fix incorrect facts, normalise unnormalised ones, complete missing descriptions, and correct inaccurate descriptions. Because these tasks are interdependent, doing them separately creates dependency cycles (e.g., should one generate descriptions from facts, or extract facts from descriptions?). Self-regeneration resolves these dependencies organically.
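The self-regeneration interface can be sketched as a function from any record to a schema-conforming record. This toy stand-in only normalises keys and exposes missing fields; the schema and field names are assumptions, and the real system would run the trained model rather than this dictionary logic.

```python
SCHEMA_FIELDS = ["brand", "title", "description"]  # toy target schema (assumed)

def regenerate(record: dict) -> dict:
    # Stand-in for the trained SoLM: map any input record to a record that
    # follows the target schema, normalising keys and exposing gaps. Fields
    # outside the schema are dropped; missing fields come back as None.
    normalised = {key.strip().lower(): value for key, value in record.items()}
    return {field: normalised.get(field) for field in SCHEMA_FIELDS}

messy = {" Brand ": "Acme", "TITLE": "Steel bottle", "color": "blue"}
clean = regenerate(messy)
# clean == {"brand": "Acme", "title": "Steel bottle", "description": None}
```

The point of the interface is that the same call handles cleaning, normalisation, and completion at once, which is what dissolves the dependency cycles described above.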
Innovations
Amazon trains SoLM with self-supervised denoising: take any sample of objects from a database, add artificial noise, and train the model to recover the original objects. The model thereby learns to improve on whatever it is fed. More aggressive noise, such as destroying the object's structure or randomly shuffling tokens, teaches the model to handle unstructured input and further improves output quality.
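Constructing a denoising training pair can be sketched as follows. The noise operations (random token dropping and shuffling) follow the description above, but the token strings and noise rates are illustrative assumptions, not the paper's actual configuration.

```python
import random

def add_noise(tokens, drop_prob=0.2, shuffle=True, seed=0):
    """Corrupt a serialised record: randomly drop tokens and shuffle the
    rest, mimicking the aggressive noise used to simulate unstructured
    input. The (noisy, clean) pair becomes one training example."""
    rng = random.Random(seed)  # seeded for reproducibility
    kept = [t for t in tokens if rng.random() > drop_prob]
    if shuffle:
        rng.shuffle(kept)
    return kept

# Illustrative serialised record; real tokenisation would differ.
clean = ["<brand>", "Acme", "<title>", "Steel", "bottle"]
noisy = add_noise(clean)
training_pair = (noisy, clean)  # the model learns to map noisy -> clean
```

Because the targets are just existing database records, no manual labeling is needed, which is what makes the training self-supervised.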
Although LLMs are trained to predict the most likely next token in a sequence, they use various decoding algorithms to produce outputs at inference time. Greedy decoding simply picks the highest-probability token at each step, so it cannot guarantee the highest-probability sequence over a fixed number of steps. One of the most common alternatives is beam search decoding, in which the model explores multiple candidate sequences in parallel and chooses the one with the highest cumulative probability; the beam width is the number of sequences the model considers.
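A minimal sketch of token-level beam search, using a made-up two-step next-token distribution in place of a real model:

```python
import math

# Toy next-token distributions (invented for illustration): the prefix
# determines the probabilities of each possible next token.
def next_token_probs(prefix):
    table = {
        (): {"A": 0.6, "B": 0.4},
        ("A",): {"x": 0.5, "y": 0.5},
        ("B",): {"x": 0.95, "y": 0.05},
    }
    return table[tuple(prefix)]

def beam_search(beam_width=2, max_len=2):
    # Keep the beam_width candidate sequences with the highest cumulative
    # log-probability at every step.
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```

Here greedy decoding would commit to "A" (probability 0.6) and end with sequence probability 0.30, while beam search with width 2 keeps "B" alive and finds the higher-probability sequence ["B", "x"] (0.4 × 0.95 = 0.38), illustrating why the greedy choice is not globally optimal.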
The objects SoLM produces are collections of key-value pairs, where the key is a data type in the schema, such as "brand" in a product listing, and the value is the value of that type, such as the item's brand name. Amazon uses special delimiter tokens to mark keys and values.
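A sketch of this serialisation might look like the following. The delimiter token names here are hypothetical; the article does not specify the actual tokens Amazon uses.

```python
# Hypothetical delimiter tokens; the real special tokens are not specified.
KEY_OPEN, KEY_CLOSE = "<k>", "</k>"
VAL_OPEN, VAL_CLOSE = "<v>", "</v>"

def serialize(record: dict) -> str:
    """Flatten a structured object into the delimited key/value token
    stream a language model can be trained on and decoded from."""
    parts = []
    for key, value in record.items():
        parts.append(f"{KEY_OPEN}{key}{KEY_CLOSE}{VAL_OPEN}{value}{VAL_CLOSE}")
    return "".join(parts)

serialized = serialize({"brand": "Acme", "material": "steel"})
# serialized == "<k>brand</k><v>Acme</v><k>material</k><v>steel</v>"
```

Explicit delimiters let the decoder recognise where one key-value pair ends and the next begins, which is what makes pair-level decoding (below in spirit, token-free in practice) possible.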
In confidence-aware substructure beam search, the atomic unit is the key-value pair rather than the token. The likelihood of a key-value pair can be computed from the LLM's output confidence scores. In another experiment, the researchers used a separately trained confidence-scoring model that takes as input the intermediate representation from one of the LLM's inner layers; this worked better than relying on the LLM's own confidence scores alone.
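The idea can be sketched as beam search whose expansion step is a whole key-value pair scored by a confidence model. The candidate values, confidence numbers, and pruning threshold below are all invented for illustration; in the paper the scores come from the LLM or the separately trained confidence model.

```python
import math

# Made-up candidate key-value pairs with confidence scores per slot.
CANDIDATES = {
    "brand": [("Acme", 0.92), ("Acme Corp", 0.55)],
    "material": [("steel", 0.88), ("titanium", 0.20)],
}

def cabs(candidates, beam_width=2, min_conf=0.5):
    """Confidence-aware substructure beam search (sketch): expand the beam
    one key-value pair at a time, pruning pairs whose confidence falls
    below a threshold to suppress likely-hallucinated facts."""
    beams = [({}, 0.0)]
    for key, options in candidates.items():
        expanded = []
        for obj, score in beams:
            for value, conf in options:
                if conf < min_conf:  # drop low-confidence (likely hallucinated) facts
                    continue
                expanded.append(({**obj, key: value}, score + math.log(conf)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

best = cabs(CANDIDATES)
# best == {"brand": "Acme", "material": "steel"}
```

Scoring whole key-value pairs rather than individual tokens lets the decoder reject an entire hallucinated fact (here, "titanium") in one step, rather than committing to it token by token.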
Amazon shows that a seven-billion-parameter SoLM matches or exceeds prompt-engineering methods running on considerably larger foundation models in the completeness and accuracy of facts and in the quality and factuality of descriptive content. CABS decoding greatly improves fact accuracy by removing hallucinated facts.