#OpenAIo1preview | Explore Tumblr posts and blogs

govindhtech · 8 months ago

Presenting Claude 3.5 Haiku, A New Sonnet, And Computer Use

Anthropic is unveiling a new model, Claude 3.5 Haiku, and an upgraded Claude 3.5 Sonnet today. The upgraded Claude 3.5 Sonnet improves on its predecessor across the board, with particularly large gains in coding, where it was already at the top of the field.

Additionally, Anthropic is launching a groundbreaking new capability in public beta: computer use. Available today through the API, it lets developers direct Claude to use computers the way people do: by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage the capability is still experimental and can be cumbersome and error-prone. Anthropic is releasing computer use early to gather developer feedback and expects the capability to improve rapidly over time.

Companies like Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun exploring these possibilities, carrying out tasks that require dozens or even hundreds of steps. For instance, Replit is using Claude 3.5 Sonnet’s computer use and UI navigation capabilities to build a key feature for its Replit Agent product that evaluates apps while they are being built.

The upgraded Claude 3.5 Sonnet is now available to all users. Starting today, developers can build with the computer use beta on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. The new Claude 3.5 Haiku will be released later this month.

Image credit: Anthropic

Claude 3.5 Sonnet: Prominent expertise in software engineering

The upgraded Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with the largest gains in agentic coding and tool use tasks. On coding, it raises performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models, including reasoning models like OpenAI o1-preview and specialized systems built for agentic coding. On TAU-bench, an agentic tool use task, it improves from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the more difficult airline domain. The new Claude 3.5 Sonnet delivers these improvements at the same price and speed as the original.
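Because these results are quoted as before-and-after scores, it can help to separate the absolute change (in percentage points) from the relative change. The helper below is only an illustrative sketch: the function name `gain` is hypothetical, and only the scores themselves come from the text above.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return absolute, relative


# Before/after scores as quoted in the article.
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}

for name, (before, after) in scores.items():
    points, percent = gain(before, after)
    print(f"{name}: +{points:.1f} points ({percent:.1f}% relative)")
```

A jump of a given number of points is a larger relative improvement when the starting score is lower, which is why the airline result looks more dramatic in relative terms than the retail one.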

Early customer feedback suggests the upgraded Claude 3.5 Sonnet is a substantial step forward for AI-powered coding. GitLab, which tested the model for DevSecOps tasks, found that it delivered stronger reasoning (up to 10% across use cases) with no added latency, making it well suited to multi-step software development processes. Cognition uses the new Claude 3.5 Sonnet for autonomous AI evaluations and saw significant gains in coding, planning, and problem-solving compared to the previous version. The Browser Company, which used the model to automate web-based workflows, observed that Claude 3.5 Sonnet outperformed every other model it had tested.

As part of Anthropic’s ongoing effort to work with outside experts, the US AI Safety Institute (US AISI) and the UK Safety Institute (UK AISI) conducted joint pre-deployment testing of the new Claude 3.5 Sonnet model.

Anthropic also evaluated the upgraded Claude 3.5 Sonnet for catastrophic risks and found that the ASL-2 Standard, as described in its Responsible Scaling Policy, remains appropriate for this model.

Claude 3.5 Haiku: Cutting edge combined with speed and affordability

Claude 3.5 Haiku is the next generation of Anthropic’s fastest model. For the same price and speed as Claude 3 Haiku, it improves across every skill set and outperforms even Claude 3 Opus, the largest model in the previous generation, on many intelligence benchmarks. Claude 3.5 Haiku is particularly strong on coding tasks. For instance, it scores 40.6% on SWE-bench Verified, outperforming many agents that use publicly available state-of-the-art models, including the original Claude 3.5 Sonnet and GPT-4o.

Claude 3.5 Haiku’s low latency, enhanced instruction following, and more precise tool use make it ideal for user-facing products, specialized sub-agent tasks, and creating customized experiences from massive amounts of data, such as pricing, inventory records, or purchase histories.

Use cases

Typical use cases for Claude 3.5 Haiku include:

Code completions

Claude 3.5 Haiku speeds up development workflows by providing fast, accurate code completions and suggestions. It is a good fit for software teams that want to increase productivity and streamline their coding process.

Interactive chatbots

With improved conversational abilities and quick response times, Claude 3.5 Haiku is excellent at powering chatbots that are responsive and able to handle large volumes of user interactions. It is especially useful for customer service, e-commerce, and educational platforms that need engagement at scale.

Data extraction and labeling

Claude 3.5 Haiku is useful for quick data extraction and automatic labeling activities since it effectively processes and classifies information. Organizations working with substantial amounts of unstructured data in the fields of research, healthcare, and finance may find this feature particularly helpful.

Real-time content moderation

Claude 3.5 Haiku’s enhanced reasoning and content comprehension skills enable dependable, instantaneous content moderation. Because of this, social media platforms, internet forums, and media companies that need to consistently provide appropriate and safe content find it useful.

Pricing and availability

Claude 3.5 Haiku will be made available later this month through Anthropic’s first-party API, Amazon Bedrock, and Google Cloud’s Vertex AI, initially as a text-only model, with image input to follow.

Pricing for Claude 3.5 Haiku starts at $0.25 per million input tokens and $1.25 per million output tokens, with up to 90% cost savings with prompt caching and 50% cost savings with the Message Batches API.
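To see how these figures combine, here is a rough cost estimator. It is a sketch under stated assumptions, not Anthropic’s billing logic: the function name `estimate_cost` is hypothetical, the caching discount is assumed to apply only to the cached share of input tokens at the full 90%, cache-write charges are ignored, and the 50% batch discount is assumed to apply to the whole bill and to stack with caching. Only the per-token prices and the two discount percentages come from the text above.

```python
INPUT_PRICE_PER_MTOK = 0.25   # USD per million input tokens (from the article)
OUTPUT_PRICE_PER_MTOK = 1.25  # USD per million output tokens (from the article)
BATCH_DISCOUNT = 0.50         # quoted batch saving
CACHE_DISCOUNT = 0.90         # "up to" figure, assumed to apply in full


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_fraction: float = 0.0, batch: bool = False) -> float:
    """Estimate the USD cost of a workload under the assumptions listed above."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Assumption: cached input tokens are billed at 10% of the normal input price.
    billable_input = uncached + cached * (1.0 - CACHE_DISCOUNT)
    total = (billable_input * INPUT_PRICE_PER_MTOK
             + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
    if batch:
        # Assumption: the batch discount applies to the whole bill.
        total *= 1.0 - BATCH_DISCOUNT
    return total


print(f"${estimate_cost(1_000_000, 1_000_000):.2f}")              # list price
print(f"${estimate_cost(1_000_000, 1_000_000, batch=True):.2f}")  # batched
```

Under these assumptions, output tokens dominate the bill for balanced workloads, so caching helps most when prompts are long and responses are short.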

Teaching Claude to use computers responsibly

With computer use, Anthropic is attempting something fundamentally new. Rather than building specialized tools to help Claude complete specific tasks, it is teaching Claude general computer skills, which allow it to use a wide range of standard tools and software programs designed for people. Developers can use this emerging capability to build and test software, automate repetitive processes, and carry out open-ended tasks like research.

To make these general skills possible, Anthropic has built an API that lets Claude perceive and interact with computer interfaces. Developers can integrate this API to enable Claude to translate instructions (like “use data from my computer and online to fill out this form”) into computer commands (like “check a spreadsheet,” “move the cursor to open a web browser,” “navigate to the relevant web pages,” “fill out a form with the data from those pages,” and so on).
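The cycle this paragraph describes (look at the screen, choose a command, carry it out, repeat) can be sketched without any real API. Everything below is hypothetical and for illustration only: `Action`, `FakeDesktop`, `run_agent`, and `scripted_policy` are invented names, the "screenshot" is just a text summary, and a scripted list of actions stands in for the model. It is not Anthropic’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Action:
    kind: str          # "move", "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """A toy stand-in for a real screen, mouse, and keyboard."""
    cursor: tuple = (0, 0)
    clicks: list = field(default_factory=list)
    typed: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real system would return pixels; a text summary is enough here.
        return f"cursor={self.cursor} typed={''.join(self.typed)!r}"

    def apply(self, action: Action) -> None:
        if action.kind == "move":
            self.cursor = (action.x, action.y)
        elif action.kind == "click":
            self.clicks.append(self.cursor)
        elif action.kind == "type":
            self.typed.append(action.text)
        else:
            raise ValueError(f"unknown action: {action.kind}")


def run_agent(desktop: FakeDesktop,
              choose_action: Callable[[str], Action],
              max_steps: int = 10) -> bool:
    """Observe, decide, act; return True if the policy finished within max_steps."""
    for _ in range(max_steps):
        observation = desktop.screenshot()
        action = choose_action(observation)
        if action.kind == "done":
            return True
        desktop.apply(action)
    return False


def scripted_policy(actions: Iterable[Action]) -> Callable[[str], Action]:
    """Stand-in for the model: replays fixed actions, then reports 'done'."""
    remaining = iter(actions)
    return lambda observation: next(remaining, Action("done"))


desktop = FakeDesktop()
plan = [Action("move", 10, 20), Action("click"), Action("type", text="hello")]
finished = run_agent(desktop, scripted_policy(plan))
print(finished, desktop.screenshot())
```

The `max_steps` budget matters in practice: an agent that runs out of steps simply stops without finishing, so allowing more steps gives it more chances to complete a task.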

On OSWorld, which evaluates AI models’ ability to use computers the way people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category, significantly higher than the next-best AI system’s score of 7.8%. When given more steps to complete the task, Claude scored 22.0%.

Although Anthropic expects this capability to improve rapidly in the coming months, Claude’s current ability to use computers is imperfect. Some actions that people perform with ease, like scrolling, dragging, and zooming, still give Claude trouble, so Anthropic advises developers to start their experimentation with low-risk tasks. Because computer use may offer a new avenue for more familiar threats like spam, misinformation, or fraud, Anthropic is taking a proactive approach to promoting its safe deployment. It has developed new classifiers that can identify when computer use is occurring and whether harm is taking place. Anthropic’s post on developing computer use describes the research process behind this new skill as well as additional safety measures.

Considering the future

The promise and consequences of increasingly powerful AI systems will become clearer to us as we learn from the early implementations of this technology, which is still in its infancy.

Amazon Bedrock offers the upgraded Claude 3.5 Sonnet from Anthropic (available now) and computer use (public beta), with Claude 3.5 Haiku coming soon.

The updated Claude 3.5 Sonnet costs the same as the original and is currently available in the US West (Oregon) AWS Region on Amazon Bedrock.

Along with the upgraded model’s increased intelligence, developers can now integrate computer use (available in public beta) into their apps to improve software testing procedures, automate complex desktop workflows, and build more sophisticated AI-powered applications.

In the coming weeks, Claude 3.5 Haiku will be made available, first as a text-only model, with image input to follow.

Read more on govindhtech.com

#PresentingClaude35Haiku #NewSonnet #Computer #Claude35Sonnet #Anthropic #OpenAIo1preview #Claude3Opus #Claude3Haiku #GPT4o #AmazonBedrock #SonnetAnthropic #TECHNOLOGY #TECHNEWS #NEWS #GOVINDHTECH

govindhtech · 9 months ago

OpenAI o1-preview, o1-mini: Advanced Reasoning Models

OpenAI o1-preview and OpenAI o1-mini: a new series of reasoning models for solving hard problems.

OpenAI o1-preview

OpenAI has created a new series of AI models designed to spend more time thinking before they respond. Compared to earlier models, they can reason through complex tasks and solve harder problems in science, coding, and math.

The first installment of this series is now available through ChatGPT and its API. OpenAI anticipates frequent upgrades and enhancements as this is only a preview. OpenAI is also including evaluations for the upcoming upgrade, which is presently being developed, with this release.

How it functions

These models were trained to spend more time thinking through problems before responding, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.

In OpenAI’s tests, the upcoming model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. It also excels in math and coding. On a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of the problems, compared to 83% for the reasoning model. Its coding abilities were evaluated in contests, where it reached the 89th percentile in Codeforces competitions.

As an early model, it does not yet have many of the features that make ChatGPT useful, such as uploading files and images or browsing the web for information. For many common cases, GPT-4o will be more capable in the near term.

However, for complex reasoning tasks this is a substantial advancement and marks a new level of AI capability. In light of this, OpenAI is resetting the counter to 1 and naming this series OpenAI o1.

Safety

While developing these new models, OpenAI has also come up with a new safety training approach that uses the models’ reasoning capabilities to make them adhere to safety and alignment guidelines. Because the models can reason about the safety rules in context, they can apply them more effectively.

One way OpenAI measures safety is by testing how well a model continues to follow its safety rules when a user tries to bypass them, a practice known as “jailbreaking.” On one of OpenAI’s hardest jailbreaking tests, GPT-4o scored 22 (out of 100), while the o1-preview model scored 84. Further details can be found in the system card and OpenAI’s research post.

OpenAI has strengthened its safety work, internal governance, and federal government coordination to match the enhanced capabilities of these models. This includes board-level review procedures, such as those conducted by its Safety & Security Committee, best-in-class red teaming, and thorough testing and evaluations utilizing its Preparedness Framework.

OpenAI recently finalized collaborations with the AI Safety Institutes in the United States and the United Kingdom to further its commitment to AI safety. OpenAI has initiated the process of putting these agreements into practice by providing the institutes with preliminary access to a research version of this model. This was a crucial initial step in its collaboration, assisting in the development of a procedure for future model research, assessment, and testing both before and after their public release.

For whom it is intended

These enhanced reasoning capabilities may be particularly useful for anyone tackling complex problems in science, coding, math, and related fields. For instance, healthcare researchers can use OpenAI o1-preview to annotate cell sequencing data, physicists can use it to generate the complicated mathematical formulas needed for quantum optics, and developers in all fields can use it to build and execute multi-step workflows.

OpenAI o1-mini

The o1 series is excellent at producing and debugging complex code with accuracy. OpenAI is also launching OpenAI o1-mini, a quicker, less expensive reasoning model that excels at coding, to provide developers with an even more effective option. For applications requiring reasoning but not extensive domain knowledge, o1-mini is a powerful and economical model because it is smaller and costs 80% less than o1-preview.

How OpenAI o1 is used

ChatGPT Plus and Team users have access to the o1 models starting today. Both o1-preview and o1-mini can be chosen manually in the model selector. At launch, the weekly rate limits are 30 messages for o1-preview and 50 for o1-mini. OpenAI is working to raise those limits and to let ChatGPT select the appropriate model automatically for each request.

Users of ChatGPT Edu and Enterprise will have access to both models starting next week.

Developers who qualify for API usage tier 5 can begin prototyping with both models in the API today, with a rate limit of 20 RPM. OpenAI aims to raise these limits after further testing. The API for these models currently does not support function calling, streaming, system messages, and other features. Check out the API documentation to get started.
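A cap of 20 requests per minute means a client has to pace its own calls. One common way to do that, shown here as a minimal sketch rather than anything OpenAI prescribes, is a sliding-window limiter that remembers the timestamps of recent requests. The class name `SlidingWindowLimiter`, its methods, and the injectable `clock` are assumptions made for illustration; only the figure of 20 requests per minute comes from the text.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_calls requests in any rolling window of `period` seconds."""

    def __init__(self, max_calls: int = 20, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock          # injectable so the limiter can be tested
        self._stamps = deque()      # timestamps of calls still inside the window

    def wait_time(self) -> float:
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.period - (now - self._stamps[0])

    def record(self) -> None:
        """Note that a request is being sent right now."""
        self._stamps.append(self.clock())


limiter = SlidingWindowLimiter()  # 20 calls per 60 seconds by default
delay = limiter.wait_time()
if delay > 0:
    time.sleep(delay)
limiter.record()
# ... send the API request here ...
```

Calling `wait_time()` and then `record()` before every request keeps a single-threaded client under the cap; a multi-threaded client would also need a lock around both calls.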

OpenAI also intends to provide all ChatGPT Free users with access to o1-mini.

Next up

This is an early preview of these reasoning models in ChatGPT and the API. In addition to model updates, OpenAI plans to add browsing, file and image uploading, and other features to make them more useful to everyone.

In addition to the new OpenAI o1 series, OpenAI also wants to keep creating and publishing models in its GPT series.