Kinara Ara-2 Processor Hits 12 Tokens Per Second Running 7 Billion Parameter LLMs
August 08, 2024 - 5:30 AM
Business Wire
Generative AI capabilities of this leading-edge
AI processor are demonstrated in a new video available on YouTube
Kinara™, Inc., today announced that its low-power, low-cost AI
processor, the Kinara Ara-2, has mastered the heavy demands of
accurately and efficiently running Generative AI applications, such
as Large Language Models (LLMs), at the edge. Specifically, the
company is demonstrating the flawless operation of the Qwen1.5-7B
model running on a single Ara-2 processor at 12 output tokens per
second. This capability, depicted in the new online video entitled
‘Kinara Ara-2 Masters Local LLM Chatbot’, is an important
accomplishment because LLMs, and Generative AI in general, must
run at the edge to ensure data privacy and to reduce latency by
removing the need for Internet access. Furthermore, with Generative
AI processing at the edge, the user only pays a one-time cost for
the integrated hardware in their personal computers and avoids
expensive cloud usage costs. Generative AI processing at the edge
increases the functionality of PCs, offering users the ability to
perform documentation summarization, transcription, translation,
and other beneficial and time-saving tasks.
This press release features multimedia. View
the full release here:
https://www.businesswire.com/news/home/20240808164943/en/
Under the hood of this PC is the Kinara
Ara-2 doing the heavy lifting of running an LLM Chatbot on a PC.
Ara-2 is Kinara’s latest AI processor. It provides the simplest
path for users to upgrade their PCs and embedded systems to join
the new age of Generative AI. Run large language models (LLMs) for
increased productivity or stable diffusion models to generate cool
images. (Photo: Business Wire)
Qwen, available as open source under the Apache 2.0 license and
backed by Alibaba Cloud (Tongyi Qianwen), is similar to LLaMA2 and
comprises a series of models across diverse sizes (e.g., 0.5B, 4B,
7B, 14B, 32B, 72B) and various functions, including chat, language
understanding, reasoning, math, and coding. From a Natural Language
Processing (NLP) perspective, Qwen can be used to process commands
that a user performs in day-to-day operations on their computer.
And unlike the voice command processing typically available in
cars, Qwen and other Generative AI chat models are multilingual,
accurate, and not restricted to specific text sequences.
Beyond generating responses to simple and complex text prompts at 12
tokens per second, effectively running Qwen1.5-7B or any other LLM
on the edge requires the Kinara Ara-2 to support three high-level
features: 1) the ability to aggressively quantize LLMs and other
generative AI workloads while still delivering near floating-point
accuracy; 2) extreme flexibility and capability to run all LLM
operators end-to-end without relying on the host (this includes all
model layers and activation functions); and 3) sufficient memory
size and bandwidth to effectively handle these extremely large
neural networks.
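To make the first of these features concrete, here is a minimal sketch of symmetric per-tensor integer quantization in plain Python. Kinara has not published Ara-2's actual quantization scheme, so the function names and the simple max-absolute-value scaling below are illustrative assumptions only; they show why low-bit integer weights can still land close to their floating-point values.

```python
# Illustrative sketch only: symmetric integer quantization of the kind
# feature 1 alludes to. Ara-2's real scheme is proprietary; this simply
# shows how float weights map to low-bit integers with bounded error.

def quantize(weights, bits=8):
    """Map float weights to symmetric signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and a scale."""
    return [v * scale for v in q]

weights = [0.42, -1.37, 0.05, 0.91, -0.66]
q8, s8 = quantize(weights, bits=8)
approx = dequantize(q8, s8)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
# Rounding error per weight is at most half a quantization step.
assert max_err <= s8 / 2
```

In practice, delivering near floating-point accuracy at aggressive bit widths requires much more than this (per-channel scales, calibration data, outlier handling), which is where a purpose-built processor and toolchain earn their keep.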
“Running any LLM on a low-power edge AI processor is quite a
feat but hitting 12 output tokens per second on a 7B parameter LLM
is a major accomplishment,” said Wajahat Qadeer, Kinara’s chief
architect. “However, the best is yet to come, as we are on target
to hit 15 output tokens per second by applying advanced software
techniques while leaving the model itself unmodified.”
As existing and new LLMs become available on Hugging Face and
elsewhere, Kinara can quickly bring up these models by leveraging
its innovative software and architectural flexibility, executing
them with floating-point accuracy while offering the low power
dissipation of an integer processor. And beyond
Generative AI applications, Ara-2 is very capable of handling
16-32+ video streams fed into edge servers for high-end object
detection, recognition, and tracking, using its advanced compute
engines to process higher resolution images quickly and with high
accuracy. Ara-2 is available as a stand-alone device, a USB module,
an M.2 module, and a PCIe card featuring multiple Ara-2 processors.
Interested parties are invited to contact Kinara directly to see
for themselves the Qwen1.5-7B and other LLM applications running on
Ara-2.
About Kinara
Kinara provides the world’s most power- and price-efficient Edge
AI inference platform supported by comprehensive AI software
development tools. Enabling Generative AI and smart applications
across retail, medical, industry 4.0, automotive, and smart cities,
Kinara’s AI processors, modules, and software can be found at the
heart of the AI industry’s most exciting and influential
innovations. Kinara envisions a world of exceptional customer
experiences, better manufacturing efficiency, and greater safety
for all. Learn more at https://kinara.ai/
All registered trademarks and other trademarks belong to their
respective owners.
View source
version on businesswire.com: https://www.businesswire.com/news/home/20240808164943/en/
Kinara Contact
Napier Partnership: Nesbert Musuwo, Account Manager,
Napier B2B Email Address: Nesbert@Napierb2b.com