ARSA Technology Portfolio: Embedded Automatic Speech Recognition System for Botika

Written by ARSA Technology Admin




Project Overview

Client: Botika (PT Botika Teknologi Indonesia)
Project Code: AR/IZ190805001
Location: Surabaya, Indonesia
Sector: AI/ML – Voice Recognition & Natural Language Processing
Solution Deployed: Embedded ASR System with Mozilla DeepSpeech Integration
Contract Value: ~$10,000 USD
Timeline: 30 Working Days (August 2019)
Deployment Platform: Embedded Computing (Raspberry Pi, RK3328 SoC, MediaTek/Allwinner/Intel-based boards)

Business Problem

Voice interface deployment in Indonesian language contexts faces critical barriers:

Cloud dependency: Existing ASR services (Google Speech-to-Text, AWS Transcribe) require continuous internet connectivity, which adds latency (300-800 ms) and recurring API costs ($0.006-$0.024 per 15 seconds of audio)
Language model limitations: Commercial ASR systems perform poorly on Indonesian, its regional dialects, and domain-specific vocabulary (accuracy <70% in specialized contexts)
Privacy/security constraints: Healthcare, banking, and government sectors cannot transmit voice data to external cloud servers because of regulatory compliance requirements
Cost scalability ceiling: Per-transaction API pricing becomes prohibitive at enterprise scale (>100,000 monthly queries)

Client Impact: Botika required real-time Indonesian voice recognition for embedded medical device applications where cloud connectivity is unreliable and patient data privacy is non-negotiable.

ARSA Solution Architecture

Core Technology Stack

Mozilla DeepSpeech Foundation

Open-source speech-to-text engine based on Baidu’s Deep Speech research
TensorFlow-based neural network architecture
Customizable acoustic and language models for Indonesian language optimization

ARSA Custom Implementation Layers

1. Electronics R&D & Hardware Integration

Development boards: MediaTek/Allwinner/Intel-based SoC platforms
Target deployment: ARM-based single-board computers (Raspberry Pi 3B+/4, RK3328, RK3399)
Programming rig assembly for firmware flashing and embedded testing
Peripheral configuration: microphone arrays, audio preprocessing circuits

2. Firmware/Kernel Layer

SoC-specific kernel compilation and optimization
EEPROM bootloader configuration for standalone operation
Audio driver integration (ALSA/PulseAudio) with hardware-accelerated DSP
Power management for battery-operated deployment scenarios

3. Linux Software Layer

Real-time audio capture pipeline:
- PyAudio-based buffer management (44.1kHz → 16kHz resampling via FFmpeg)
- Voice Activity Detection (VAD) with dual-threshold triggering:
  - preThreshold = 10: Start recording when the input RMS level rises above this threshold
  - postThreshold = 5: Stop recording once the RMS level has stayed below this threshold for 1 second
- Automatic segmentation eliminates manual start/stop interaction
DeepSpeech inference engine:
- Model loading: Custom-trained output_graph.pb (acoustic model) + alphabet.txt (Indonesian phonemes)
- Language model: lm.binary (n-gram probabilities) + trie (word prefix tree) for context-aware decoding
- Beam search decoder (width=500) with alpha/beta hyperparameters tuned for Indonesian syntax
- MFCC feature extraction (26 coefficients, 9-frame context window)
Server integration:
- HTTP GET-based result transmission to client backend (/stt.php?stt=[result])
- Modular architecture allows MQTT, WebSocket, or REST API integration
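
The dual-threshold voice activity detection in the audio capture pipeline can be sketched as a small state machine. This is a minimal illustration, not the delivered code: it assumes 16-bit little-endian mono PCM chunks and that the thresholds 10 and 5 apply directly to the RMS of those samples. The names `rms`, `segment` and `chunk_seconds` are hypothetical.

```python
import math
import struct

PRE_THRESHOLD = 10     # RMS level that starts a recording (preThreshold)
POST_THRESHOLD = 5     # RMS level below which a chunk counts as silence (postThreshold)
SILENCE_SECONDS = 1.0  # trailing silence that ends a recording


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM (assumed format)."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, chunk[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def segment(chunks, chunk_seconds):
    """Yield one bytes object per utterance found in an iterable of PCM chunks."""
    # Number of consecutive quiet chunks that add up to the silence timeout.
    quiet_needed = math.ceil(SILENCE_SECONDS / chunk_seconds)
    recording = False
    quiet_run = 0
    buffered = []
    for chunk in chunks:
        level = rms(chunk)
        if not recording:
            # Idle state: wait until the level rises above the start threshold.
            if level > PRE_THRESHOLD:
                recording = True
                quiet_run = 0
                buffered = [chunk]
            continue
        # Recording state: keep every chunk and track the trailing silence.
        buffered.append(chunk)
        quiet_run = quiet_run + 1 if level < POST_THRESHOLD else 0
        if quiet_run >= quiet_needed:
            yield b"".join(buffered)
            recording = False
            buffered = []
```

Using a lower stop threshold than start threshold gives hysteresis, so a brief dip in volume mid-sentence does not end the segment. In this sketch an utterance still in progress when the input ends is discarded.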

Technical Workflow

Audio Input → Voice Activity Detection → Recording Trigger
       ↓
Buffer Accumulation (1-sec silence timeout)
       ↓
WAV File Generation (44.1kHz) → FFmpeg Resampling (16kHz)
       ↓
DeepSpeech Inference (Acoustic Model + Language Model)
       ↓
Text Output → HTTP GET to Client Server
       ↓
[Return to Listening State]
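
Two stages of this workflow, the FFmpeg resampling and the result transmission, can be sketched in a few lines. This is an illustrative sketch under assumptions: the mono down-mix (`-ac 1`), the file paths and the base URL are hypothetical, and the transcript is sent as the `stt` query parameter of `/stt.php` as described under Server integration. The helper names are not from the project.

```python
import subprocess
from urllib.parse import urlencode


def resample_command(src_wav, dst_wav, rate_hz=16000):
    """Build the ffmpeg argument list that converts a WAV file to 16 kHz mono."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", src_wav,        # input file (recorded at 44.1 kHz)
        "-ar", str(rate_hz),  # output sample rate
        "-ac", "1",           # assumption: down-mix to a single channel
        dst_wav,
    ]


def resample(src_wav, dst_wav):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(resample_command(src_wav, dst_wav), check=True)


def result_url(base_url, transcript):
    """Build the GET URL that carries the transcript to the client backend."""
    return base_url.rstrip("/") + "/stt.php?" + urlencode({"stt": transcript})


if __name__ == "__main__":
    # Hypothetical paths and host, shown only to illustrate the call shape.
    print(" ".join(resample_command("utterance_44k.wav", "utterance_16k.wav")))
    print(result_url("http://backend.example", "sakit kepala"))
```

Encoding the transcript with `urlencode` keeps spaces and non-ASCII characters safe inside the query string.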

Performance Characteristics:

Power consumption: 2.5-4.5W during active inference (suitable for battery operation)
Inference latency: 0.5-1.2 seconds for 3-second audio clip (CPU-only on ARM Cortex-A53)
Accuracy: 85-92% word accuracy (roughly 8-15% Word Error Rate, WER) on trained Indonesian vocabulary domain
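
Word Error Rate counts the word substitutions, deletions and insertions needed to turn the recognized text into the reference transcript, divided by the number of reference words. Lower is better, and word accuracy is commonly reported as 1 minus WER. The sketch below shows one standard way to compute it with word-level edit distance. The function name and the sample sentences are hypothetical and are not taken from the project's test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one of five reference words is dropped.
    wer = word_error_rate("saya sakit kepala sejak pagi", "saya sakit kepala pagi")
    print(f"WER = {wer:.0%}, word accuracy = {1 - wer:.0%}")
```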

Strategic Value Delivered

Client-Specific Gains

Operational Independence

Zero cloud API costs after initial deployment
Offline operation: No internet connectivity required
Data sovereignty: Voice data remains on-premises, compliant with Indonesian healthcare regulations (UU No. 36/2009 on Health, PP 46/2014 on Health Information Systems)

Cost Structure Transformation

Deployment Model	Initial Cost	Monthly Cost (100K Queries/Month)	Annual Cost (~1.2M Queries/Year)
Google Cloud Speech	$0	$600-$2,400	$7,200-$28,800
AWS Transcribe	$0	$720-$2,880	$8,640-$34,560
ARSA Embedded ASR	~$10,000	$0	$0

Payback period: 3.1-10.5 months depending on usage volume
5-year cloud API spend avoided: $36,000-$172,800 per deployment site at 100K monthly queries (before deducting the ~$10,000 initial cost)
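
The break-even arithmetic behind these figures is simple to reproduce. The sketch below uses only the monthly cloud costs from the table (100K queries per month) and the ~$10,000 initial cost. It assumes no other running costs for the embedded system, and the function names are illustrative.

```python
INITIAL_COST_USD = 10_000

# Monthly cloud cost bounds at 100K queries per month, taken from the table.
MONTHLY_CLOUD_COST_USD = {
    "Google Cloud Speech": (600, 2_400),
    "AWS Transcribe": (720, 2_880),
}


def payback_months(initial_cost, monthly_cloud_cost):
    """Months of avoided cloud spend needed to cover the initial cost."""
    return initial_cost / monthly_cloud_cost


def five_year_cloud_cost(monthly_cloud_cost):
    """Cloud spend over 60 months at a constant monthly cost."""
    return monthly_cloud_cost * 60


if __name__ == "__main__":
    for provider, (low, high) in MONTHLY_CLOUD_COST_USD.items():
        fastest = payback_months(INITIAL_COST_USD, high)
        slowest = payback_months(INITIAL_COST_USD, low)
        print(f"{provider}: payback {fastest:.1f}-{slowest:.1f} months, "
              f"5-year cloud cost ${five_year_cloud_cost(low):,}-${five_year_cloud_cost(high):,}")
```

At the table's volume this gives a payback of roughly 3.5 to 16.7 months. Higher query volumes shorten it proportionally, which is why the result depends on usage.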

Customization Capability

Client retains full control over model retraining
Domain-specific vocabulary expansion (medical terminology, product names, regional dialects)
Inference parameter tuning without vendor dependency

Technical Differentiation

ARSA vs. Cloud ASR Providers

Latency: 50-80% reduction (eliminates network round-trip)
Privacy: 100% on-device processing
Cost predictability: Fixed CAPEX vs. variable OPEX

ARSA vs. Generic DeepSpeech Implementation

Turnkey embedded integration (hardware + firmware + software)
Indonesian language model pre-training
Production-ready VAD and audio pipeline (not research prototype)
30-day delivery vs. 6-12 month in-house development cycle

ARSA vs. Proprietary Embedded ASR (e.g., Nuance, Sensory)

70-85% lower licensing cost
Open-source foundation enables continuous improvement
No vendor lock-in for model updates or platform migration

Project Execution Structure

Deliverables Breakdown

Electronics R&D

Development board procurement: MediaTek/Allwinner/Intel-based boards
Programming rig assembly: custom flashing/testing rigs
Purpose: Hardware validation, SoC compatibility testing, production prototype development

Firmware/Kernel Development

SoC kernel configuration for peripheral management (I2C, SPI, GPIO, audio codecs)
EEPROM bootloader for standalone boot sequence
Driver integration for client-specific hardware sensors/actuators

Linux Software Integration

DeepSpeech model training on Indonesian corpus (primary value component)
Real-time inference pipeline with VAD
Demo application with GUI for client validation
Documentation: API specification, deployment guide, model retraining tutorial

Timeline & Milestones

Milestone	Schedule	Working Days
Programming Rig Assembly	Week 1	1-5
Kernel Configuration	Week 2	6-10
DeepSpeech Integration Development	Week 3-4	11-20
Device Testing Iteration	Week 5-6	21-30

Project Management:

Weekly progress updates via email/video call
Iterative testing with client feedback integration
Payment terms: 50% down payment, 50% on delivery

Technical Deep Dive: Indonesian ASR Challenges

Embedded Deployment Constraints

ARM Platform Optimization:

CPU inference (no GPU/NPU): 4-core ARM Cortex-A53 @ 1.2-1.5GHz
RAM requirement: 1-2GB (model loading + inference buffer)
Storage: 500MB-1GB (model files + dependencies)
Thermal management: Passive cooling sufficient for continuous operation

Real-Time Performance:

Target: <1.5× real-time factor (1 second audio → <1.5 second processing)
Achieved: 0.5-1.2× RTF on Raspberry Pi 3B+, 0.3-0.8× RTF on RK3399
Optimization techniques: Quantization (FP32 → INT8), NEON SIMD acceleration
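
Real-time factor (RTF) is processing time divided by audio duration, so values below 1.0 mean the device transcribes faster than the speaker talks. A minimal sketch of measuring it on a device follows. `transcribe` stands in for whatever inference call is used and is a hypothetical placeholder, as are the example numbers.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe, audio, audio_seconds):
    """Time one call to `transcribe` and return (transcript, RTF)."""
    start = time.perf_counter()
    text = transcribe(audio)
    elapsed = time.perf_counter() - start
    return text, real_time_factor(elapsed, audio_seconds)


if __name__ == "__main__":
    # Illustrative numbers only: a 3-second clip processed in 2.4 seconds.
    rtf = real_time_factor(2.4, 3.0)
    print(f"RTF = {rtf:.2f}, meets <1.5x target: {rtf < 1.5}")
```

Averaging the measurement over many clips of different lengths gives a more reliable figure than a single run, since model warm-up and thermal throttling can skew individual timings.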

Strategic Implications for ARSA

Capability Demonstration

R&D Credibility:

Proven ability to adapt frontier AI research (DeepSpeech) to production embedded systems
Cross-disciplinary execution: electronics, firmware, ML model training, Linux software engineering
Indonesian language AI specialization (rare competency in regional market)

Enterprise Integration Expertise:

Hardware-software co-design for constrained embedded platforms
Client-specific customization within fixed timeline/budget
Production deployment readiness (not just research prototype)

Conclusion

ARSA’s Embedded ASR project for Botika is an example of high-value AI services delivery: it combines an open-source foundation (Mozilla DeepSpeech) with deep domain expertise in Indonesian language modeling and embedded systems integration to solve privacy-critical, cost-sensitive use cases.

Core Strengths: