ARSA Technology Portfolio: Embedded Automatic Speech Recognition System for Botika

Written by ARSA Technology Admin

Project Overview

Client: Botika (PT Botika Teknologi Indonesia)
Project Code: AR/IZ190805001
Location: Surabaya, Indonesia
Sector: AI/ML – Voice Recognition & Natural Language Processing
Solution Deployed: Embedded ASR System with Mozilla DeepSpeech Integration
Contract Value: ~$10,000 USD
Timeline: 30 Working Days (August 2019)
Deployment Platform: Embedded Computing (Raspberry Pi, RK3328 SoC, MediaTek/Allwinner/Intel-based boards)


Business Problem

Voice interface deployment in Indonesian language contexts faces critical barriers:

  • Cloud dependency: Existing ASR solutions (Google Speech-to-Text, AWS Transcribe) require continuous internet connectivity, introducing latency (300-800ms) and recurring API costs ($0.006-$0.024 per 15 seconds)
  • Language model limitations: Commercial ASR systems perform poorly on Indonesian, its regional dialects, and domain-specific vocabulary (accuracy <70% in specialized contexts)
  • Privacy/security constraints: Healthcare, banking, and government sectors cannot transmit voice data to external cloud servers due to regulatory compliance requirements
  • Cost scalability ceiling: Per-transaction API pricing becomes prohibitive at enterprise scale (>100,000 monthly queries)

Client Impact: Botika required real-time Indonesian voice recognition for embedded medical device applications where cloud connectivity is unreliable and patient data privacy is non-negotiable.


ARSA Solution Architecture

Core Technology Stack

Mozilla DeepSpeech Foundation

  • Open-source speech-to-text engine based on Baidu’s Deep Speech research
  • TensorFlow-based neural network architecture
  • Customizable acoustic and language models for Indonesian language optimization

ARSA Custom Implementation Layers

1. Electronics R&D & Hardware Integration

  • Development boards: MediaTek/Allwinner/Intel-based SoC platforms
  • Target deployment: ARM-based single-board computers (Raspberry Pi 3B+/4, RK3328, RK3399)
  • Programming rig assembly for firmware flashing and embedded testing
  • Peripheral configuration: microphone arrays, audio preprocessing circuits

2. Firmware/Kernel Layer

  • SoC-specific kernel compilation and optimization
  • EEPROM bootloader configuration for standalone operation
  • Audio driver integration (ALSA/PulseAudio) with hardware-accelerated DSP
  • Power management for battery-operated deployment scenarios

3. Linux Software Layer

  • Real-time audio capture pipeline:
    • PyAudio-based buffer management (44.1kHz → 16kHz resampling via FFmpeg)
    • Voice Activity Detection (VAD) with dual-threshold triggering:
      • preThreshold = 10: Start recording when RMS exceeds baseline
      • postThreshold = 5: Stop recording after 1-second silence
    • Automatic segmentation eliminates manual start/stop interaction
  • DeepSpeech inference engine:
    • Model loading: Custom-trained output_graph.pb (acoustic model) + alphabet.txt (Indonesian character set)
    • Language model: lm.binary (n-gram probabilities) + trie (word prefix tree) for context-aware decoding
    • Beam search decoder (width=500) with alpha/beta hyperparameters tuned for Indonesian syntax
    • MFCC feature extraction (26 coefficients, 9-frame context window)
  • Server integration:
    • HTTP GET-based result transmission to client backend (/stt.php?stt=[result])
    • Modular architecture allows MQTT, WebSocket, or REST API integration
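
The dual-threshold triggering described above can be sketched as a small state machine. This is a minimal illustration, not ARSA's production code: the function names, the per-frame silence counter, and the RMS scale are assumptions. Only the threshold values (10 and 5) and the 1-second silence timeout come from the description.

```python
import math
from typing import Iterable, List, Sequence

PRE_THRESHOLD = 10   # from the text: start recording when RMS exceeds this
POST_THRESHOLD = 5   # from the text: frames below this RMS count as silence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square level of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_utterances(
    frames: Iterable[Sequence[int]], frames_per_second: int
) -> List[List[Sequence[int]]]:
    """Group frames into utterances using the two thresholds.

    frames_per_second is a hypothetical parameter: the number of
    consecutive silent frames that add up to the 1-second timeout.
    """
    utterances: List[List[Sequence[int]]] = []
    current: List[Sequence[int]] = []
    silent_frames = 0
    recording = False

    for frame in frames:
        level = rms(frame)
        if not recording:
            # Idle: wait for the level to cross the start threshold.
            if level > PRE_THRESHOLD:
                recording = True
                current = [frame]
                silent_frames = 0
            continue

        # Recording: keep every frame and track trailing silence.
        current.append(frame)
        silent_frames = silent_frames + 1 if level < POST_THRESHOLD else 0
        if silent_frames >= frames_per_second:
            utterances.append(current)
            current = []
            recording = False

    if recording and current:
        # Input ended mid-utterance: keep what was captured.
        utterances.append(current)
    return utterances
```

Using a higher threshold to start than to stop gives hysteresis, so a speaker whose level briefly dips mid-sentence does not end the recording early. In a real capture loop the frames would come from the PyAudio stream rather than from a list.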

Technical Workflow

Audio Input → Voice Activity Detection → Recording Trigger
       ↓
Buffer Accumulation (1-sec silence timeout)
       ↓
WAV File Generation (44.1kHz) → FFmpeg Resampling (16kHz)
       ↓
DeepSpeech Inference (Acoustic Model + Language Model)
       ↓
Text Output → HTTP GET to Client Server (/stt.php?stt=[result])
       ↓
[Return to Listening State]
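
The resampling and transmission steps of this workflow can be sketched as two small helpers. This is an illustration under stated assumptions: the helper names and the base URL are hypothetical, and downmixing to mono (`-ac 1`) is an assumption based on DeepSpeech expecting 16 kHz mono input. The 16 kHz target rate and the `/stt.php?stt=` query parameter come from the description above.

```python
from typing import List
from urllib.parse import urlencode

TARGET_RATE_HZ = 16000  # from the text: FFmpeg resamples 44.1 kHz to 16 kHz


def build_resample_command(src_wav: str, dst_wav: str) -> List[str]:
    """FFmpeg argument list that resamples src_wav to 16 kHz mono."""
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-i", src_wav,              # input recording (44.1 kHz capture)
        "-ar", str(TARGET_RATE_HZ), # output sample rate
        "-ac", "1",                 # assumption: downmix to mono
        dst_wav,
    ]


def build_result_url(base_url: str, transcript: str) -> str:
    """URL that carries the transcript in the 'stt' query parameter."""
    return f"{base_url}/stt.php?{urlencode({'stt': transcript})}"
```

The command list can be passed to `subprocess.run(cmd, check=True)` and the URL to `urllib.request.urlopen`. Encoding the transcript with `urlencode` matters because recognized text contains spaces and may contain characters that are not valid in a raw query string.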

Performance Characteristics:

  • Power consumption: 2.5-4.5W during active inference (suitable for battery operation)
  • Inference latency: 0.5-1.2 seconds for 3-second audio clip (CPU-only on ARM Cortex-A53)
  • Accuracy: 85-92% word accuracy (roughly 8-15% Word Error Rate, WER) on trained Indonesian vocabulary domain
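
Word Error Rate is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words, so lower is better and word accuracy is 1 minus WER. The sketch below shows that calculation; the function names and the sample sentences are illustrative assumptions, not the evaluation script used in the project.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance counted in whole words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + substitution,  # match or substitution
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One substituted word out of four reference words.
wer = word_error_rate("saya mau makan nasi", "saya mau minum nasi")
print(f"WER = {wer:.2f}, word accuracy = {1 - wer:.2f}")
```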

Strategic Value Delivered

Client-Specific Gains

Operational Independence

  • Zero cloud API costs after initial deployment
  • Offline operation: No internet connectivity required
  • Data sovereignty: Voice data remains on-premises, compliant with Indonesian healthcare regulations (UU No. 36/2009 on Health, PP 46/2014 on Health Information Systems)

Cost Structure Transformation

Deployment Model | Initial Cost | Cost at 100K Monthly Queries | Annual Cost (1.2M Queries)
Google Cloud Speech | $0 | $600-$2,400 | $7,200-$28,800
AWS Transcribe | $0 | $720-$2,880 | $8,640-$34,560
ARSA Embedded ASR | ~$10,000 | $0 | $0

Payback period: 3.1-10.5 months depending on usage volume
5-year cloud API spend avoided: $36,000-$172,800 per deployment site (five times the annual costs above, before deducting the ~$10,000 initial cost)
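
The cloud figures in the table above follow from the per-15-second prices quoted under Business Problem. The sketch below reproduces that arithmetic, assuming each query is billed as a single 15-second unit; that billing assumption and the function names are illustrative, not taken from either provider's price list.

```python
def monthly_api_cost(queries_per_month: int, price_per_query: float) -> float:
    """Recurring cloud cost for one month, in USD."""
    return round(queries_per_month * price_per_query, 2)


def break_even_months(initial_cost: float, monthly_cost: float) -> float:
    """Months of cloud spend needed to equal a one-time initial cost."""
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return initial_cost / monthly_cost


def avoided_spend(monthly_cost: float, years: int) -> float:
    """Gross cloud spend avoided over a period (initial cost not deducted)."""
    return round(monthly_cost * 12 * years, 2)


# Price range from the text: $0.006-$0.024 per 15 seconds.
low = monthly_api_cost(100_000, 0.006)
high = monthly_api_cost(100_000, 0.024)
print(f"Monthly cloud cost at 100K queries: ${low:,.0f}-${high:,.0f}")
print(f"Five-year spend at the low price: ${avoided_spend(low, 5):,.0f}")
```

Because break-even is the initial cost divided by monthly spend, it shortens in direct proportion as query volume or the per-query price rises.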

Customization Capability

  • Client retains full control over model retraining
  • Domain-specific vocabulary expansion (medical terminology, product names, regional dialects)
  • Inference parameter tuning without vendor dependency

Technical Differentiation

ARSA vs. Cloud ASR Providers

  • Latency: 50-80% reduction (eliminates network round-trip)
  • Privacy: 100% on-device processing
  • Cost predictability: Fixed CAPEX vs. variable OPEX

ARSA vs. Generic DeepSpeech Implementation

  • Turnkey embedded integration (hardware + firmware + software)
  • Indonesian language model pre-training
  • Production-ready VAD and audio pipeline (not research prototype)
  • 30-day delivery vs. 6-12 month in-house development cycle

ARSA vs. Proprietary Embedded ASR (e.g., Nuance, Sensory)

  • 70-85% lower licensing cost
  • Open-source foundation enables continuous improvement
  • No vendor lock-in for model updates or platform migration

Project Execution Structure

Deliverables Breakdown

Electronics R&D

  1. Development board procurement: MediaTek/Allwinner/Intel-based boards
  2. Programming rig assembly: custom flashing/testing rigs
  3. Purpose: Hardware validation, SoC compatibility testing, production prototype development

Firmware/Kernel Development

  • SoC kernel configuration for peripheral management (I2C, SPI, GPIO, audio codecs)
  • EEPROM bootloader for standalone boot sequence
  • Driver integration for client-specific hardware sensors/actuators

Linux Software Integration

  • DeepSpeech model training on Indonesian corpus (primary value component)
  • Real-time inference pipeline with VAD
  • Demo application with GUI for client validation
  • Documentation: API specification, deployment guide, model retraining tutorial

Timeline & Milestones

Milestone | Duration | Days
Programming Rig Assembly | Week 1 | 1-5
Kernel Configuration | Week 2 | 6-10
DeepSpeech Integration Development | Week 3-4 | 11-20
Device Testing Iteration | Week 5-6 | 21-30

Project Management:

  • Weekly progress updates via email/video call
  • Iterative testing with client feedback integration
  • Payment terms: 50% down payment, 50% post-delivery

Technical Deep Dive: Indonesian ASR Challenges

Embedded Deployment Constraints

ARM Platform Optimization:

  • CPU inference (no GPU/NPU): 4-core ARM Cortex-A53 @ 1.2-1.5GHz
  • RAM requirement: 1-2GB (model loading + inference buffer)
  • Storage: 500MB-1GB (model files + dependencies)
  • Thermal management: Passive cooling sufficient for continuous operation

Real-Time Performance:

  • Target: <1.5× real-time factor (1 second audio → <1.5 second processing)
  • Achieved: 0.5-1.2× RTF on Raspberry Pi 3B+, 0.3-0.8× RTF on RK3399
  • Optimization techniques: Quantization (FP32 → INT8), NEON SIMD acceleration
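
Real-time factor is processing time divided by audio duration, so values below 1.0 mean the device transcribes faster than the audio plays. A minimal way to measure it is sketched below; the helper names and the stand-in transcription callable are assumptions for illustration, and only the 1.5× target comes from the text.

```python
import time
from typing import Callable, Tuple

TARGET_RTF = 1.5  # from the text: processing must stay under 1.5x audio length


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed_rtf(transcribe: Callable[[], str], audio_seconds: float) -> Tuple[str, float]:
    """Run a transcription callable and report its result with the RTF."""
    start = time.perf_counter()
    text = transcribe()
    elapsed = time.perf_counter() - start
    return text, real_time_factor(elapsed, audio_seconds)


# Stand-in for a real inference call on a 3-second clip.
text, rtf = timed_rtf(lambda: "halo", audio_seconds=3.0)
print(f"RTF = {rtf:.3f}, within target: {rtf < TARGET_RTF}")
```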

Strategic Implications for ARSA

Capability Demonstration

R&D Credibility:

  • Proven ability to adapt frontier AI research (DeepSpeech) to production embedded systems
  • Cross-disciplinary execution: electronics, firmware, ML model training, Linux software engineering
  • Indonesian language AI specialization (rare competency in regional market)

Enterprise Integration Expertise:

  • Hardware-software co-design for constrained embedded platforms
  • Client-specific customization within fixed timeline/budget
  • Production deployment readiness (not just research prototype)

Conclusion

ARSA’s Embedded ASR project for Botika is an example of high-value AI services delivery: it combines an open-source foundation (Mozilla DeepSpeech) with deep domain expertise (Indonesian language, embedded systems integration) to solve privacy-critical, cost-sensitive use cases.

Core Strengths:

  • 22.7% gross margin on initial contract
  • 3-10 month payback vs. cloud alternatives for client
  • Platform potential: Rp200M-Rp350M 3-year LTV per customer
