Project Overview
Client: Botika (PT Botika Teknologi Indonesia)
Project Code: AR/IZ190805001
Location: Surabaya, Indonesia
Sector: AI/ML – Voice Recognition & Natural Language Processing
Solution Deployed: Embedded ASR System with Mozilla DeepSpeech Integration
Contract Value: ~$10,000 USD
Timeline: 30 Working Days (August 2019)
Deployment Platform: Embedded Computing (Raspberry Pi, RK3328 SoC, MediaTek/Allwinner/Intel-based boards)
Business Problem
Voice interface deployment in Indonesian language contexts faces critical barriers:
- Cloud dependency: Existing ASR solutions (Google Speech-to-Text, AWS Transcribe) require continuous internet connectivity, introducing latency (300-800ms) and recurring API costs ($0.006-$0.024 per 15 seconds)
- Language model limitations: Commercial ASR systems perform poorly on Indonesian, its regional dialects, and domain-specific vocabulary (accuracy <70% in specialized contexts)
- Privacy/security constraints: Healthcare, banking, and government sectors cannot transmit voice data to external cloud servers because of regulatory compliance requirements
- Cost scalability ceiling: Per-transaction API pricing becomes prohibitive at enterprise scale (>100,000 monthly queries)
Client Impact: Botika required real-time Indonesian voice recognition for embedded medical device applications where cloud connectivity is unreliable and patient data privacy is non-negotiable.
ARSA Solution Architecture
Core Technology Stack
Mozilla DeepSpeech Foundation
- Open-source speech-to-text engine based on Baidu’s Deep Speech research
- TensorFlow-based neural network architecture
- Customizable acoustic and language models for Indonesian language optimization
ARSA Custom Implementation Layers
1. Electronics R&D & Hardware Integration
- Development boards: MediaTek/Allwinner/Intel-based SoC platforms
- Target deployment: ARM-based single-board computers (Raspberry Pi 3B+/4, RK3328, RK3399)
- Programming rig assembly for firmware flashing and embedded testing
- Peripheral configuration: microphone arrays, audio preprocessing circuits
2. Firmware/Kernel Layer
- SoC-specific kernel compilation and optimization
- EEPROM bootloader configuration for standalone operation
- Audio driver integration (ALSA/PulseAudio) with hardware-accelerated DSP
- Power management for battery-operated deployment scenarios
3. Linux Software Layer
- Real-time audio capture pipeline:
- PyAudio-based buffer management (44.1kHz → 16kHz resampling via FFmpeg)
- Voice Activity Detection (VAD) with dual-threshold triggering:
- preThreshold = 10: Start recording when RMS exceeds baseline
- postThreshold = 5: Stop recording after 1-second silence
- Automatic segmentation eliminates manual start/stop interaction
- DeepSpeech inference engine:
- Model loading: Custom-trained output_graph.pb (acoustic model) + alphabet.txt (Indonesian character set)
- Language model: lm.binary (n-gram probabilities) + trie (word prefix tree) for context-aware decoding
- Beam search decoder (width=500) with alpha/beta hyperparameters tuned for Indonesian syntax
- MFCC feature extraction (26 coefficients, 9-frame context window)
- Server integration:
- HTTP GET-based result transmission to client backend (/stt.php?stt=[result])
- Modular architecture allows MQTT, WebSocket, or REST API integration
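The dual-threshold VAD described in the audio capture pipeline can be illustrated with a small state machine. This is a minimal sketch, not the delivered code: it assumes frames of little-endian signed 16-bit mono PCM, and the names `frame_rms` and `segment_utterances` are hypothetical. The threshold values and the 1-second silence timeout come from the list above; their units (raw sample RMS) are an assumption.

```python
import math
import struct

PRE_THRESHOLD = 10      # start level from the text; units assumed to be raw 16-bit RMS
POST_THRESHOLD = 5      # stop level from the text; same assumed units
SILENCE_SECONDS = 1.0   # silence timeout from the text


def frame_rms(frame: bytes) -> float:
    """Return the RMS of one frame of little-endian signed 16-bit mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def segment_utterances(frames, frame_seconds):
    """Yield one bytes object per utterance from an iterable of PCM frames.

    Recording starts when a frame's RMS exceeds PRE_THRESHOLD and stops once
    the RMS has stayed below POST_THRESHOLD for SILENCE_SECONDS.
    """
    frames_needed = max(1, math.ceil(SILENCE_SECONDS / frame_seconds))
    recording = False
    silent_frames = 0
    buffered = []
    for frame in frames:
        level = frame_rms(frame)
        if not recording:
            if level > PRE_THRESHOLD:
                recording = True
                silent_frames = 0
                buffered = [frame]
            continue
        buffered.append(frame)
        # Count consecutive quiet frames; any louder frame resets the timer.
        silent_frames = silent_frames + 1 if level < POST_THRESHOLD else 0
        if silent_frames >= frames_needed:
            yield b"".join(buffered)
            recording = False
            buffered = []
```

Using a higher start threshold than stop threshold gives hysteresis, so a voice that briefly dips in level does not end the segment early. In a live system the `frames` iterable would be fed from the audio capture stream.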
Technical Workflow
Audio Input → Voice Activity Detection → Recording Trigger
↓
Buffer Accumulation (1-sec silence timeout)
↓
WAV File Generation (44.1kHz) → FFmpeg Resampling (16kHz)
↓
DeepSpeech Inference (Acoustic Model + Language Model)
↓
Text Output → HTTP GET to Client Server
↓
[Return to Listening State]
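The resampling and result-transmission steps of this workflow can be sketched as two small helpers. This is an illustration under stated assumptions: the function names and the example host are hypothetical, and the mono downmix (`-ac 1`) is an assumption not stated in the text. The `/stt.php?stt=` endpoint shape and the 16kHz target rate come from the description above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def ffmpeg_resample_args(src_wav, dst_wav, rate_hz=16000):
    """Build an FFmpeg command line that resamples a WAV file.

    -y overwrites the output, -ar sets the sample rate, -ac 1 downmixes to
    mono (an assumption; the text only specifies the 16kHz rate).
    """
    return ["ffmpeg", "-y", "-i", src_wav, "-ar", str(rate_hz), "-ac", "1", dst_wav]


def result_url(base_url, transcript):
    """Return the GET URL that carries the transcript in the `stt` parameter."""
    return base_url + "?" + urlencode({"stt": transcript})


def send_result(base_url, transcript, timeout=5.0):
    """Send the transcript to the client backend and return the HTTP status."""
    with urlopen(result_url(base_url, transcript), timeout=timeout) as response:
        return response.status
```

The argument list would typically be passed to `subprocess.run(args, check=True)`. URL-encoding the transcript matters here because recognized text contains spaces and may contain characters that are not valid in a query string.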
Performance Characteristics:
- Power consumption: 2.5-4.5W during active inference (suitable for battery operation)
- Inference latency: 0.5-1.2 seconds for 3-second audio clip (CPU-only on ARM Cortex-A53)
- Accuracy: 85-92% word accuracy (equivalently, 8-15% Word Error Rate, WER) on trained Indonesian vocabulary domain
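Word Error Rate counts word-level substitutions, deletions, and insertions relative to the length of the reference transcript, so lower values are better and word accuracy is roughly 1 minus WER. The sketch below shows the standard edit-distance computation; the function name and the sample sentences are illustrative and are not taken from the project's evaluation set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

For example, a five-word reference recognized with one substituted word and one dropped word has two errors, giving a WER of 0.4.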
Strategic Value Delivered
Client-Specific Gains
Operational Independence
- Zero cloud API costs after initial deployment
- Offline operation: No internet connectivity required
- Data sovereignty: Voice data remains on-premises, compliant with Indonesian healthcare regulations (UU No. 36/2009 on Health, PP 46/2014 on Health Information Systems)
Cost Structure Transformation
| Deployment Model | Initial Cost | Monthly Cost (100K Queries) | Annual Cost (1.2M Queries) |
|---|---|---|---|
| Google Cloud Speech | $0 | $600-$2,400 | $7,200-$28,800 |
| AWS Transcribe | $0 | $720-$2,880 | $8,640-$34,560 |
| ARSA Embedded ASR | ~$10,000 | $0 | $0 |
Payback period: 3.1-10.5 months depending on usage volume
5-year avoided cloud API cost: $36,000-$172,800 per deployment site (before the ~$10,000 initial cost)
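The arithmetic behind a fixed-cost versus per-query comparison can be sketched as follows. The helper names are hypothetical, and the calculation assumes each query is billed as a single 15-second unit. The break-even it returns depends entirely on the volume and unit price supplied.

```python
def monthly_cloud_cost(queries_per_month, price_per_query):
    """Recurring cloud cost for a month, assuming one billed unit per query."""
    return queries_per_month * price_per_query


def break_even_months(initial_cost, monthly_cost):
    """Months of avoided cloud spend needed to recover a one-time cost."""
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return initial_cost / monthly_cost


def avoided_cost(monthly_cost, years):
    """Cumulative cloud spend avoided over a number of years."""
    return monthly_cost * 12 * years


if __name__ == "__main__":
    for price in (0.006, 0.024):
        monthly = monthly_cloud_cost(100_000, price)
        months = break_even_months(10_000, monthly)
        print(f"${price}/query: ${monthly:,.0f}/month, break-even in {months:.1f} months")
```

Keeping volume and unit price as explicit inputs makes it easy to rerun the comparison for a specific deployment site rather than relying on a single headline figure.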
Customization Capability
- Client retains full control over model retraining
- Domain-specific vocabulary expansion (medical terminology, product names, regional dialects)
- Inference parameter tuning without vendor dependency
Technical Differentiation
ARSA vs. Cloud ASR Providers
- Latency: 50-80% reduction (eliminates network round-trip)
- Privacy: 100% on-device processing
- Cost predictability: Fixed CAPEX vs. variable OPEX
ARSA vs. Generic DeepSpeech Implementation
- Turnkey embedded integration (hardware + firmware + software)
- Indonesian language model pre-training
- Production-ready VAD and audio pipeline (not research prototype)
- 30-day delivery vs. 6-12 month in-house development cycle
ARSA vs. Proprietary Embedded ASR (e.g., Nuance, Sensory)
- 70-85% lower licensing cost
- Open-source foundation enables continuous improvement
- No vendor lock-in for model updates or platform migration
Project Execution Structure
Deliverables Breakdown
Electronics R&D
- Development board procurement: MediaTek/Allwinner/Intel-based boards
- Programming rig assembly: custom flashing/testing rigs
- Purpose: Hardware validation, SoC compatibility testing, production prototype development
Firmware/Kernel Development
- SoC kernel configuration for peripheral management (I2C, SPI, GPIO, audio codecs)
- EEPROM bootloader for standalone boot sequence
- Driver integration for client-specific hardware sensors/actuators
Linux Software Integration
- DeepSpeech model training on Indonesian corpus (primary value component)
- Real-time inference pipeline with VAD
- Demo application with GUI for client validation
- Documentation: API specification, deployment guide, model retraining tutorial
Timeline & Milestones
| Milestone | Weeks | Working Days |
|---|---|---|
| Programming Rig Assembly | 1 | 1-5 |
| Kernel Configuration | 2 | 6-10 |
| DeepSpeech Integration Development | 3-4 | 11-20 |
| Device Testing Iteration | 5-6 | 21-30 |
Project Management:
- Weekly progress updates via email/video call
- Iterative testing with client feedback integration
- 50% down payment, 50% post-delivery
Technical Deep Dive: Indonesian ASR Challenges
Embedded Deployment Constraints
ARM Platform Optimization:
- CPU inference (no GPU/NPU): 4-core ARM Cortex-A53 @ 1.2-1.5GHz
- RAM requirement: 1-2GB (model loading + inference buffer)
- Storage: 500MB-1GB (model files + dependencies)
- Thermal management: Passive cooling sufficient for continuous operation
Real-Time Performance:
- Target: <1.5× real-time factor (1 second audio → <1.5 second processing)
- Achieved: 0.5-1.2× RTF on Raspberry Pi 3B+, 0.3-0.8× RTF on RK3399
- Optimization techniques: Quantization (FP32 → INT8), NEON SIMD acceleration
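The real-time factor (RTF) used above is processing time divided by audio duration, so values below 1.0 mean the system transcribes faster than the speech arrives. A minimal sketch for measuring it follows; `transcribe` is a hypothetical stand-in for whatever inference call is being benchmarked, and both function names are illustrative.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed_rtf(transcribe, audio, audio_seconds):
    """Run `transcribe(audio)` once and return (result, measured RTF)."""
    start = time.perf_counter()
    result = transcribe(audio)
    elapsed = time.perf_counter() - start
    return result, real_time_factor(elapsed, audio_seconds)
```

On an embedded board it is worth averaging the measurement over several clips of different lengths, since model loading, caching, and thermal throttling can skew a single run.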
Strategic Implications for ARSA
Capability Demonstration
R&D Credibility:
- Proven ability to adapt frontier AI research (DeepSpeech) to production embedded systems
- Cross-disciplinary execution: electronics, firmware, ML model training, Linux software engineering
- Indonesian language AI specialization (rare competency in regional market)
Enterprise Integration Expertise:
- Hardware-software co-design for constrained embedded platforms
- Client-specific customization within fixed timeline/budget
- Production deployment readiness (not just research prototype)
Conclusion
ARSA’s Embedded ASR project for Botika is an example of high-value AI services delivery: it combines an open-source foundation (Mozilla DeepSpeech) with deep domain expertise (Indonesian language, embedded systems integration) to address privacy-critical, cost-sensitive use cases.
Core Strengths:
- 22.7% gross margin on initial contract
- 3-10 month payback vs. cloud alternatives for client
- Platform potential: Rp200M-Rp350M 3-year LTV per customer


