Is the data collected ethically and with consent?

All data collection is consent-based and conducted under explicit data sharing agreements with individuals, cooperatives, and institutions. Government archive digitisation is conducted under formal agreements with the relevant authorities. Data subjects retain rights to correction and removal under GDPR-equivalent standards.

Can I access specific datasets for research?

Maestro AI Labs offers research partnership agreements for academic institutions and AI labs. Contact ceo@maestrosai.com with details of your research and the datasets you need.

What languages are in the 47-language dataset?

The dataset covers Caribbean English Creoles, Haitian Creole, Papiamentu, Sranan Tongo, and the major Caribbean Spanish and French dialects, plus indigenous languages from the regions where Maestro AI Labs operates in LATAM, the Pacific island territories, and Sub-Saharan Africa. A full language list is available under research partnership agreements.

What is the 28-territory government archive?

Maestro AI Labs has formal agreements with 28 Caribbean territories to digitise and structure pre-digital government administrative records. These include economic census data, land registry records, population records, and public financial data, dating back in some cases to the 1950s.

How is the data structured for AI training use?

Datasets are delivered in formats compatible with standard ML training pipelines: JSON, CSV, and Parquet for structured data. Audio and text datasets follow standard linguistic corpus formats. Documentation accompanies every dataset with field definitions, collection methodology, and data quality assessments.
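As a rough illustration of what pipeline-compatible delivery could look like for the CSV case, the sketch below parses a file and checks it against a field-definition document using only the Python standard library. The field names (record_id, territory, year, value), the FIELD_DEFINITIONS structure, and the sample rows are all hypothetical; the real schema is whatever the documentation shipped with each dataset specifies.

```python
import csv
import io

# Hypothetical field definitions, standing in for the documentation
# that accompanies a dataset. Real field names and types will differ.
FIELD_DEFINITIONS = {
    "record_id": str,
    "territory": str,
    "year": int,
    "value": float,
}


def load_records(csv_text, field_definitions):
    """Parse CSV text and cast each column to its documented type.

    Raises ValueError if a documented field is missing from the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = set(field_definitions) - set(header)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return [
        {name: cast(row[name]) for name, cast in field_definitions.items()}
        for row in reader
    ]


# Made-up sample rows, for illustration only.
SAMPLE_CSV = """record_id,territory,year,value
r-001,Example Territory A,1962,1250.5
r-002,Example Territory B,1974,310.0
"""

records = load_records(SAMPLE_CSV, FIELD_DEFINITIONS)
print(records[0])
```

Checking the header against the field definitions before casting means a schema mismatch fails loudly at load time rather than surfacing later inside a training job.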

Why is 92% of this data unavailable elsewhere?

The data exists in physical archives, oral tradition, and community cooperative records that have never been digitised and are not accessible through web scraping or standard data collection methods. Accessing it requires years of relationship-building with institutions, physical presence in communities, and the technical infrastructure to convert analogue records to structured formats.

Data Archaeology: The Field Work That Makes Every Other AI Product Possible

2.3M+ total structured records collected

Indigenous language datasets active

Caribbean territories with archive access

Informal market structures mapped

92% of data absent from any other AI training set

4.2 billion people transact, communicate, borrow, save, and live inside systems that the global AI training pipeline has never touched. The data exists. No one has gone to collect it.

Government archives in 28 Caribbean territories contain decades of economic, demographic, and administrative records that have never been digitised. 47 indigenous languages are spoken across the regions Maestro AI Labs covers. Pre-digital financial records from rotating savings associations, community cooperatives, and agricultural networks sit in physical files in offices across 36 economies.

Data Archaeology is the team and infrastructure that collects this material, structures it, and converts it into training-ready datasets. It is also the competitive foundation beneath every other product Maestro AI Labs builds. Credit Garden depends on the SUSU records, Harmonics on the regional knowledge graphs, and OYA AI on the Caribbean Sea climate data. Take away the archive and none of those products exist.

What We Collect

Government archives: Pre-digital administrative records from 28 Caribbean territories. Land registry data, economic census records, public financial records, demographic surveys. Most of this material has never been accessible to researchers because it exists only in physical form in government offices. Maestro AI Labs has built the institutional relationships and digitisation infrastructure to convert it.

Informal financial records: Rotating savings associations (SUSUs, tandas, paluwagans), community lending circles, agricultural cooperative records, and mobile money transaction histories. This is the economic behaviour of the credit-invisible population, collected with consent and structured for machine learning. This data feeds Credit Garden's scoring model directly.
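To make "structured for machine learning" concrete, here is a minimal sketch of how a rotating savings contribution history might be turned into numeric features. The Contribution fields and the participation_features function are hypothetical illustrations, not the actual record schema or Credit Garden's scoring model.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    """One member's payment for one cycle (hypothetical schema)."""
    member_id: str
    cycle: int
    amount_due: float
    amount_paid: float
    days_late: int


def participation_features(contributions):
    """Summarise a contribution history as simple numeric features."""
    total = len(contributions)
    if total == 0:
        return {"cycles": 0, "on_time_rate": 0.0, "paid_ratio": 0.0}
    # A cycle counts as on time only if it was paid in full with no delay.
    on_time = sum(
        1 for c in contributions
        if c.days_late == 0 and c.amount_paid >= c.amount_due
    )
    due = sum(c.amount_due for c in contributions)
    paid = sum(c.amount_paid for c in contributions)
    return {
        "cycles": total,
        "on_time_rate": on_time / total,
        "paid_ratio": paid / due if due else 0.0,
    }


# Made-up history for a single member, for illustration only.
history = [
    Contribution("m-01", 1, 50.0, 50.0, 0),
    Contribution("m-01", 2, 50.0, 50.0, 3),
    Contribution("m-01", 3, 50.0, 50.0, 0),
    Contribution("m-01", 4, 50.0, 25.0, 0),
]
features = participation_features(history)
print(features)
```

In this made-up history, two of the four cycles were paid in full and on time, so the on-time rate is 0.5, and 175 of the 200 due was paid, giving a paid ratio of 0.875.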

Language and communication data: 47 indigenous and creole language datasets covering Caribbean, LATAM, African, and Pacific languages. Most of these languages have no substantial machine learning training corpus. Maestro AI Labs' datasets represent the first structured training material for many of them. These power Harmonics agents' ability to operate correctly in regional language contexts.

Geospatial and climate data: Sub-kilometre resolution Caribbean Sea surface temperature data, atmospheric pressure records, storm track histories, and land use data, none of which exists in global datasets at a meaningful resolution for the Caribbean Basin. This feeds OYA AI directly.

"You cannot replicate this with 100 data scientists in San Francisco. The data is not online. The government archives are not digitised. The community credit records require years of relationship-building to access. Maestro AI Labs has already done that work. A new entrant starts five years behind."

Who Buys This

The global AI training data market is $2.3 billion and growing at 23% annually. The single largest gap in that market is data from emerging economies representing 4.2 billion people who are effectively absent from current training sets.
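For readers who want to see what 23% annual growth implies, the sketch below compounds the stated $2.3 billion figure forward. It assumes the growth rate stays constant, which is an assumption made here for illustration and not a forecast from Maestro AI Labs; the function name project_market_size is hypothetical.

```python
def project_market_size(current_billion, annual_growth, years):
    """Compound a market size forward at a constant annual growth rate."""
    return current_billion * (1 + annual_growth) ** years


# Stated figures: $2.3 billion today, growing 23% per year.
# Holding the rate constant over several years is an assumption.
for year in range(0, 6):
    size = project_market_size(2.3, 0.23, year)
    print(f"year {year}: ${size:.2f}B")
```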

AI labs training the next foundation models run into this gap directly: their models cannot read a SUSU participation record and have no training signal from 36 of the world's economies. Data Archaeology provides the raw material to close it.

Development banks including IDB, World Bank, and Caribbean Development Bank make investment decisions about economies where they have limited ground-truth data. Structured datasets from Data Archaeology improve the accuracy of economic modelling, poverty mapping, and program impact assessment.

Academic research institutions studying Caribbean, LATAM, and Pacific languages are direct buyers for the 47 indigenous language datasets, which represent original linguistic research that has never existed in machine-accessible form.

Governments and national statistics offices in the regions covered can use structured historical datasets to improve policy modelling, infrastructure planning, and development targeting.

Revenue Model and Internal Value

Data Archaeology generates revenue through licensing to AI training partners, research institution data access agreements, and government data partnership contracts. The external licensing market is significant.

The internal contribution compounds. Every new Data Archaeology collection improves the performance of every other Maestro AI Labs product: each new data pipeline makes Credit Garden more accurate, Harmonics more knowledgeable, OYA AI more precise, and Global Safety Score wider in coverage. Data Archaeology is therefore not a cost centre. It is the foundation of every competitive moat the company holds. A dataset licensed once externally keeps paying inside every product the company ships.
