How AI can help Indigenous language revitalization, and why data sovereignty is important
Indigenous language experts working in computer science say artificial intelligence is a useful tool in language revitalization, but communities must prioritize the ownership of their data.
"It's just going to be like a pencil. It's useful but it's not going to save our language," said Michael Running Wolf, a former engineer for Amazon's Alexa and co-founder of Lakota AI Code camp, a summer program for high school students where they gain experience developing mobile apps that incorporate Indigenous knowledge and methods.
Running Wolf is Lakota, Cheyenne and Blackfeet and grew up on the Northern Cheyenne reservation in Montana. Despite the low-tech home he was raised in, often without running water or electricity, his mother, who engineered microchips for Hewlett-Packard, taught him math and physics by kerosene lamp.
"It was their [parents] perspective that technology was not incompatible with Indigenous ways of knowing," he said.
This, along with being surrounded by speakers of his traditional language while growing up, encouraged his current work using AI to help support Indigenous language revitalization.
There are limitations, Running Wolf said, like sparse data and the polysynthetic nature of many Indigenous languages.
An efficient AI, for example, can take 50,000 hours of English to create automatic speech recognition. Most Indigenous languages have so few speakers there is insufficient data to train AI, he said, and AI cannot recognize or understand things it's never seen before and requires information to replicate.
Also, languages such as Cheyenne and Blackfeet are polysynthetic and fusional, meaning prefixes and suffixes blend into words so the roots are not apparent.
He said he intends to overcome these limitations by working with communities to develop a manageable data set that will train AI.
"We generate 500 phrases in Makah and Kwak'wala, defined by the community and also the rules of the language, obviously, and we trained the AI to recognize those 500 phrases and those 500 phrases are used in curricula," he said.
"So the goal here is that, when they go to a classroom, they get their exercise in person and then they can go home and practise using the AI."
Running Wolf emphasized the importance of the community's agency in their language revitalization, particularly when it comes to AI.
"We have to have our own engineers. We need to have our own computer scientists using the software … We need to have sovereignty over our own data, set the terms and that's the only way to build this AI," he said.
He pointed to a recent dispute between the Standing Rock Sioux and a corporation that copyrighted their language materials. He said he's mindful of the reciprocal nature of giving back to the communities he's working with, and said they're working with lawyers to create contracts that ensure any data collected remains with the community.
Running Wolf said AI requires a lot of data to upgrade, and there are companies and academics who want a community's data because there's potential revenue in selling it to companies like Google, Microsoft and Meta.
'Sweat equity'
Robbie Jimerson, who has a PhD in computing and information sciences and speaks Seneca, developed a Seneca and Oneida speech recognition system as part of his dissertation. He agrees that people need to be guardians of their own data and said he is grateful for his time spent listening to the first language speakers in his community.
"To me, there's nothing better than, you know, having a conversation in Seneca with somebody," Jimerson said.
"There was a lot of sweat equity that went into it to train these models. You need a data set, right? So who's going to create those data sets…. For me, being a speaker, I was able to do both of those things."