Hello, this is my third article on Habr. Previously I wrote about the ALM language model. Now I want to introduce ASC, a typo-correction system built on top of ALM.
Yes, there are plenty of typo-correction systems, and each has its own strengths and weaknesses. Among the open ones I would single out JamSpell as one of the most promising, and that is the one we will compare against. There is also a similar system from DeepPavlov, which many readers may think of, but I never managed to get along with it.
Feature list:
- Correction of errors in words within a Levenshtein distance of up to 4.
- Correction of typos in words (insertion, deletion, substitution, and transposition of characters).
- Yofication (restoring the letter ё), taking context into account.
- Restoring the capitalization of a word's first letter (for proper names and titles), taking context into account.
- Splitting merged words into separate words, taking context into account.
- Analysis of the text without modifying the original.
- Searching the text for errors, typos, and incorrect context.
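To make the four edit operations from the list concrete, here is a minimal sketch of an edit distance that counts insertions, deletions, substitutions, and transpositions of adjacent characters (the "optimal string alignment" variant). It is only an illustration: the function names `osa_distance` and `candidates` are hypothetical, and nothing here is taken from how ASC computes distances or selects candidates internally.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def candidates(word, vocabulary, max_distance=4):
    """Return (distance, entry) pairs within max_distance, closest first."""
    scored = []
    for entry in vocabulary:
        distance = osa_distance(word, entry)
        if distance <= max_distance:
            scored.append((distance, entry))
    return sorted(scored)


print(candidates("wrod", ["word", "world", "sword", "banana"], max_distance=2))
```

A plain distance filter like this ignores context entirely; the features above that mention context would need a language model to rank the candidates.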
Supported operating systems:
- Mac OS X
- FreeBSD
- Linux
The system is written in C++11, and there is a port for Python 3.
Ready-made dictionaries
Name | Size (GB) | RAM (GB) | N-gram size | Language |
---|---|---|---|---|
wittenbell-3-big.asc | 1.97 | 15.6 | 3 | RU |
wittenbell-3-middle.asc | 1.24 | 9.7 | 3 | RU |
mkneserney-3-middle.asc | 1.33 | 9.7 | 3 | RU |
wittenbell-3-single.asc | 0.772 | 5.14 | 3 | RU |
wittenbell-5-single.asc | 1.37 | 10.7 | 5 | RU |
Testing
The system was tested on data from the 2016 Dialog21 "typo correction" competition. A trained binary dictionary was used for the tests: wittenbell-3-middle.asc
Test | Precision | Recall | F-measure |
---|---|---|---|
Typo correction mode | 76.97 | 62.71 | 69.11 |
Error correction mode | 73.72 | 60.53 | 66.48 |
I think adding other data is unnecessary; anyone who wishes can repeat the test. All the materials used in the test are attached below.
Materials used in testing
- test.txt: text for testing
- correct.txt: text with the correct variants
- evaluate.py: Python 3 script for calculating the correction results
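For readers who want to check the numbers, the last column of the results table (the F-measure) is consistent with the harmonic mean of precision and recall. The sketch below recomputes it from the two reported rows; `f_measure` is a hypothetical helper written for this illustration and is not taken from the evaluation script.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table, in percent
for name, p, r in [
    ("typo correction mode", 76.97, 62.71),
    ("error correction mode", 73.72, 60.53),
]:
    print(name, round(f_measure(p, r), 2))
```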
Now it is interesting to compare typo-correction systems under equal conditions: we will train two different typo correctors on the same text data and run the same test.
For the comparison, let's take the typo-correction system I mentioned above, JamSpell.
ASC vs JamSpell
Installation
ASC
$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
JamSpell
$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell
$ mkdir ./build
$ cd ./build
$ cmake ..
$ make
Training
ASC
train.json
{
"ext": "txt",
"size": 3,
"alter": {"":""},
"debug": 1,
"threads": 0,
"method": "train",
"allow-unk": true,
"reset-unk": true,
"confidence": true,
"interpolate": true,
"mixed-dicts": true,
"only-token-words": true,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"corpus": "./texts/correct.txt",
"w-bin": "./dictionary/3-middle.asc",
"w-vocab": "./train/lm.vocab",
"w-arpa": "./train/lm.arpa",
"mix-restwords": "./similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
Python3
import asc
asc.setSize(3)
asc.setAlmV2()
asc.setThreads(0)
asc.setLocale("en_US.UTF-8")
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.allowUnk)
asc.setOption(asc.options_t.resetUnk)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.tokenWords)
asc.setOption(asc.options_t.confidence)
asc.setOption(asc.options_t.interpolate)
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Progress callbacks
def statusArpa1(status):
    print("Build arpa", status)
def statusArpa2(status):
    print("Write arpa", status)
def statusVocab(status):
    print("Write vocab", status)
def statusIndex(text, status):
    print(text, status)
def status(text, status):
    print(text, status)
asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)
asc.buildArpa(statusArpa1)
asc.writeArpa("./train/lm.arpa", statusArpa2)
asc.writeVocab("./train/lm.vocab", statusVocab)
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
JamSpell
$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin
Testing
ASC
spell.json
{
"debug": 1,
"threads": 0,
"method": "spell",
"spell-verbose": true,
"confidence": true,
"mixed-dicts": true,
"asc-split": true,
"asc-alter": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"asc-wordrep": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"r-bin": "./dictionary/3-middle.asc"
}
$ ./asc -r-json ./spell.json
Python3
import asc
asc.setAlmV2()
asc.setThreads(0)
asc.setOption(asc.options_t.uppers)
asc.setOption(asc.options_t.ascSplit)
asc.setOption(asc.options_t.ascAlter)
asc.setOption(asc.options_t.ascESplit)
asc.setOption(asc.options_t.ascRSplit)
asc.setOption(asc.options_t.ascUppers)
asc.setOption(asc.options_t.ascHyphen)
asc.setOption(asc.options_t.ascWordRep)
asc.setOption(asc.options_t.mixDicts)
asc.setOption(asc.options_t.confidence)
def status(text, status):
    print(text, status)
asc.loadIndex("./dictionary/3-middle.asc", "", status)
# Correct the test file line by line and write the first element of each result
with open('./texts/test.txt') as f1, open('./texts/output.txt', 'w') as f2:
    for line in f1:
        res = asc.spell(line)
        f2.write("%s\n" % res[0])
JamSpell
- Python , C++
#include <fstream>
#include <iostream>
#include <jamspell/spell_corrector.hpp>
// If Boost is used for the conversion
#ifdef USE_BOOST_CONVERT
#include <boost/locale/encoding_utf.hpp>
// Otherwise use the standard library
#else
#include <locale>
#include <codecvt>
#endif
using namespace std;
/**
 * convert Convert a wide string to a UTF-8 string
 * @param str wide string to convert
 * @return    UTF-8 encoded string
 */
const string convert(const wstring & str){
    // Result of the conversion
    string result = "";
    // Convert only a non-empty string
    if(!str.empty()){
// If Boost is used
#ifdef USE_BOOST_CONVERT
        using boost::locale::conv::utf_to_utf;
        // Convert to UTF-8
        result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
        // Conversion type for UTF-8
        using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
        wstring_convert <convert_type, wchar_t> conv;
        // wstring_convert <codecvt_utf8 <wchar_t>> conv;
        // Convert to UTF-8
        result = conv.to_bytes(str);
#endif
    }
    return result;
}
/**
 * convert Convert a UTF-8 string to a wide string
 * @param str UTF-8 encoded string to convert
 * @return    wide string
 */
const wstring convert(const string & str){
    // Result of the conversion
    wstring result = L"";
    // Convert only a non-empty string
    if(!str.empty()){
// If Boost is used
#ifdef USE_BOOST_CONVERT
        using boost::locale::conv::utf_to_utf;
        // Convert from UTF-8
        result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
        // wstring_convert <codecvt_utf8 <wchar_t>> conv;
        wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
        // Convert from UTF-8
        result = conv.from_bytes(str);
#endif
    }
    return result;
}
/**
 * safeGetline Read one line from a stream, accepting \n, \r\n and \r line endings
 * @param is input stream
 * @param t  string that receives the line
 * @return   the input stream
 */
istream & safeGetline(istream & is, string & t){
    // Clear the output string
    t.clear();
    istream::sentry se(is, true);
    streambuf * sb = is.rdbuf();
    for(;;){
        int c = sb->sbumpc();
        switch(c){
            case '\n': return is;
            case '\r':
                if(sb->sgetc() == '\n') sb->sbumpc();
                return is;
            case streambuf::traits_type::eof():
                if(t.empty()) is.setstate(ios::eofbit);
                return is;
            default: t += (char) c;
        }
    }
}
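/**
 * main Correct the test file with JamSpell and write the result
 */
int main(){
    // Create the corrector
    NJamSpell::TSpellCorrector corrector;
    // Load the trained language model
    corrector.LoadLangModel("model.bin");
    // Open the file with the test text
    ifstream file1("./test_data/test.txt", ios::in);
    // If the input file is open
    if(file1.is_open()){
        // Current line and its corrected version
        string line = "", res = "";
        // Open the output file
        ofstream file2("./test_data/output.txt", ios::out);
        // If the output file is open
        if(file2.is_open()){
            // Read the input line by line
            while(file1.good()){
                safeGetline(file1, line);
                // Skip empty lines
                if(!line.empty()){
                    // Correct the line
                    res = convert(corrector.FixFragment(convert(line)));
                    // If a result was produced, write it followed by a line break
                    if(!res.empty()){
                        res.append("\n");
                        file2.write(res.c_str(), res.size());
                    }
                }
            }
            // Close the output file
            file2.close();
        }
        // Close the input file
        file1.close();
    }
    return 0;
}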
$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test
$ ./bin/test
Results
Obtaining the results
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt
ASC
Precision | Recall | F-measure |
---|---|---|
92.13 | 82.51 | 87.05 |
JamSpell
Precision | Recall | F-measure |
---|---|---|
77.87 | 63.36 | 69.87 |
One of the main features of ASC is learning from dirty data. It is practically impossible to find open-access text corpora that are free of errors and typos. Fixing terabytes of data by hand is not realistic, yet we have to work with them somehow.
The training approach I propose
- Build a language model from the dirty data
- Remove all rare words and N-grams from the assembled language model
- Add individual words so that the typo-correction system works more correctly
- Build the binary dictionary
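The idea behind the first three steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not what ALM does: it prunes by a raw occurrence count (`min_count` is a hypothetical parameter), whereas the ALM configuration later in the article uses its own `vprune-wltf` threshold. The `keep` argument plays the role of a whitelist of words that must survive pruning.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count N-grams of size n over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, min_count, keep=()):
    """Drop N-grams seen fewer than min_count times, except those in keep."""
    protected = set(keep)
    return Counter({
        ngram: c for ngram, c in counts.items()
        if c >= min_count or ngram in protected
    })


# A tiny "dirty" corpus: "teh" is a typo, "sofa" is a rare but valid word
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "teh cat sat on the mat",
]
unigrams = count_ngrams(corpus, 1)
print(prune(unigrams, min_count=2))
print(prune(unigrams, min_count=2, keep=[("sofa",)]))
```

The second call shows why the third step matters: frequency pruning removes the typo, but it also removes valid rare words unless they are added back explicitly.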
Let's get started
Suppose we have several corpora on different subjects. It is most logical to train on them separately and then merge the results.
Collecting the corpus with ALM
collect.json
{
"size": 3,
"debug": 1,
"threads": 0,
"ext": "txt",
"method": "train",
"allow-unk": true,
"mixed-dicts": true,
"only-token-words": true,
"smoothing": "wittenbell",
"locale": "en_US.UTF-8",
"w-abbr": "./output/alm.abbr",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"w-words": "./output/words.txt",
"corpus": "./texts/corpus",
"abbrs": "./abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./collect.json
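- size: N-gram size is set to 3
- debug: debug output level
- threads: number of threads to use
- ext: extension of the corpus files
- allow-unk: allow the 〈unk〉 token
- mixed-dicts: allow words made of mixed alphabets
- only-token-words: collect N-grams as sequences of words
- smoothing: wittenbell smoothing algorithm
- locale: environment locale
- w-abbr: where to write the collected abbreviation suffixes
- w-map: where to write the sequence map
- w-vocab: where to write the vocabulary
- w-words: where to write the list of collected words
- corpus: directory with the corpus texts
- abbrs: file with commonly used abbreviations
- goodwords: whitelist of words
- badwords: blacklist of words
- mix-restwords: file with similar letters of different alphabets
- alphabet: alphabet used for training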
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the letter substitutes
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Allow words made of mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Collect N-grams as sequences of words
alm.setOption(alm.options_t.tokenWords)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the commonly used abbreviations
with open('./abbrs/abbrs.txt') as f:
    for abbr in f:
        alm.addAbbr(abbr.replace("\n", ""))
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f:
        alm.addGoodword(word.replace("\n", ""))
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f:
        alm.addBadword(word.replace("\n", ""))
# Progress callbacks
def status(text, status):
    print(text, status)
def statusWords(status):
    print("Write words", status)
def statusVocab(status):
    print("Write vocab", status)
def statusMap(status):
    print("Write map", status)
def statusSuffix(status):
    print("Write suffix", status)
# Collect the corpus
alm.collectCorpus("./texts/corpus", status)
# Write the list of collected words
alm.writeWords("./output/words.txt", statusWords)
# Write the vocabulary
alm.writeVocab("./output/alm.vocab", statusVocab)
# Write the sequence map
alm.writeMap("./output/alm.map", statusMap)
# Write the abbreviation suffixes
alm.writeSuffix("./output/alm.abbr", statusSuffix)
Pruning the assembled corpus with ALM
prune.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "vprune",
"vprune-wltf": -15.0,
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./corpus1/alm.map",
"r-vocab": "./corpus1/alm.vocab",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./prune.json
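- size: N-gram size is set to 3
- debug: debug output level
- allow-unk: allow the 〈unk〉 token
- vprune-wltf: pruning threshold on word weight (-15.0 here)
- locale: environment locale
- smoothing: wittenbell smoothing algorithm
- r-map: sequence map to read
- r-vocab: vocabulary to read
- w-map: where to write the pruned sequence map
- w-vocab: where to write the pruned vocabulary
- goodwords: whitelist of words
- badwords: blacklist of words
- alphabet: alphabet used for training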
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f:
        alm.addGoodword(word.replace("\n", ""))
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f:
        alm.addBadword(word.replace("\n", ""))
# Progress callbacks
def statusPrune(status):
    print("Prune data", status)
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusWriteVocab(status):
    print("Write vocab", status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusWriteMap(status):
    print("Write map", status)
# Read the vocabulary
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
# Read the sequence map
alm.readMap("./corpus1/alm.map", statusReadMap)
# Prune the vocabulary
alm.pruneVocab(-15.0, 0, 0, statusPrune)
# Write the pruned vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the pruned sequence map
alm.writeMap("./output/alm.map", statusWriteMap)
Merging the collected data with ALM
merge.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"method": "merge",
"mixed-dicts": "true",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-words": "./texts/words",
"r-map": "./corpus1",
"r-vocab": "./corpus1",
"w-map": "./output/alm.map",
"w-vocab": "./output/alm.vocab",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./merge.json
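- size: N-gram size is set to 3
- debug: debug output level
- allow-unk: allow the 〈unk〉 token
- mixed-dicts: allow words made of mixed alphabets
- locale: environment locale
- smoothing: wittenbell smoothing algorithm
- r-words: words to add to the vocabulary
- r-map: directory with the sequence map files to merge
- r-vocab: directory with the vocabulary files to merge
- w-map: where to write the merged sequence map
- w-vocab: where to write the merged vocabulary
- goodwords: whitelist of words
- badwords: blacklist of words
- alphabet: alphabet used for training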
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the letter substitutes
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Allow words made of mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f:
        alm.addGoodword(word.replace("\n", ""))
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f:
        alm.addBadword(word.replace("\n", ""))
# Add the extra words to the vocabulary
with open('./texts/words.txt') as f:
    for word in f:
        alm.addWord(word.replace("\n", ""))
# Progress callbacks
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusWriteVocab(status):
    print("Write vocab", status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusWriteMap(status):
    print("Write map", status)
# Read the vocabularies from the directory
alm.readVocab("./corpus1", statusReadVocab)
# Read the sequence maps from the directory
alm.readMap("./corpus1", statusReadMap)
# Write the merged vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the merged sequence map
alm.writeMap("./output/alm.map", statusWriteMap)
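Conceptually, merging data collected from separate corpora comes down to summing counts for entries that appear in more than one source. The sketch below shows that idea with plain word counts; `merge_counts` is a hypothetical helper, and ALM's vocabulary and map files carry more information than a simple counter.

```python
from collections import Counter


def merge_counts(*vocabularies):
    """Sum the counts of several word-count mappings into one Counter."""
    merged = Counter()
    for vocabulary in vocabularies:
        merged.update(vocabulary)  # Counter.update adds counts, it does not replace them
    return merged


news = {"cat": 2, "dog": 1}
forum = {"cat": 3, "fish": 4}
print(merge_counts(news, forum))
```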
Training the language model with ALM
train.json
{
"size": 3,
"debug": 1,
"allow-unk": true,
"reset-unk": true,
"interpolate": true,
"method": "train",
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"r-map": "./output/alm.map",
"r-vocab": "./output/alm.vocab",
"w-arpa": "./output/alm.arpa",
"w-words": "./output/words.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz"
}
$ ./alm -r-json ./train.json
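- size: N-gram size is set to 3
- debug: debug output level
- allow-unk: allow the 〈unk〉 token
- reset-unk: reset the frequency of the 〈unk〉 token
- interpolate: use interpolation when estimating
- locale: environment locale
- smoothing: wittenbell smoothing algorithm
- r-map: sequence map to read, as produced in the previous steps
- r-vocab: vocabulary to read, as produced in the previous steps
- w-arpa: where to write the ARPA file
- w-words: where to write the list of words
- alphabet: alphabet used for training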
Python
import alm
# N-gram size is 3
alm.setSize(3)
# Number of threads
alm.setThreads(0)
# Set the locale
alm.setLocale("en_US.UTF-8")
# Set the alphabet
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the letter substitutes
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Allow the <unk> token
alm.setOption(alm.options_t.allowUnk)
# Reset the frequency of the <unk> token
alm.setOption(alm.options_t.resetUnk)
# Allow words made of mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Use interpolation when estimating
alm.setOption(alm.options_t.interpolate)
# Use the wittenbell smoothing algorithm
alm.init(alm.smoothing_t.wittenBell)
# Progress callbacks
def statusReadVocab(text, status):
    print("Read vocab", text, status)
def statusReadMap(text, status):
    print("Read map", text, status)
def statusBuildArpa(status):
    print("Build ARPA", status)
def statusWriteMap(status):
    print("Write map", status)
def statusWriteArpa(status):
    print("Write ARPA", status)
def statusWords(status):
    print("Write words", status)
# Read the vocabulary
alm.readVocab("./output/alm.vocab", statusReadVocab)
# Read the sequence map
alm.readMap("./output/alm.map", statusReadMap)
# Build the language model
alm.buildArpa(statusBuildArpa)
# Write the ARPA file
alm.writeArpa("./output/alm.arpa", statusWriteArpa)
# Write the list of words
alm.writeWords("./output/words.txt", statusWords)
Training the ASC typo corrector
train.json
{
"size": 3,
"debug": 1,
"threads": 0,
"confidence": true,
"mixed-dicts": true,
"method": "train",
"alter": {"":""},
"locale": "en_US.UTF-8",
"smoothing": "wittenbell",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"w-bin": "./dictionary/3-single.asc",
"r-abbr": "./output/alm.abbr",
"r-vocab": "./output/alm.vocab",
"r-arpa": "./output/alm.arpa",
"abbrs": "./texts/abbrs/abbrs.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"alters": "./texts/alters/yoficator.txt",
"upwords": "./texts/words/upp",
"mix-restwords": "./texts/similars/letters.txt",
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"bin-code": "ru",
"bin-name": "Russian",
"bin-author": "You name",
"bin-copyright": "You company LLC",
"bin-contacts": "site: https://example.com, e-mail: info@example.com",
"bin-lictype": "MIT",
"bin-lictext": "... License text ...",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./train.json
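- size: N-gram size is set to 3
- debug: debug output level
- threads: number of threads to use
- confidence: load the ARPA data as-is, without pre-processing the words
- mixed-dicts: allow words made of mixed alphabets
- alter: alternative letters that may be substituted in words
- locale: environment locale
- smoothing: wittenbell smoothing algorithm
- pilots: pilot words (words consisting of a single letter)
- w-bin: where to write the binary dictionary
- r-abbr: abbreviation suffixes collected in the previous steps
- r-vocab: vocabulary collected in the previous steps
- r-arpa: ARPA file built in the previous step
- abbrs: file with commonly used abbreviations
- goodwords: whitelist of words
- badwords: blacklist of words
- alters: file with words that contain alternative letters
- upwords: directory with words that are written with a capital letter
- mix-restwords: file with similar letters of different alphabets
- alphabet: alphabet used for training
- bin-code: language code of the dictionary
- bin-name: name of the dictionary
- bin-author: author of the dictionary
- bin-copyright: copyright holder of the dictionary
- bin-contacts: contact details of the author
- bin-lictype: license type
- bin-lictext: license text
- embedding-size: size of the embedding
- embedding: embedding map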
Python
import asc
# N-gram size is 3
asc.setSize(3)
# Number of threads
asc.setThreads(0)
# Set the locale
asc.setLocale("en_US.UTF-8")
# Allow correcting the case of letters
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Allow words made of mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, without pre-processing the words
asc.setOption(asc.options_t.confidence)
# Set the alphabet
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the pilot words (words consisting of a single letter)
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Set the letter substitutes
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f:
        asc.addGoodword(word.replace("\n", ""))
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f:
        asc.addBadword(word.replace("\n", ""))
# Load the abbreviation suffixes
with open('./output/alm.abbr') as f:
    for word in f:
        asc.addSuffix(word.replace("\n", ""))
# Load the commonly used abbreviations
with open('./texts/abbrs/abbrs.txt') as f:
    for abbr in f:
        asc.addAbbr(abbr.replace("\n", ""))
# Load the words that are written with a capital letter
with open('./texts/words/upp/words.txt') as f:
    for word in f:
        asc.addUWord(word.replace("\n", ""))
# Add an alternative letter
asc.addAlt("", "")
# Load the words with alternative letters (tab-separated pairs)
with open('./texts/alters/yoficator.txt') as f:
    for words in f:
        words = words.replace("\n", "").split('\t')
        asc.addAlt(words[0], words[1])
# Progress callbacks
def statusIndex(text, status):
    print(text, status)
def statusBuildIndex(status):
    print("Build index", status)
def statusArpa(status):
    print("Read arpa", status)
def statusVocab(status):
    print("Read vocab", status)
# Read the ARPA file
asc.readArpa("./output/alm.arpa", statusArpa)
# Read the vocabulary
asc.readVocab("./output/alm.vocab", statusVocab)
# Set the language code of the dictionary
asc.setCode("RU")
# Set the license type
asc.setLictype("MIT")
# Set the name of the dictionary
asc.setName("Russian")
# Set the author of the dictionary
asc.setAuthor("You name")
# Set the copyright holder
asc.setCopyright("You company LLC")
# Set the license text
asc.setLictext("... License text ...")
# Set the contact details
asc.setContacts("site: https://example.com, e-mail: info@example.com")
# Set the embedding map and its size
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusBuildIndex)
# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)
I understand that not everyone will be able to train their own binary dictionary; this requires text corpora and significant computing resources. For that reason, ASC can also work with nothing but an ARPA file as its main dictionary.
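For readers who have not seen an ARPA file, it is a plain-text format: a `\data\` header with N-gram counts, then one section per N-gram order where each line holds a log10 probability, the words, and an optional back-off weight, closed by `\end\`. The sketch below parses that general layout from a string. It is an illustration only: `parse_arpa` and the sample data are made up for this example and say nothing about how ASC itself reads the file.

```python
def parse_arpa(text):
    """Parse ARPA text into {order: {words_tuple: (logprob, backoff_or_None)}}."""
    ngrams = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the header with N-gram counts
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        # Section marker such as "\2-grams:"
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            ngrams[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional and follows the words
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (logprob, backoff)
    return ngrams


# Made-up sample data in the ARPA layout
SAMPLE = """\\data\\
ngram 1=3
ngram 2=1

\\1-grams:
-0.90 <s> -0.30
-0.60 hello -0.25
-0.70 </s>

\\2-grams:
-0.20 <s> hello

\\end\\
"""

model = parse_arpa(SAMPLE)
print(sorted(model))
print(model[1][("hello",)])
```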
Example of use
spell.json
{
"ad": 13,
"cw": 38120,
"debug": 1,
"threads": 0,
"method": "spell",
"alter": {"":""},
"asc-split": true,
"asc-alter": true,
"confidence": true,
"asc-esplit": true,
"asc-rsplit": true,
"asc-uppers": true,
"asc-hyphen": true,
"mixed-dicts": true,
"asc-wordrep": true,
"spell-verbose": true,
"r-text": "./texts/test.txt",
"w-text": "./texts/output.txt",
"upwords": "./texts/words/upp",
"r-arpa": "./dictionary/alm.arpa",
"r-abbr": "./dictionary/alm.abbr",
"abbrs": "./texts/abbrs/abbrs.txt",
"alters": "./texts/alters/yoficator.txt",
"mix-restwords": "./similars/letters.txt",
"goodwords": "./texts/whitelist/words.txt",
"badwords": "./texts/blacklist/garbage.txt",
"pilots": ["","","","","","","","","","","a","i","o","e","g"],
"alphabet": "abcdefghijklmnopqrstuvwxyz",
"embedding-size": 28,
"embedding": {
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}
}
$ ./asc -r-json ./spell.json
Python
import asc
# Number of threads
asc.setThreads(0)
# Allow correcting the case of letters
asc.setOption(asc.options_t.uppers)
# Allow splitting of merged words
asc.setOption(asc.options_t.ascSplit)
# Allow replacing alternative letters in words
asc.setOption(asc.options_t.ascAlter)
# Allow splitting of misspelled merged words
asc.setOption(asc.options_t.ascESplit)
# Allow joining words separated by a space
asc.setOption(asc.options_t.ascRSplit)
# Allow correcting the case of letters
asc.setOption(asc.options_t.ascUppers)
# Allow splitting of merged words with hyphens
asc.setOption(asc.options_t.ascHyphen)
# Allow removing repeated words
asc.setOption(asc.options_t.ascWordRep)
# Allow words made of mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, without pre-processing the words
asc.setOption(asc.options_t.confidence)
# Set the alphabet
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the pilot words (words consisting of a single letter)
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Set the letter substitutes
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})
# Load the whitelist of words
with open('./texts/whitelist/words.txt') as f:
    for word in f:
        asc.addGoodword(word.replace("\n", ""))
# Load the blacklist of words
with open('./texts/blacklist/garbage.txt') as f:
    for word in f:
        asc.addBadword(word.replace("\n", ""))
# Load the abbreviation suffixes
with open('./output/alm.abbr') as f:
    for word in f:
        asc.addSuffix(word.replace("\n", ""))
# Load the commonly used abbreviations
with open('./texts/abbrs/abbrs.txt') as f:
    for abbr in f:
        asc.addAbbr(abbr.replace("\n", ""))
# Load the words that are written with a capital letter
with open('./texts/words/upp/words.txt') as f:
    for word in f:
        asc.addUWord(word.replace("\n", ""))
# Add an alternative letter
asc.addAlt("", "")
# Load the words with alternative letters (tab-separated pairs)
with open('./texts/alters/yoficator.txt') as f:
    for words in f:
        words = words.replace("\n", "").split('\t')
        asc.addAlt(words[0], words[1])
# Progress callbacks
def statusArpa(status):
    print("Read arpa", status)
def statusIndex(status):
    print("Build index", status)
# Read the ARPA file
asc.readArpa("./dictionary/alm.arpa", statusArpa)
# Set the dictionary characteristics (38120 words and 13 documents)
asc.setAdCw(38120, 13)
# Set the embedding map and its size
asc.setEmbedding({
"": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
"": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
"": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
"": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
"": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
"": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
"-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
"%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
"\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
"5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
"b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
"h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
"n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
"t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)
# Build the index
asc.buildIndex(statusIndex)
# Correct the test file line by line and write the first element of each result
with open('./texts/test.txt') as f1, open('./texts/output.txt', 'w') as f2:
    for line in f1:
        res = asc.spell(line)
        f2.write("%s\n" % res[0])
P.S. For those who do not want to build or train anything at all, I have put up a web version of ASC. Keep in mind that a typo-correction system is not omniscient, and it is impossible to feed the entire Russian language into it. ASC will not correct just any text; it has to be trained separately for each subject area.