ANYKS Spell Checker




Hello, this is my third article on Habr. Earlier I wrote about the ALM language model; now I want to present ASC, a typo-correction system implemented on top of ALM.



Yes, there are a great many typo-correction systems, each with its own strengths and weaknesses. Among the open-source ones I would single out JamSpell as one of the most promising, and that is the one we will compare against. There is also a similar system from DeepPavlov that many readers may think of, but I never managed to get along with it.



Feature list:



  1. Correction of misspelled words that differ from the correct word by up to 4 in Levenshtein distance (see the sketch after this list).
  2. Correction of typos in words: insertion, deletion, substitution, and transposition of characters.
  3. Ё-fication: restoring the letter «ё» in words written with «е», taking the context into account.
  4. Restoring the case of the first letter of a word (for proper names and titles), taking the context into account.
  5. Splitting merged words into separate words, taking the context into account.
  6. Analyzing the text without modifying the original text.
  7. Searching the text for errors, typos, and incorrect context.
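
To make point 1 concrete, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance; this is the standard definition, not ASC's internal implementation:

import itertools

def levenshtein(a: str, b: str) -> int:
    # Classic edit distance: counts insertions, deletions and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# "exapmle" is 2 edits away from "example", well within ASC's limit of 4.
print(levenshtein("exapmle", "example"))  # 2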




Supported operating systems:



  • Mac OS X
  • FreeBSD
  • Linux


The system is written in C++11, and there is a port for Python3.
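
For a first impression of the Python3 port, here is a minimal usage sketch based on the calls shown later in this article (asc.loadIndex and asc.spell); the dictionary path is a placeholder for one of the files in the table below:

import asc

def status(text, status):
    print(text, status)

# Load a prebuilt binary dictionary (placeholder path)
asc.loadIndex("./dictionary/wittenbell-3-middle.asc", "", status)

# Correct one line of text; spell() returns the corrected variants
res = asc.spell("hello wrold")
print(res[0])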



Ready-made dictionaries



Name                     Size (GB)  RAM (GB)  N-gram size  Language
wittenbell-3-big.asc     1.97       15.6      3            RU
wittenbell-3-middle.asc  1.24       9.7       3            RU
mkneserney-3-middle.asc  1.33       9.7       3            RU
wittenbell-3-single.asc  0.772      5.14      3            RU
wittenbell-5-single.asc  1.37       10.7      5            RU


Testing



Data from the 2016 Dialog21 "typo correction" competition was used to test the system. A trained binary dictionary was used for the tests: wittenbell-3-middle.asc

Test performed         Precision  Recall  F-measure
Typo correction mode   76.97      62.71   69.11
Error correction mode  73.72      60.53   66.48
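
The F-measure here is the usual harmonic mean of precision and recall, which you can verify against the table:

def f_measure(precision, recall):
    # F1: harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(76.97, 62.71), 2))  # 69.11
print(round(f_measure(73.72, 60.53), 2))  # 66.48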


I don't think there is any need to add more data here; anyone who wishes can repeat the test. I attach all the materials used in the testing below.



Materials used in the tests







Now it is interesting to compare how typo-correction systems behave under equal conditions: we will train two different typo correctors on the same text data and run a test.



For comparison, let's take the typo-correction system I mentioned above, JamSpell.



ASC vs JamSpell



Installation
ASC



$ git clone --recursive https://github.com/anyks/asc.git
$ cd ./asc

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make


JamSpell



$ git clone https://github.com/bakwc/JamSpell.git
$ cd ./JamSpell

$ mkdir ./build
$ cd ./build

$ cmake ..
$ make




Training
ASC



train.json



{
  "ext": "txt",
  "size": 3,
  "alter": {"":""},
  "debug": 1,
  "threads": 0,
  "method": "train",
  "allow-unk": true,
  "reset-unk": true,
  "confidence": true,
  "interpolate": true,
  "mixed-dicts": true,
  "only-token-words": true,
  "locale": "en_US.UTF-8",
  "smoothing": "wittenbell",
  "pilots": ["","","","","","","","","","","a","i","o","e","g"],
  "corpus": "./texts/correct.txt",
  "w-bin": "./dictionary/3-middle.asc",
  "w-vocab": "./train/lm.vocab",
  "w-arpa": "./train/lm.arpa",
  "mix-restwords": "./similars/letters.txt",
  "alphabet": "abcdefghijklmnopqrstuvwxyz",
  "bin-code": "ru",
  "bin-name": "Russian",
  "bin-author": "You name",
  "bin-copyright": "You company LLC",
  "bin-contacts": "site: https://example.com, e-mail: info@example.com",
  "bin-lictype": "MIT",
  "bin-lictext": "... License text ...",
  "embedding-size": 28,
  "embedding": {
      "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
      "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
      "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
      "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
      "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
      "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
      "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
      "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
      "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
      "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
      "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
      "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
      "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
      "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
  }
}


$ ./asc -r-json ./train.json


Python3



import asc

# Set the N-gram size to 3
asc.setSize(3)
# Use the ALM v2 algorithm
asc.setAlmV2()
# Use all available processor cores
asc.setThreads(0)
# Set the environment locale (UTF-8)
asc.setLocale("en_US.UTF-8")

# Detect and restore uppercase letters
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token in the dictionary
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Allow words with letters from mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Keep only words in the N-grams
asc.setOption(asc.options_t.tokenWords)
# Load the dictionary data as-is, trusting its frequencies
asc.setOption(asc.options_t.confidence)
# Interpolate when computing frequencies
asc.setOption(asc.options_t.interpolate)

# Set the alphabet to use
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")

# Set the single-letter (pilot) words
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Set the look-alike letters from mixed alphabets
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

def statusArpa1(status):
    print("Build arpa", status)

def statusArpa2(status):
    print("Write arpa", status)

def statusVocab(status):
    print("Write vocab", status)

def statusIndex(text, status):
    print(text, status)

def status(text, status):
    print(text, status)

# Collect the text corpus using wittenbell smoothing
asc.collectCorpus("./texts/correct.txt", asc.smoothing_t.wittenBell, 0.0, False, False, status)

# Build the language model
asc.buildArpa(statusArpa1)

# Write the ARPA file
asc.writeArpa("./train/lm.arpa", statusArpa2)

# Write the vocabulary
asc.writeVocab("./train/lm.vocab", statusVocab)

# Set the dictionary metadata
asc.setCode("RU")
asc.setLictype("MIT")
asc.setName("Russian")
asc.setAuthor("You name")
asc.setCopyright("You company LLC")
asc.setLictext("... License text ...")
asc.setContacts("site: https://example.com, e-mail: info@example.com")

# Set the character embedding (characters mapped to the same value share a bucket)
asc.setEmbedding({
     "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
     "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
     "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
     "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
     "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
     "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
     "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
     "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
     "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
     "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
     "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
     "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
     "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
     "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)


JamSpell



$ ./main/jamspell train ../test_data/alphabet_ru.txt ../test_data/correct.txt ./model.bin




Testing
ASC



spell.json



{
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "spell-verbose": true,
    "confidence": true,
    "mixed-dicts": true,
    "asc-split": true,
    "asc-alter": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "asc-wordrep": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "r-bin": "./dictionary/3-middle.asc"
}


$ ./asc -r-json ./spell.json


Python3



import asc

# Use the ALM v2 algorithm
asc.setAlmV2()
# Use all available processor cores
asc.setThreads(0)

# Detect and restore uppercase letters
asc.setOption(asc.options_t.uppers)
# Split concatenated words
asc.setOption(asc.options_t.ascSplit)
# Handle words with alternative letters
asc.setOption(asc.options_t.ascAlter)
# Split words that contain errors
asc.setOption(asc.options_t.ascESplit)
# Split incorrectly merged words
asc.setOption(asc.options_t.ascRSplit)
# Correct the letter case of words
asc.setOption(asc.options_t.ascUppers)
# Handle hyphenated words
asc.setOption(asc.options_t.ascHyphen)
# Remove duplicated words
asc.setOption(asc.options_t.ascWordRep)
# Allow words with letters from mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the dictionary data as-is, trusting its frequencies
asc.setOption(asc.options_t.confidence)

def status(text, status):
    print(text, status)

# Load the binary dictionary
asc.loadIndex("./dictionary/3-middle.asc", "", status)

# Correct the test file line by line
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()


JamSpell



For JamSpell I did not use a ready-made Python wrapper for batch correction, so the test harness is written in C++:



#include <fstream>
#include <iostream>
#include <jamspell/spell_corrector.hpp>

// If BOOST conversion is enabled
#ifdef USE_BOOST_CONVERT
	#include <boost/locale/encoding_utf.hpp>
// Otherwise use the standard library
#else
	#include <codecvt>
#endif

using namespace std;

/**
 * convert Converts a wide string to a utf-8 string
 * @param  str wide string to convert
 * @return     resulting utf-8 string
 */
const string convert(const wstring & str){
	// Resulting string
	string result = "";
	// If the source string is not empty
	if(!str.empty()){
// If BOOST conversion is enabled
#ifdef USE_BOOST_CONVERT
		// Declare the conversion function
		using boost::locale::conv::utf_to_utf;
		// Convert the string to utf-8
		result = utf_to_utf <char> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
		// Converter type for UTF-8
		using convert_type = codecvt_utf8 <wchar_t, 0x10ffff, little_endian>;
		// Declare the converter
		wstring_convert <convert_type, wchar_t> conv;
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		// Convert the string to utf-8
		result = conv.to_bytes(str);
#endif
	}
	// Return the result
	return result;
}

/**
 * convert Converts a utf-8 string to a wide string
 * @param  str utf-8 string to convert
 * @return     resulting wide string
 */
const wstring convert(const string & str){
	// Resulting wide string
	wstring result = L"";
	// If the source string is not empty
	if(!str.empty()){
// If BOOST conversion is enabled
#ifdef USE_BOOST_CONVERT
		// Declare the conversion function
		using boost::locale::conv::utf_to_utf;
		// Convert the string from utf-8
		result = utf_to_utf <wchar_t> (str.c_str(), str.c_str() + str.size());
// Otherwise use the standard library
#else
		// Declare the converter
		// wstring_convert <codecvt_utf8 <wchar_t>> conv;
		wstring_convert <codecvt_utf8_utf16 <wchar_t, 0x10ffff, little_endian>> conv;
		// Convert the string from utf-8
		result = conv.from_bytes(str);
#endif
	}
	// Return the result
	return result;
}

/**
 * safeGetline Reads a line from a stream regardless of the line-ending style
 * @param  is input stream to read from
 * @param  t  string that receives the line
 * @return    the input stream
 */
istream & safeGetline(istream & is, string & t){
	// Clear the output string
	t.clear();

	istream::sentry se(is, true);
	streambuf * sb = is.rdbuf();

	for(;;){
		int c = sb->sbumpc();
		switch(c){
			case '\n': return is;
			case '\r':
				if(sb->sgetc() == '\n') sb->sbumpc();
				return is;
			case streambuf::traits_type::eof():
				if(t.empty()) is.setstate(ios::eofbit);
				return is;
			default: t += (char) c;
		}
	}
}

/**
* main Entry point
*/
int main(){
	// Spell corrector object
	NJamSpell::TSpellCorrector corrector;
	// Load the language model
	corrector.LoadLangModel("model.bin");
	// Open the file with the test text
	ifstream file1("./test_data/test.txt", ios::in);
	// If the file is open
	if(file1.is_open()){
		// Current line and the correction result
		string line = "", res = "";
		// Open the output file
		ofstream file2("./test_data/output.txt", ios::out);
		// If the output file is open
		if(file2.is_open()){
			// Read the file to the end
			while(file1.good()){
				// Read the next line
				safeGetline(file1, line);
				// If the line is not empty, correct it
				if(!line.empty()){
					// Run the correction
					res = convert(corrector.FixFragment(convert(line)));
					// If the result is not empty, write it out
					if(!res.empty()){
						// Append a line break
						res.append("\n");
						// Write the result to the output file
						file2.write(res.c_str(), res.size());
					}
				}
			}
			// Close the output file
			file2.close();
		}
		// Close the input file
		file1.close();
	}
    return 0;
}






$ g++ -std=c++11 -I../JamSpell -L./build/jamspell -L./build/contrib/cityhash -L./build/contrib/phf -ljamspell_lib -lcityhash -lphf ./test.cpp -o ./bin/test

$ ./bin/test




Results



Getting the results
$ python3 evaluate.py ./texts/test.txt ./texts/correct.txt ./texts/output.txt
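
evaluate.py ships with the test materials attached above; I am not reproducing it here, but a rough sketch of this kind of word-level scoring (an assumption about the logic, not the actual script) looks like this:

# Assumption: word-level comparison of the broken input, the reference
# text and the corrector's output; not the actual evaluate.py code.
def score(broken_lines, correct_lines, output_lines):
    tp = fp = fn = 0
    for b, c, o in zip(broken_lines, correct_lines, output_lines):
        for bw, cw, ow in zip(b.split(), c.split(), o.split()):
            if bw != cw:              # the word really contained an error
                if ow == cw: tp += 1  # fixed correctly
                else:        fn += 1  # missed or fixed incorrectly
            elif ow != cw:            # a correct word was broken
                fp += 1
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)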




ASC

Precision  Recall  F-measure
92.13      82.51   87.05


JamSpell

Precision  Recall  F-measure
77.87      63.36   69.87


One of the main features of ASC is that it learns from dirty data. It is practically impossible to find openly available text corpora free of errors and typos; no amount of manual effort is enough to fix terabytes of data by hand, yet one has to work with it somehow.



The training principle I propose:



  1. Assemble a language model from the dirty data
  2. Remove all rare words and N-grams from the assembled language model
  3. Add individual words needed for the typo-correction system to work more correctly
  4. Assemble the binary dictionary (see the command overview after this list)
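
Taken together, the pipeline reduces to the JSON-driven commands described step by step in the sections below:

$ ./alm -r-json ./collect.json   # assemble the corpus
$ ./alm -r-json ./prune.json     # prune rare words and N-grams
$ ./alm -r-json ./merge.json     # merge the collected data
$ ./alm -r-json ./train.json     # train the language model
$ ./asc -r-json ./train.json     # train the ASC spell checker (a separate train.json)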


Let's get started



Suppose we have several corpora on different subjects; the most logical approach is to train on them separately and then merge the results.



Assembling the corpus with ALM
collect.json



{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"ext": "txt",
	"method": "train",
	"allow-unk": true,
	"mixed-dicts": true,
	"only-token-words": true,
	"smoothing": "wittenbell",
	"locale": "en_US.UTF-8",
	"w-abbr": "./output/alm.abbr",
	"w-map": "./output/alm.map",
	"w-vocab": "./output/alm.vocab",
	"w-words": "./output/words.txt",
	"corpus": "./texts/corpus",
	"abbrs": "./abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./collect.json


  • size — N-gram size (3)
  • debug — print the training progress
  • threads — number of threads to use (0 = all available cores)
  • ext — file extension of the texts inside the corpus directory
  • allow-unk — allow the 〈unk〉 token in the dictionary
  • mixed-dicts — allow words with letters from mixed alphabets
  • only-token-words — keep only words in the N-grams, discarding other tokens
  • smoothing — smoothing algorithm (wittenbell)
  • locale — environment locale
  • w-abbr — output file for the collected abbreviation suffixes
  • w-map — output file for the corpus map
  • w-vocab — output file for the vocabulary
  • w-words — output file for the list of collected words (optional)
  • corpus — directory with the training texts
  • abbrs — file with abbreviations to be treated as whole words (USA, Ltd, ...)
  • goodwords — whitelist of words
  • badwords — blacklist of words
  • mix-restwords — file with look-alike letters from different alphabets
  • alphabet — alphabet to use


Python

import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available processor cores
alm.setThreads(0)
# Set the environment locale (UTF-8)
alm.setLocale("en_US.UTF-8")
# Set the alphabet to use
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the look-alike letters from mixed alphabets
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

# Allow the <unk> token in the dictionary
alm.setOption(alm.options_t.allowUnk)
# Allow words with letters from mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Keep only words in the N-grams, discarding other tokens
alm.setOption(alm.options_t.tokenWords)

# Initialize with wittenbell smoothing
alm.init(alm.smoothing_t.wittenBell)

# Load the abbreviations to be treated as whole words (USA, Ltd, ...)
f = open('./abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    alm.addAbbr(abbr)
f.close()

# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def status(text, status):
    print(text, status)

def statusWords(status):
    print("Write words", status)

def statusVocab(status):
    print("Write vocab", status)

def statusMap(status):
    print("Write map", status)

def statusSuffix(status):
    print("Write suffix", status)

# Collect the text corpus
alm.collectCorpus("./texts/corpus", status)
# Write the list of collected words
alm.writeWords("./output/words.txt", statusWords)
# Write the vocabulary
alm.writeVocab("./output/alm.vocab", statusVocab)
# Write the corpus map
alm.writeMap("./output/alm.map", statusMap)
# Write the abbreviation suffixes
alm.writeSuffix("./output/alm.abbr", statusSuffix)





Pruning the assembled corpus with ALM
prune.json



{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "vprune",
    "vprune-wltf": -15.0,
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./corpus1/alm.map",
    "r-vocab": "./corpus1/alm.vocab",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./prune.json


  • size — N-gram size (3)
  • debug — print the pruning progress
  • allow-unk — allow the 〈unk〉 token in the dictionary
  • vprune-wltf — pruning threshold: words whose weighted log score falls below it are removed
  • locale — environment locale
  • smoothing — smoothing algorithm (wittenbell)
  • r-map — corpus map to read
  • r-vocab — vocabulary to read
  • w-map — output file for the pruned corpus map
  • w-vocab — output file for the pruned vocabulary
  • goodwords — whitelist of words
  • badwords — blacklist of words
  • alphabet — alphabet to use


Python



import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available processor cores
alm.setThreads(0)
# Set the environment locale (UTF-8)
alm.setLocale("en_US.UTF-8")
# Set the alphabet to use
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")

# Allow the <unk> token in the dictionary
alm.setOption(alm.options_t.allowUnk)

# Initialize with wittenbell smoothing
alm.init(alm.smoothing_t.wittenBell)

# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

def statusPrune(status):
    print("Prune data", status)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Read the vocabulary
alm.readVocab("./corpus1/alm.vocab", statusReadVocab)
# Read the corpus map
alm.readMap("./corpus1/alm.map", statusReadMap)
# Prune the vocabulary
alm.pruneVocab(-15.0, 0, 0, statusPrune)
# Write the pruned vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the pruned corpus map
alm.writeMap("./output/alm.map", statusWriteMap)




Merging the collected data with ALM
merge.json



{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "method": "merge",
    "mixed-dicts": "true",
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-words": "./texts/words",
    "r-map": "./corpus1",
    "r-vocab": "./corpus1",
    "w-map": "./output/alm.map",
    "w-vocab": "./output/alm.vocab",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "mix-restwords": "./texts/similars/letters.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./merge.json


  • size — N-gram size (3)
  • debug — print the merge progress
  • allow-unk — allow the 〈unk〉 token in the dictionary
  • mixed-dicts — allow words with letters from mixed alphabets
  • locale — environment locale
  • smoothing — smoothing algorithm (wittenbell)
  • r-words — directory with word lists to add to the dictionary
  • r-map — directory with the corpus maps to merge
  • r-vocab — directory with the vocabularies to merge
  • w-map — output file for the merged corpus map
  • w-vocab — output file for the merged vocabulary
  • goodwords — whitelist of words
  • badwords — blacklist of words
  • alphabet — alphabet to use


Python



import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available processor cores
alm.setThreads(0)
# Set the environment locale (UTF-8)
alm.setLocale("en_US.UTF-8")
# Set the alphabet to use
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the look-alike letters from mixed alphabets
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

# Allow the <unk> token in the dictionary
alm.setOption(alm.options_t.allowUnk)
# Allow words with letters from mixed alphabets
alm.setOption(alm.options_t.mixDicts)

# Initialize with wittenbell smoothing
alm.init(alm.smoothing_t.wittenBell)

# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addGoodword(word)
f.close()

# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addBadword(word)
f.close()

# Load the individual words to add to the dictionary
f = open('./texts/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    alm.addWord(word)
f.close()

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusWriteVocab(status):
    print("Write vocab", status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusWriteMap(status):
    print("Write map", status)

# Read the vocabularies to merge
alm.readVocab("./corpus1", statusReadVocab)
# Read the corpus maps to merge
alm.readMap("./corpus1", statusReadMap)
# Write the merged vocabulary
alm.writeVocab("./output/alm.vocab", statusWriteVocab)
# Write the merged corpus map
alm.writeMap("./output/alm.map", statusWriteMap)




Training the language model with ALM
train.json



{
    "size": 3,
    "debug": 1,
    "allow-unk": true,
    "reset-unk": true,
    "interpolate": true,
    "method": "train",
    "locale": "en_US.UTF-8",
    "smoothing": "wittenbell",
    "r-map": "./output/alm.map",
    "r-vocab": "./output/alm.vocab",
    "w-arpa": "./output/alm.arpa",
    "w-words": "./output/words.txt",
    "alphabet": "abcdefghijklmnopqrstuvwxyz"
}


$ ./alm -r-json ./train.json


  • size — N-gram size (3)
  • debug — print the training progress
  • allow-unk — allow the 〈unk〉 token in the dictionary
  • reset-unk — reset the frequency of the 〈unk〉 token
  • interpolate — interpolate when computing frequencies
  • locale — environment locale
  • smoothing — smoothing algorithm (wittenbell)
  • r-map — corpus map to read
  • r-vocab — vocabulary to read
  • w-arpa — output file for the ARPA language model
  • w-words — output file for the word list (optional)
  • alphabet — alphabet to use


Python



import alm

# Set the N-gram size to 3
alm.setSize(3)
# Use all available processor cores
alm.setThreads(0)
# Set the environment locale (UTF-8)
alm.setLocale("en_US.UTF-8")
# Set the alphabet to use
alm.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the look-alike letters from mixed alphabets
alm.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

# Allow the <unk> token in the dictionary
alm.setOption(alm.options_t.allowUnk)
# Reset the frequency of the <unk> token
alm.setOption(alm.options_t.resetUnk)
# Allow words with letters from mixed alphabets
alm.setOption(alm.options_t.mixDicts)
# Interpolate when computing frequencies
alm.setOption(alm.options_t.interpolate)

# Initialize with wittenbell smoothing
alm.init(alm.smoothing_t.wittenBell)

def statusReadVocab(text, status):
    print("Read vocab", text, status)

def statusReadMap(text, status):
    print("Read map", text, status)

def statusBuildArpa(status):
    print("Build ARPA", status)

def statusWriteMap(status):
    print("Write map", status)

def statusWriteArpa(status):
    print("Write ARPA", status)

def statusWords(status):
    print("Write words", status)

# Read the vocabulary
alm.readVocab("./output/alm.vocab", statusReadVocab)
# Read the corpus map
alm.readMap("./output/alm.map", statusReadMap)

# Build the language model
alm.buildArpa(statusBuildArpa)

# Write the language model in ARPA format
alm.writeArpa("./output/alm.arpa", statusWriteArpa)

# Write the word list
alm.writeWords("./output/words.txt", statusWords)




Training the ASC spell checker
train.json



{
	"size": 3,
	"debug": 1,
	"threads": 0,
	"confidence": true,
	"mixed-dicts": true,
	"method": "train",
	"alter": {"":""},
	"locale": "en_US.UTF-8",
	"smoothing": "wittenbell",
	"pilots": ["","","","","","","","","","","a","i","o","e","g"],
	"w-bin": "./dictionary/3-single.asc",
	"r-abbr": "./output/alm.abbr",
	"r-vocab": "./output/alm.vocab",
	"r-arpa": "./output/alm.arpa",
	"abbrs": "./texts/abbrs/abbrs.txt",
	"goodwords": "./texts/whitelist/words.txt",
	"badwords": "./texts/blacklist/garbage.txt",
	"alters": "./texts/alters/yoficator.txt",
	"upwords": "./texts/words/upp",
	"mix-restwords": "./texts/similars/letters.txt",
	"alphabet": "abcdefghijklmnopqrstuvwxyz",
	"bin-code": "ru",
	"bin-name": "Russian",
	"bin-author": "You name",
	"bin-copyright": "You company LLC",
	"bin-contacts": "site: https://example.com, e-mail: info@example.com",
	"bin-lictype": "MIT",
	"bin-lictext": "... License text ...",
	"embedding-size": 28,
	"embedding": {
	    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
	    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
	    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
	    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
	    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
	    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
	    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
	    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
	    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
	    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
	    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
	    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
	    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
	    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
	}
}


$ ./asc -r-json ./train.json


  • size — N-gram size (3)
  • debug — print the training progress
  • threads — number of threads to use (0 = all available cores)
  • confidence — load the ARPA data as-is, trusting its frequencies
  • mixed-dicts — allow words with letters from mixed alphabets
  • alter — pairs of alternative letters (letters that replace one another in writing, such as «е»/«ё»)
  • locale — environment locale
  • smoothing — smoothing algorithm (wittenbell)
  • pilots — single-letter words allowed in the dictionary
  • w-bin — output file for the binary dictionary
  • r-abbr — file with the abbreviation suffixes collected earlier
  • r-vocab — vocabulary to read
  • r-arpa — ARPA language model to read
  • abbrs — file with abbreviations to be treated as whole words (USA, Ltd, ...)
  • goodwords — whitelist of words
  • badwords — blacklist of words
  • alters — file with word pairs that differ in alternative letters (the «ё» list)
  • upwords — directory with words that must always start with a capital letter (names, cities, ...)
  • mix-restwords — file with look-alike letters from different alphabets
  • alphabet — alphabet to use
  • bin-code — dictionary language code
  • bin-name — dictionary name
  • bin-author — dictionary author
  • bin-copyright — dictionary copyright holder
  • bin-contacts — author contact details
  • bin-lictype — dictionary license type
  • bin-lictext — dictionary license text
  • embedding-size — size of the embedding
  • embedding — character embedding map (characters mapped to the same value share a bucket)


Python



import asc

# Set the N-gram size to 3
asc.setSize(3)
# Use all available processor cores
asc.setThreads(0)
# Set the environment locale (UTF-8)
asc.setLocale("en_US.UTF-8")

# Detect and restore uppercase letters
asc.setOption(asc.options_t.uppers)
# Allow the <unk> token in the dictionary
asc.setOption(asc.options_t.allowUnk)
# Reset the frequency of the <unk> token
asc.setOption(asc.options_t.resetUnk)
# Allow words with letters from mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, trusting its frequencies
asc.setOption(asc.options_t.confidence)

# Set the alphabet to use
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the single-letter (pilot) words
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Set the look-alike letters from mixed alphabets
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

# Load the abbreviation suffixes collected earlier
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

# Load the abbreviations to be treated as whole words (USA, Ltd, ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

# Load the words that must always start with a capital letter (names, cities, ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

# Add a pair of alternative letters
asc.addAlt("", "")

# Load the word pairs that differ in alternative letters (the «ё» list)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusIndex(text, status):
    print(text, status)

def statusBuildIndex(status):
    print("Build index", status)

def statusArpa(status):
    print("Read arpa", status)

def statusVocab(status):
    print("Read vocab", status)

# Read the ARPA language model
asc.readArpa("./output/alm.arpa", statusArpa)
# Read the vocabulary
asc.readVocab("./output/alm.vocab", statusVocab)

# Set the dictionary language code
asc.setCode("RU")
# Set the license type
asc.setLictype("MIT")
# Set the dictionary name
asc.setName("Russian")
# Set the dictionary author
asc.setAuthor("You name")
# Set the copyright holder
asc.setCopyright("You company LLC")
# Set the license text
asc.setLictext("... License text ...")
# Set the contact details
asc.setContacts("site: https://example.com, e-mail: info@example.com")

# Set the character embedding (characters mapped to the same value share a bucket)
asc.setEmbedding({
    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

# Build the index
asc.buildIndex(statusBuildIndex)

# Save the binary dictionary
asc.saveIndex("./dictionary/3-middle.asc", "", 128, statusIndex)




I realize that not everyone will be able to train their own binary dictionary: it requires text corpora and considerable computing resources. That is why ASC can also work with a single ARPA file as its main dictionary.



Usage example
spell.json



{
    "ad": 13,
    "cw": 38120,
    "debug": 1,
    "threads": 0,
    "method": "spell",
    "alter": {"":""},
    "asc-split": true,
    "asc-alter": true,
    "confidence": true,
    "asc-esplit": true,
    "asc-rsplit": true,
    "asc-uppers": true,
    "asc-hyphen": true,
    "mixed-dicts": true,
    "asc-wordrep": true,
    "spell-verbose": true,
    "r-text": "./texts/test.txt",
    "w-text": "./texts/output.txt",
    "upwords": "./texts/words/upp",
    "r-arpa": "./dictionary/alm.arpa",
    "r-abbr": "./dictionary/alm.abbr",
    "abbrs": "./texts/abbrs/abbrs.txt",
    "alters": "./texts/alters/yoficator.txt",
    "mix-restwords": "./similars/letters.txt",
    "goodwords": "./texts/whitelist/words.txt",
    "badwords": "./texts/blacklist/garbage.txt",
    "pilots": ["","","","","","","","","","","a","i","o","e","g"],
    "alphabet": "abcdefghijklmnopqrstuvwxyz",
    "embedding-size": 28,
    "embedding": {
        "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
        "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
        "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
        "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
        "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
        "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
        "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
        "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
        "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
        "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
        "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
        "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
        "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
        "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
    }
}


$ ./asc -r-json ./spell.json


Python



import asc

# Use all available processor cores
asc.setThreads(0)
# Detect and restore uppercase letters
asc.setOption(asc.options_t.uppers)
# Split concatenated words
asc.setOption(asc.options_t.ascSplit)
# Handle words with alternative letters
asc.setOption(asc.options_t.ascAlter)
# Split words that contain errors
asc.setOption(asc.options_t.ascESplit)
# Split incorrectly merged words
asc.setOption(asc.options_t.ascRSplit)
# Correct the letter case of words
asc.setOption(asc.options_t.ascUppers)
# Handle hyphenated words
asc.setOption(asc.options_t.ascHyphen)
# Remove duplicated words
asc.setOption(asc.options_t.ascWordRep)
# Allow words with letters from mixed alphabets
asc.setOption(asc.options_t.mixDicts)
# Load the ARPA data as-is, trusting its frequencies
asc.setOption(asc.options_t.confidence)

# Set the alphabet to use
asc.setAlphabet("abcdefghijklmnopqrstuvwxyz")
# Set the single-letter (pilot) words
asc.setPilots(["","","","","","","","","","","a","i","o","e","g"])
# Set the look-alike letters from mixed alphabets
asc.setSubstitutes({'p':'','c':'','o':'','t':'','k':'','e':'','a':'','h':'','x':'','b':'','m':''})

# Load the whitelist of words
f = open('./texts/whitelist/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addGoodword(word)
f.close()

# Load the blacklist of words
f = open('./texts/blacklist/garbage.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addBadword(word)
f.close()

# Load the abbreviation suffixes collected earlier
f = open('./output/alm.abbr')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addSuffix(word)
f.close()

# Load the abbreviations to be treated as whole words (USA, Ltd, ...)
f = open('./texts/abbrs/abbrs.txt')
for abbr in f.readlines():
    abbr = abbr.replace("\n", "")
    asc.addAbbr(abbr)
f.close()

# Load the words that must always start with a capital letter (names, cities, ...)
f = open('./texts/words/upp/words.txt')
for word in f.readlines():
    word = word.replace("\n", "")
    asc.addUWord(word)
f.close()

# Add a pair of alternative letters
asc.addAlt("", "")

# Load the word pairs that differ in alternative letters (the «ё» list)
f = open('./texts/alters/yoficator.txt')
for words in f.readlines():
    words = words.replace("\n", "")
    words = words.split('\t')
    asc.addAlt(words[0], words[1])
f.close()

def statusArpa(status):
    print("Read arpa", status)

def statusIndex(status):
    print("Build index", status)

# Read the ARPA language model
asc.readArpa("./dictionary/alm.arpa", statusArpa)

# Set the dictionary statistics (cw = 38120 words across ad = 13 documents)
asc.setAdCw(38120, 13)

# Set the character embedding (characters mapped to the same value share a bucket)
asc.setEmbedding({
    "": 0, "": 1, "": 2, "": 3, "": 4, "": 5,
    "": 5, "": 6, "": 7, "": 8, "": 8, "": 9,
    "": 10, "": 11, "": 12, "": 0, "": 13, "": 14,
    "": 15, "": 16, "": 17, "": 18, "": 19, "": 20,
    "": 21, "": 21, "": 21, "": 22, "": 23, "": 22,
    "": 5, "": 24, "": 25, "<": 26, ">": 26, "~": 26,
    "-": 26, "+": 26, "=": 26, "*": 26, "/": 26, ":": 26,
    "%": 26, "|": 26, "^": 26, "&": 26, "#": 26, "'": 26,
    "\\": 26, "0": 27, "1": 27, "2": 27, "3": 27, "4": 27,
    "5": 27, "6": 27, "7": 27, "8": 27, "9": 27, "a": 0,
    "b": 2, "c": 15, "d": 4, "e": 5, "f": 18, "g": 3,
    "h": 12, "i": 8, "j": 6, "k": 9, "l": 10, "m": 11,
    "n": 12, "o": 0, "p": 14, "q": 13, "r": 14, "s": 15,
    "t": 16, "u": 24, "v": 21, "w": 22, "x": 19, "y": 17, "z": 7
}, 28)

# Build the index
asc.buildIndex(statusIndex)

# Correct the test file line by line
f1 = open('./texts/test.txt')
f2 = open('./texts/output.txt', 'w')

for line in f1.readlines():
    res = asc.spell(line)
    f2.write("%s\n" % res[0])

f2.close()
f1.close()




P.S. For those who do not want to build and train anything at all, I have put up a web version of ASC. Also keep in mind that a typo-correction system is not omniscient: it is impossible to feed the entire Russian language into it. ASC will not fix just any text; it needs to be trained separately for each subject area.


