NLP: extracting data from text with the Tomita parser

NLP: natural language processing

Most of the world's data is unstructured: it is just text in Russian or any other language. Data extracted from such text can be of particular interest to businesses, so tasks of this kind come up often. A separate area of artificial intelligence deals with this problem: natural language processing (NLP).

, , , .

, ,

, – , , , , . , , , , , :

№2 ,15 2020., 5400,00

.. 15299,00 ,

№575 , 145 17.09.2020 2020 , 18% — 5300 .

, 23, 51 01.09.2020 — 7500 .

№1-03 01.07.2020 211 2020 23000 ..(18%)

-?

- – () , (- ) . GitHub, .

-?

– ,
–
–
–

-?

- , , , . , , , .

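To run tomitaparser.exe, the following files are needed:

config.proto: the parser configuration file, passed to tomitaparser.exe as an argument;
dic.gzt: the root dictionary (gazetteer);
mygram.cxx: the grammar file;
facttypes.proto: the fact type descriptions;
kwtypes.proto: the keyword type descriptions.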

utf8 , ( ).

«dic.gzt», , .

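encoding "utf8"; // specify the file encoding

// Import the base type definitions used in this dictionary (TAuxDicArticle and others)
import "base.proto";
import "articles_base.proto";
// Article that attaches the grammar file
TAuxDicArticle "payment" {
    key = { "tomita:mygram.cxx" type=CUSTOM }
};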

. , — , . . «->» . , – . . . (Noun, Verb, Adj), (Comma, Punct, Ampersand, PlusSign) . . .

. , «+» () , . .

() , (), , . () «< >» . - . , «cxx», – «mygram.cxx». . . — , , «», «», «».

#encoding "utf8" // specify the file encoding

// The "|" operator means "or"
Rent -> '' | '' | '';
//  "" ,       0   
// The <gnc-agr[1]> label requires the words that carry it to agree in gender, number, and case
Purpose -> Rent Adj<gnc-agr[1]> Noun<gnc-agr[1]>;

. , , , . , , .

// StreetW lists the full street-type words, StreetAbbr lists their abbreviations
StreetW -> '' | '' | '' | '';
StreetAbbr -> '' | '' | '' | '-' | '';

// StreetDescr matches either alternative: StreetW or StreetAbbr
StreetDescr -> StreetW | StreetAbbr;
StreetNameNoun -> (Adj<gnc-agr[1]>) Word<gnc-agr[1], rt> (Word<gram="">);
StreetNameAdj -> Adj<h-reg1> Adj*;

«StreetNameNoun» , . , , «<rt>». , , . , , . . , , .. , «()». «StreetNameAdj» , . . «<h-reg1>». , «*». , .

Address -> StreetDescr StreetNameNoun<gram="", h-reg1>;
Address -> StreetDescr StreetNameNoun<gram="", h-reg1>;

Address -> StreetNameAdj<gnc-agr[1]> StreetW<gnc-agr[1]>;
Address -> StreetNameAdj StreetAbbr;

. , . , . . , . :

// Add this article to «dic.gzt»
TAuxDicArticle "month" {
    key = { "" | "" | "" | "" | "" | "" | "" | "" | "" | "" | "" | "" }
};

Month -> Noun<kwtype="month">;
Year -> AnyWord<wff=/[1-2]?[0-9]{1,3}?\.?/>;

Period -> Month Year;
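
To see which tokens the pattern in the Year rule accepts, the regular expression can be tried outside Tomita. The following is a minimal Python sketch, not part of the parser itself. It assumes that the wff pattern has to match the whole token, and the helper name looks_like_year is hypothetical.

```python
import re

# Same pattern as in the Year rule; fullmatch() is used on the assumption
# that Tomita applies wff to the whole token.
YEAR_RE = re.compile(r"[1-2]?[0-9]{1,3}?\.?")


def looks_like_year(token: str) -> bool:
    """Return True if the whole token matches the Year pattern."""
    return YEAR_RE.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["2020", "2020.", "999", "3000", "12345", "abc"]:
        print(token, looks_like_year(token))
```

Note that the pattern also accepts short numbers such as 15, so on its own it does not distinguish a year from, say, a house number. In the grammar it is the preceding Month in the Period rule that narrows the match.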

«kwtype» , «month» , 0 2999 «» «.» . , . «Result» :

Result -> Purpose AnyWord* Address AnyWord* Period;
Result -> Purpose AnyWord* Address;
Result -> Purpose;

«AnyWord» «*» , 0 . : , . : , .

. – «facttypes.proto» «dic.gzt» (, - , ).

import "facttypes.proto"; // add this import to «dic.gzt»

«facttypes.proto» «Payment» (): , . :

// Import the base type definitions
import "base.proto";
import "facttypes_base.proto";

message Payment: NFactType.TFact {
    required string Purpose = 1;
    optional string Address = 2;
    optional string Period = 3;
};

«Payment» «NFactType.TFact», «required» «optional» , . , , «interp» , . , .

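// The «Purpose» nonterminal is written to the «Purpose» field of the «Payment» fact
// The «Address» nonterminal is written to the «Address» field of the «Payment» fact
// The «Period» nonterminal is written to the «Period» field of the «Payment» fact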
Result -> Purpose interp(Payment.Purpose) AnyWord* Address interp(Payment.Address) AnyWord* Period interp(Payment.Period);
Result -> Purpose interp(Payment.Purpose) AnyWord* Address interp(Payment.Address);
Result -> Purpose interp(Payment.Purpose);

, , , .

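encoding "utf8"; // specify the file encoding

TTextMinerConfig {
    // Path to the root dictionary
    Dictionary = "dic.gzt";
    // Input file with the source text
    Input = {File = "input.txt"}
    // Output file and its format
    Output = {File = "output.txt"
            Format = text}
    // Articles (grammars) to run
    Articles = [
        { Name = "payment" }
        ]
    // Facts to extract
    Facts = [
        { Name = "Payment" }
        ]
    // HTML file with a human-readable debug report
    PrettyOutput = "pretty.html"
}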

> tomitaparser.exe config.proto

We placed the source text from the beginning of the article in the file "input.txt". After the run, the parser wrote the result to the file "output.txt":

     № 2      , 15   2020 . ,   5400,00  
    Payment
    {
        Purpose =   
        Address =   
        Period =  2020
    }
     . .      15299,00  ,   
    Payment
    {
        Purpose =  
    }
    № 575      , 145  17.09.2020   2020  ,   18% - 5300  . 
    Payment
    {
        Purpose =  
        Address =   
        Period =  2020
    }
          , 23 ,  51  01.09.2020 - 7500  . 
    Payment
    {
        Purpose =   
        Address =  
    }
     № 1-03  01.07.2020       211   2020  23000  .. ( 18% ) 
    Payment
    {
        Purpose =  
        Address =  
        Period =  2020
    }
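
Output in the text format is easy to read but awkward to feed into other programs. As an illustration, here is a minimal Python sketch that turns such output into a list of dictionaries. It assumes the layout shown above: a line with the fact name, a line with "{", then "Field = value" lines, then "}". The function name parse_facts and the sample values are hypothetical and not part of Tomita.

```python
import re
from typing import Dict, List, Optional

# "Field = value" line inside a fact block; \w also matches non-ASCII letters.
FIELD_RE = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_facts(text: str, fact_name: str = "Payment") -> List[Dict[str, str]]:
    """Collect every `fact_name { ... }` block into a dict of its fields."""
    facts: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    expect_open = False  # True right after a line holding only the fact name
    for line in text.splitlines():
        stripped = line.strip()
        if current is None:
            if expect_open and stripped == "{":
                current = {}
                expect_open = False
            else:
                # Sentence lines are skipped; only the fact name arms the parser.
                expect_open = stripped == fact_name
            continue
        if stripped == "}":
            facts.append(current)
            current = None
            continue
        match = FIELD_RE.match(line)
        if match:
            current[match.group(1)] = match.group(2)
    return facts


if __name__ == "__main__":
    # Hypothetical sample in the same layout as output.txt.
    sample = """
    some sentence from the input
    Payment
    {
        Purpose = rent payment
        Address = Lenina street
        Period = 2020
    }
    another sentence
    Payment
    {
        Purpose = rent
    }
    """
    for fact in parse_facts(sample):
        print(fact)
```

Fields declared optional in «facttypes.proto» are simply absent from the dictionary when the parser did not fill them, so callers can use fact.get("Address") rather than indexing directly.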

Extracting data from natural language remains a fairly non-trivial task in IT to this day. Now we have one more tool at our disposal. As you can see, creating your first grammar can be quite easy once you spend a little time learning. Detailed and thorough documentation is provided for Tomita. However, the quality of the extracted facts depends heavily on the developer and on their knowledge of the subject domain.
