Hello, Habr! Today is the final part of the series on clustering and classification of big text data with machine learning in Java. This article continues the first and second articles.
This article describes the system architecture, the algorithm, and the visual results. All the details of the theory and the algorithms can be found in the first two articles.
The system architecture can be divided into two main parts: a web application, and the software that classifies and clusters the data.
The algorithm of the machine learning software consists of 3 main parts:
natural language processing:
  tokenization;
  lemmatization;
  stop-word removal;
  word frequency;
clustering methods:
  TF-IDF;
  SVD;
  finding cluster groups;
classification methods: Aylien API.
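The first stage of the list above can be sketched in plain Java. This is only a minimal illustration, not the implementation from the earlier articles: the class name PreprocessSketch, its method names, and the tiny stop-word list are assumptions made for the example, and lemmatization is left out because it needs a dictionary or an NLP library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class PreprocessSketch {
    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and", "is", "in", "to");

    // Tokenization: lowercase the text and split on anything that is not a letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stop-word removal: keep only tokens that are not in the stop-word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Word frequency: count occurrences, keeping first-seen order.
    static Map<String, Integer> wordFrequency(List<String> tokens) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> tokens = removeStopWords(
                tokenize("The face of a robot is the face of the design space."));
        System.out.println(wordFrequency(tokens)); // prints {face=2, robot=1, design=1, space=1}
    }
}
```

In a real pipeline the lemmatization step would sit between tokenization and stop-word removal, so that the frequencies are counted per lemma rather than per surface form.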
Natural language processing
The algorithm starts by reading any text data. Since our system is an electronic library, most of the books are in PDF format. The implementation and the details of the NLP processing are described here.
Below is a comparison of the output of the lemmatization and stemming algorithms:
: 4173415 : 88547 : 82294
Result after lemmatization:
characterize, design, space, render, robot, face, alisa, kalegina, university, washington, seattle, washington, grace, schroeder, university, washington, seattle, washington, aidan, allchin, lakeside, also, il, school, seattle, washington, keara, berlin, macalester, college, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, university, washington, seattle, washington, abstract, face, critical, establish, agency, social, robot, building, expressive, mechanical, face, costly, difficult, robot, build, year, face, ren, der, screen, great, flexibility, robot, face, open, design, space, tablish, robot, character, perceive, property, despite, prevalence, robot, render, face, systematic, exploration, design, space, work, aim, fill, gap, conduct, survey, identify, robot, render, face, code, term, property, statistics
Result after stemming:
character, design, space, render, robot, face, alisa, kalegina, univers, washington, seattl, washington, grace, schroeder, univers, washington, seattl, washington, grsuwedu, aidan, allchin, lakesid, also, il, school, seattl, washington, keara, berlin, macalest, colleg, saint, paul, minnesota, kearaberlingmailcom, maya, cakmak, univers, washington, seattl, washington, abstract, face, critic, establish, agenc, social, robot, build, express, mechan, face, cost, difficult, mani, robot, built, year, face, ren, dere, screen, great, flexibl, robot, face, open, design, space, tablish, robot, charact, perceiv, properti, despit, preval, robot, render, face, systemat, explor, design, space, work, aim, fill, gap, conduct, survey, identifi, robot, render, face, code, term, properti, statist, common, pattern, observ, data, set, face, conduct, survey, understand, peopl, percep, tion, render, robot, face, identifi, impact, differ, face, featur, survey, result, indic, prefer, vari, level, realism, detail, robot, face
tf-idf . HashMap, - , - -.
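For readers who want to see how tf-idf weights can be held in a HashMap, here is a minimal sketch. It assumes the common definitions tf = (count of the term in the document) / (document length) and idf = ln(N / df), which may differ from the variant used in this project; the class name TfIdfSketch and the method tfIdf are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {
    // Returns one map per document: term -> tf-idf weight.
    // Assumed formulas: tf = count / document length, idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts inside this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / df.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("robot", "face", "robot"),
                List.of("robot", "design"));
        // "robot" occurs in every document, so its idf (and its weight) is 0.
        System.out.println(tfIdf(docs));
    }
}
```

With this definition a term that appears in every document gets a weight of zero, which is why very common words contribute nothing to the document vectors.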
-:
, , tf-idf. :
-0.0031139399383999997 0.023330604746 -1.3650204652799997E-4
-0.038380206566 0.00104373247064 0.056140327901
-0.006980774822399999 0.073057418689 -0.0035209342337999996
-0.0047152503238 0.0017397257449 0.024816828582999998
-0.005195951771999999 0.03189764447 -5.9991080912E-4
-0.008568593700999999 0.114337675179 -0.0088221197958
-0.00337365927 0.022604474721999997 -1.1457816390099999E-4
-0.03938283525 -0.0012682796482399999 0.0023486548592
-0.034341362795999995 -0.00111758118864 0.0036010404917
-0.0039026609385999994 0.0016699372352999998 0.021206653766000002
-0.0079418490394 0.003116062838 0.072380311755
-0.007021828444599999 0.0036496566028 0.07869801528199999
-0.0030219410092 0.018637386319 0.00102082843809
-0.0042041069026 0.023621439238999998 0.0022947637053
-0.0061050946438 0.00114796066823 0.018477825284
-0.0065708646563999995 0.0022944737838999996 0.035902813761
-0.037790461814 -0.0015372596281999999 0.008878823611899999
-0.13264545848599998 -0.0144908102251 -0.033606397957999995
-0.016229093174 1.41831464625E-4 0.005181988760999999
-0.024075296507999996 -8.708131965899999E-4 0.0034344653516999997
SVD .
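To give an idea of what an SVD step computes, the sketch below finds only the leading right singular vector of a matrix by power iteration on A^T A. It is an illustration under stated assumptions, not the code used in this project: a real pipeline would normally call a linear algebra library and keep several components (the rows above have three values each), and the class and method names here are hypothetical.

```java
import java.util.Arrays;

public class PowerIterationSvdSketch {
    // y = A x
    static double[] multiply(double[][] a, double[] x) {
        double[] y = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < x.length; j++) {
                y[i] += a[i][j] * x[j];
            }
        }
        return y;
    }

    // y = A^T x
    static double[] multiplyTransposed(double[][] a, double[] x) {
        double[] y = new double[a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < y.length; j++) {
                y[j] += a[i][j] * x[i];
            }
        }
        return y;
    }

    // Euclidean norm of a vector.
    static double norm(double[] x) {
        double s = 0.0;
        for (double v : x) {
            s += v * v;
        }
        return Math.sqrt(s);
    }

    // Power iteration on A^T A: repeatedly apply A^T A and renormalize.
    // Caveat: the uniform start vector fails if it is exactly orthogonal
    // to the leading singular vector; a random start avoids that.
    static double[] topRightSingularVector(double[][] a, int iterations) {
        int cols = a[0].length;
        double[] v = new double[cols];
        Arrays.fill(v, 1.0 / Math.sqrt(cols));
        for (int it = 0; it < iterations; it++) {
            double[] w = multiplyTransposed(a, multiply(a, v));
            double n = norm(w);
            if (n == 0.0) {
                break; // A maps v to zero; nothing more to refine
            }
            for (int j = 0; j < cols; j++) {
                v[j] = w[j] / n;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        double[][] a = {{3, 0}, {0, 4}};
        double[] v = topRightSingularVector(a, 100);
        // ||A v|| approximates the largest singular value (4 for this diagonal matrix).
        System.out.println(norm(multiply(a, v)));
        // A v gives the one-dimensional coordinates of each row along that component.
        System.out.println(Arrays.toString(multiply(a, v)));
    }
}
```

Applied to a tf-idf matrix, the product A v projects every document onto the first component; repeating the idea for further components yields the low-dimensional coordinates used for clustering.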
, . – , . OrientDB , OrientDB . OrientDB , , , . . .
, .
– . , , DBSCAN. . . r=0.007. 562 80.000 , . , .
max(D) ‒ , . n -
, . – , –
, . 4-. ( > nt)
N‒ - , S ‒ .
, .
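The surviving text mentions DBSCAN and a radius r = 0.007. As a point of reference, here is a minimal sketch of textbook DBSCAN; it is not necessarily the cluster-search algorithm used in this project. The class name DbscanSketch, the minPts parameter and its value, and the sample points are assumptions made for the example; only the radius 0.007 comes from the text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class DbscanSketch {
    static final int NOISE = -1;
    static final int UNVISITED = -2;

    // Returns a cluster label per point: 0, 1, 2, ... or NOISE.
    static int[] dbscan(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int cluster = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) {
                continue;
            }
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) {
                labels[i] = NOISE; // may later become a border point
                continue;
            }
            labels[i] = cluster;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int j = queue.poll();
                if (labels[j] == NOISE) {
                    labels[j] = cluster; // border point reached from a core point
                }
                if (labels[j] != UNVISITED) {
                    continue;
                }
                labels[j] = cluster;
                List<Integer> jNeighbors = regionQuery(points, j, eps);
                if (jNeighbors.size() >= minPts) {
                    queue.addAll(jNeighbors); // j is a core point: keep expanding
                }
            }
            cluster++;
        }
        return labels;
    }

    // All points within eps of points[idx], including the point itself.
    static List<Integer> regionQuery(double[][] points, int idx, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < points.length; k++) {
            if (distance(points[idx], points[k]) <= eps) {
                result.add(k);
            }
        }
        return result;
    }

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) {
            s += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] points = {
            {0, 0}, {0, 0.005}, {0.005, 0}, // first dense group
            {1, 1}, {1, 1.005},             // second dense group
            {5, 5}                          // isolated point
        };
        System.out.println(Arrays.toString(dbscan(points, 0.007, 2))); // prints [0, 0, 0, 1, 1, -1]
    }
}
```

The brute-force neighbor search is quadratic in the number of points, which is acceptable for a sketch but would need a spatial index for large collections.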
Classification method - Aylien API
Aylien API . API json , . API . 9 , . POST API:
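import com.aylien.textapi.parameters.ClassifyByTaxonomyParams;
import com.aylien.textapi.responses.TaxonomyCategory;
import com.aylien.textapi.responses.TaxonomyClassifications;
import com.orientechnologies.orient.core.sql.executor.OResult;
import com.orientechnologies.orient.core.sql.executor.OResultSet;

// Collect the text of every document in the cluster. The cluster name is bound
// as a query parameter rather than concatenated into the SQL string.
String queryText = "select DocText from documents where clusters = ?";
try (OResultSet resultSet = database.query(queryText, cluster)) {
    while (resultSet.hasNext()) {
        OResult result = resultSet.next();
        // Lowercase first so the "doctext:" label matches, then strip the record delimiters.
        String textDoc = result.toString().toLowerCase()
                .replaceAll("[<>{}|]", "").replaceAll("doctext:", "");
        keywords.add(textDoc.replaceAll("\\n", ""));
    }
}
// Classify the collected text against the IAB QAG taxonomy.
ClassifyByTaxonomyParams.Builder classifyByTaxonomybuilder = ClassifyByTaxonomyParams.newBuilder();
classifyByTaxonomybuilder.setText(keywords.toString());
classifyByTaxonomybuilder.setTaxonomy(ClassifyByTaxonomyParams.StandardTaxonomy.IAB_QAG);
TaxonomyClassifications response = client.classifyByTaxonomy(classifyByTaxonomybuilder.build());
// Store the label of each returned category for this cluster.
for (TaxonomyCategory c : response.getCategories()) {
    clusterUpdate.add(c.getLabel());
}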
GET, :
. .
. . , . . , . , :
-
- – . , . - , . Vaadin Flow:
:
, .
.
-.
, , , , -.
.
“Technology & Computing”:
:
:
, . . , , . . . . : .
, , , -, tf-idf, . , . DBSCAN . . , , . , , , , ..
, NoSQL , OrientDB, 4 NoSQL. , . OrientDB , .
Aylien API, . , 100 . , , , k-, . , .