Addition of Greek language to spaCy - GSOC 2018

NLPBuddy is a side result of the project "Addition of Greek language to spaCy" for Google Summer of Code 2018.

This page serves as a report for the Google Summer of Code 2018 Project.

The project was developed under the auspices of GFOSS - Open Technologies Alliance.

Report Contents



This section provides links to the source code, the documentation and the project timeline.
The main purpose of the section is to list all the places in which you can find work implemented during Google Summer of Code.
The Results and Deliverables sections analyze the outcome of the work in detail. However, if you want a quick tour of the whole project, this section is for you.

Note: There are two repositories and two wiki pages.
The first repo and the corresponding wiki page include everything that has to do with the addition of the Greek language to spaCy.
The second repo and the corresponding wiki page include everything that has to do with the implementation of a demo app on top of spaCy that demonstrates its capabilities and supports various features such as sentiment analysis, topic classification, etc.

Project repository and Wiki page

The project proposal was mainly about adding Greek language support to the spaCy platform.

This goal has been accomplished and the source code is provided in the following repository:



Every aspect of the process of adding the Greek language to spaCy is documented extensively in the following Wiki page:



Demo repository and Wiki page

NLPBuddy is a demo produced during the Google Summer of Code. It is built on top of spaCy and implements various interesting tasks, all of which are also supported for the Greek language.

It makes use of the first part of the Google Summer of Code project, the addition of the Greek language to spaCy, and it offers features such as syntax analysis, emotion analysis, topic classification and more.

The project repo is the following:



The corresponding Wiki page is the following:



Timeline

There is a timeline that tracks the whole Google Summer of Code work on a daily basis.

It is divided into the following sections: In progress, Done, TODO, Need test, Need improvement - Future work.

You can find the timeline here.

DISCLAIMER: Due to the huge complexity of the project, it is almost impossible to list everything that was implemented during Google Summer of Code 2018. There are over 50 Completed Tasks in the Timeline, but the list may be enriched in the near future.



Problem Statement - Project Goals

We live in the era of data. Every minute, 3.8 billion internet users produce content: more than 120 million emails, 500,000 Facebook comments and 3 million Google searches. To process that amount of data efficiently, we need to process natural language. Open source projects such as spaCy, textblob and NLTK contribute significantly in that direction and therefore need to be strengthened.

This project is about improving the quality of Natural Language Processing for the Greek language.

The project goals can be categorized as follows:

  1. Addition of the Greek language to the spaCy platform. Status: Complete

  2. Production of models for Part-Of-Speech (POS) tagging, Dependency Analysis (DEP) and Named Entity Recognition (NER), with and without word vectors. Status: Complete

  3. An open source text analysis tool (demo) in which everyone can perform common NLP tasks in 7 languages. Status: Complete

  4. Bonus goal: Usage of the addition of Greek language for sentiment analysis and other challenging NLP tasks. Status: Complete



Note: All the project goals have been achieved. In addition, many side results were produced during Google Summer of Code 2018. An analysis of the achievements (with pull requests, links to production-ready modules, etc.) follows in the next two sections.



Results - Production ready tools

Addition of Greek language support to spaCy.

The Greek language has been successfully added to spaCy, which was the most important goal of the project.

Two pull requests have been made: the first is about the initial addition of the language, and the second contains important optimizations and additions that enrich the features the Greek language class supports.

Addition of the language: You can see the first pull request here (Status: Merged)

Optimizations to the Greek language class: You can see the second pull request here (Status: Merged)

Each part of the process of integrating the Greek language into spaCy is discussed in detail in the Wiki page of the project.



Greek language models

Two models for the Greek language have been produced.

They are currently being uploaded to spaCy.
After that, you will be able to install them with the following commands:

    python3 -m spacy download el_core_web_sm
    python3 -m spacy download el_core_web_lg

For now, you can download the Greek small model tar file from here.

The Greek language models support most of the capabilities that you will find in the deliverables section. Sentence splitting, tokenization, Part of Speech tagging, syntax analysis using DEP tags, Named Entity Recognition, lexical attribute extraction, norm exceptions and stop-word lists are all included in the Greek language models. The large Greek model (el_core_web_lg) includes word vectors, so it also supports features such as similarity detection between texts.

You can find more about the production, usage and maintenance of the models in the models page of the wiki.
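To make "similarity detection between texts" a little more concrete: by default, spaCy compares two texts by averaging their token vectors and taking the cosine of the two averages. The sketch below illustrates that idea with the standard library only. The three-dimensional vectors in `TOY_VECTORS` are invented for the example and do not come from el_core_web_lg, and skipping unknown words is a simplification made here.

```python
import math

# Hypothetical 3-dimensional vectors; real models use hundreds of dimensions.
TOY_VECTORS = {
    "γάτα": [0.9, 0.1, 0.0],
    "σκύλος": [0.8, 0.2, 0.1],
    "αυτοκίνητο": [0.0, 0.9, 0.4],
    "τρέχει": [0.3, 0.3, 0.8],
}


def text_vector(tokens):
    """Average the vectors of the known tokens (unknown tokens are skipped)."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def similarity(tokens_a, tokens_b):
    return cosine(text_vector(tokens_a), text_vector(tokens_b))


animals = similarity(["γάτα", "τρέχει"], ["σκύλος", "τρέχει"])
mixed = similarity(["γάτα", "τρέχει"], ["αυτοκίνητο"])
print(animals > mixed)
```

With these toy vectors, the two animal sentences score higher than the animal and car pair, which is the behaviour one would expect from real word vectors as well.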

Some visualizations from the models usage:

Part of Speech Tagger vol1


Named Entities Recognition vol1


Part of Speech Tagger vol2


Named Entities Recognition vol2


NLPBuddy - Open Source Text Analysis Tool

NLPBuddy is an open source text analysis tool that has been developed as a demonstration of the project results.



NLPBuddy leverages spaCy's capabilities to extract as much information as possible from raw text.

Briefly, in this demo you can perform the following tasks with your text in 7 languages:

  1. Language identification (performed using the langid library).
  2. Text tokenization.
  3. Sentence splitting.
  4. Lemmatization.
  5. Part of Speech tags identification.
  6. Named Entity Recognition (Location, Person, Organization).
  7. Text summarization (uses Gensim's implementation of the TextRank algorithm).
  8. Keywords extraction.
  9. For the Greek language, there are also the following bonus features:
    • Text Categorization among the following categories: Sports, Science, World News, Greek News, Environment, Politics, Art, Health.
      The Greek classifier is built with FastText and is trained on 20,000 articles labeled with these categories.
      Accuracy reaches 90%.
    • Text subjectivity analysis.
    • Emotion analysis. It detects the main text emotion among the following emotions: Anger, Disgust, Fear, Happiness, Sadness, Surprise.
  10. Lexical attributes. Get numerals, urls and emails from the text.
  11. Noun chunks. Get noun phrases from your text, such as "the red bicycle".

The supported languages at the moment are the following: Greek, English, German, Spanish, Portuguese, French, Italian and Dutch.

Text can either be entered directly or imported from a URL. To preprocess text imported from a URL, the following libraries are used: python readability, BeautifulSoup4.

Note: All the functionalities that the demo supports (and some more) are implemented as modules so anybody can use them independently.
Those modules are discussed extensively in the deliverables section. The central idea is that this Google Summer of Code project should produce results that will later be used by people all around the world. For that reason, together with my mentor, Markos Gogoulos, we have implemented an API for the demo so anybody can access the results that it provides (see more here).



Improvements in spaCy

A side goal of the project is to improve spaCy itself. There is an open dialogue with the creators of spaCy, whom we would like to thank for their continuous support and enthusiasm.

Documentation Improvements

A pull request for documentation improvements was successfully merged.

The pull request was about a small error found in the spaCy documentation in the pseudocode provided for overriding the spaCy tokenizer.

You can see the pull request here.

Sharing awareness

I have been invited to write an article for the Explosion AI blog about the integration of the Greek language into spaCy, owing to the innovative approaches followed during Google Summer of Code 2018. The article is still being written and reviewed, so it may be published after the end of Google Summer of Code.

A link to the post will be published here when it's ready.

Innovative approaches

Some new approaches were followed in the process of integrating the Greek language into spaCy. Hopefully, these approaches will inspire similar work for other languages too.

  • The Greek language is the second language that follows a rule-based lemmatization procedure.
  • No data were available for training a NER classifier, so data had to be created. A fast procedure for annotating data with the Prodigy annotation tool is proposed for future reference. Learn more about it in the corresponding wiki page.

Deliverables

Deliverables are independent functionality submodules and/or useful resources that were produced either while integrating the Greek language into spaCy or while experimenting with the functionalities of spaCy and implementing the demo.

A list of the deliverables and a short description of each of them follows. You can find the functionality submodules in the res/modules folder of the project repo (here), serving as examples for usage.

Each of the deliverables is labelled with one of the following tags: greek-spacy-support, nlp-task, resource.

If you want to learn more, there is an individual page for each of them in the project wiki or the demo wiki.

Deliverables list:
  1. Tokenizer. greek-spacy-support

    With one of the produced Greek models, you can use this submodule to split your sentence(s) into tokens, independently of the other spaCy modules.

    Sample input: Θέλω να μου σπάσεις αυτήν την πρόταση σε κομμάτια
    Sample output: [Θέλω, να, μου, σπάσεις, αυτήν, την, πρόταση, σε, κομμάτια]

    Submodule link.

  2. Lemmatizer. greek-spacy-support

    This submodule lemmatizes sentences.

    Sample input: Τα σύμβολα του αγώνα.

    Sample output:
    Original token: Τα , Lemma: τα
    Original token: σύμβολα , Lemma: σύμβολο
    Original token: του , Lemma: του
    Original token: αγώνα , Lemma: αγώνα
    Original token: . , Lemma: .

    The Greek lemmatizer is special because it follows a rule-based approach. You can find extensive documentation about the lemmatizer in the corresponding wiki page. Submodule link.
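To give a feel for what a rule-based approach involves, here is a minimal sketch: try an exception table first, then apply suffix rules and accept a candidate only if it is a known lemma. Everything in it (the `NOUN_RULES`, `NOUN_INDEX` and `NOUN_EXCEPTIONS` tables and the `lemmatize` helper) is invented for illustration and is not the set of rules shipped with the Greek language class; the part-of-speech labels are assigned by hand.

```python
# Toy tables, invented for this example (not the project's real rules).
NOUN_INDEX = {"σύμβολο", "θάλασσα"}          # known noun lemmas
NOUN_EXCEPTIONS = {"ανθρώπου": "άνθρωπος"}   # irregular forms listed explicitly
NOUN_RULES = [("α", "ο"), ("ες", "α")]       # (old suffix, new suffix)


def lemmatize(token, pos):
    """Return a lemma for `token`, applying suffix rules to nouns only."""
    word = token.lower()
    if pos != "NOUN":
        return word
    if word in NOUN_EXCEPTIONS:
        return NOUN_EXCEPTIONS[word]
    if word in NOUN_INDEX:
        return word
    for old, new in NOUN_RULES:
        if word.endswith(old):
            candidate = word[: len(word) - len(old)] + new
            # Accept a rewritten form only if it is a known lemma.
            if candidate in NOUN_INDEX:
                return candidate
    return word


# Hand-tagged tokens of the sample sentence used for the lemmatizer.
sentence = [("Τα", "DET"), ("σύμβολα", "NOUN"), ("του", "DET"),
            ("αγώνα", "NOUN"), (".", "PUNCT")]
for text, pos in sentence:
    print(f"Original token: {text} , Lemma: {lemmatize(text, pos)}")
```

Checking candidates against an index is what keeps a rule such as "α" to "ο" from rewriting every word that happens to end in "α": in this toy setup "αγώνα" is left unchanged because "αγώνο" is not a known lemma.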

  3. Sentence Splitter. greek-spacy-support

    With one of the produced Greek models, you can use this submodule to split a Greek text into sentences, independently of the rest of the spaCy modules.

    Sample input: Αυτή είναι μια πρόταση. Αυτή είναι μια δεύτερη πρόταση. Και αυτή μια τρίτη πρόταση.

    Sample output:
    [ Αυτή είναι μια πρόταση., Αυτή είναι μια δεύτερη πρόταση., Και αυτή μια τρίτη πρόταση.]

    Submodule link.


  4. Stop words list. resource

    In computing, stop words are words which are filtered out before or after processing of natural language data. Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

    The stop-words wiki page is available here. The final list of stop words for the Greek language can be found here.



  5. Norm exceptions list. resource

    spaCy usually tries to normalise words with different spellings to a single, common spelling. This has no effect on any other token attributes, or tokenization in general, but it ensures that equivalent tokens receive similar representations. This can improve the model's predictions on words that weren't common in the training data, but are equivalent to other words – for example, "realize" and "realise", or "thx" and "thanks".

    The norm-exceptions wiki page is available here. The final list of norm exceptions for the Greek language can be found here.
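Conceptually, a norm exceptions list is a lookup table from a variant spelling to its common form. The sketch below uses the English examples from the paragraph above; the `norm` helper and the table contents are illustrative assumptions, not the Greek list produced by the project.

```python
# Variant spelling -> common spelling (entries taken from the text above).
NORM_EXCEPTIONS = {
    "realise": "realize",
    "thx": "thanks",
}


def norm(token_text):
    """Return the normalised form; fall back to the lowercased text."""
    lowered = token_text.lower()
    return NORM_EXCEPTIONS.get(lowered, lowered)


tokens = ["Thx", "I", "realise", "that", "now"]
print([norm(t) for t in tokens])
```

The original token texts are left untouched; only the derived norm changes, which is why normalisation does not affect tokenization.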



  6. Named Entities annotated dataset. resource

    No Named Entities dataset was available for the Greek language, so we had to create our own annotated dataset using Prodigy. The annotated dataset is available here.
    You can learn more about NER and Prodigy in the following links: Link 1, Link 2.

  7. Lexical Attributes Functions. greek-spacy-support

    Each token of a spaCy doc is checked against some potential attributes. In this way, URLs, numbers and other types of special tokens can be separated from the normal tokens.

    Sample input: Η ιστοσελίδα για το demo μας είναι: https://nlp.wordames.gr

    Sample output: Url: https://nlp.wordames.gr

    Submodule link.
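The idea of checking each token against potential attributes can be sketched with a few regular expressions. This is a simplified stand-in written for this report, not the code of the Greek language class: the patterns are deliberately loose, tokens are obtained by splitting on whitespace, and the `classify` and `lexical_attributes` names are hypothetical.

```python
import re

# Deliberately simple patterns; real-world URL and email matching is stricter.
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
NUM_RE = re.compile(r"^\d+(?:[.,]\d+)*$")


def classify(token):
    """Return a label for special tokens, or None for ordinary words."""
    if URL_RE.match(token):
        return "Url"
    if EMAIL_RE.match(token):
        return "Email"
    if NUM_RE.match(token):
        return "Num"
    return None


def lexical_attributes(text):
    """Collect (label, token) pairs for every special token in the text."""
    found = []
    for token in text.split():
        label = classify(token)
        if label is not None:
            found.append((label, token))
    return found


sample = "Η ιστοσελίδα για το demo μας είναι: https://nlp.wordames.gr"
for label, token in lexical_attributes(sample):
    print(f"{label}: {token}")
```

Run on the sample input above, the sketch reports the single URL in the sentence.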

  8. Part of Speech Tagger. greek-spacy-support

    With one of the produced Greek models, you can use this submodule to get part of speech tags for your tokens, independently of the other spaCy modules.

    Sample input: Η δημοκρατία είναι το πιο ανθρώπινο πολίτευμα.

    Sample output:
    Token: Η Tag: DET
    Token: δημοκρατία Tag: NOUN
    Token: είναι Tag: AUX
    Token: το Tag: DET
    Token: πιο Tag: ADV
    Token: ανθρώπινο Tag: ADJ
    Token: πολίτευμα Tag: NOUN
    Token: . Tag: PUNCT

    Visualized output using displaCy:

    For extensive documentation of the POS tagger for the Greek language, check the corresponding wiki page. Submodule link.

  9. DEP Tagger. greek-spacy-support

    With one of the produced Greek models, you can use this submodule to analyze the syntax of your text, independently of the other spaCy modules.

    • Get DEP tags.

    • Sample input: Η δημοκρατία είναι το πιο ανθρώπινο πολίτευμα.

      Sample output:
      Token:η, DEP tag: det
      Token:δημοκρατία, DEP tag: nsubj
      Token:είναι, DEP tag: cop
      Token:το, DEP tag: det
      Token:πιο, DEP tag: advmod
      Token:ανθρώπινο, DEP tag: amod
      Token:πολίτευμα, DEP tag: ROOT
      Token:., DEP tag: punct
    • Navigate/Visualize the DEP tree.


    • Sample input: Ο Κώστας αγόρασε πατάτες και τις άφησε πάνω στο ψυγείο.

      Sample output:

                        αγόρασε
         __________________|__________
         |      |     |            άφησε
         |      |     |     _________|__________
         |      |  Κώστας   |     |     |   ψυγείο
         |      |     |     |     |     |      |
      πατάτες   .     Ο    και   τις   πάνω   στο

      Visualization code source. Submodule link.
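Navigating a dependency tree boils down to following head links. The sketch below is a standard-library illustration, not the submodule's code: each token stores the index of its head, and the root points to itself (the convention spaCy uses). The `HEADS` list is a hand-written reading of the sample tree for the sentence above, and the helper names are hypothetical.

```python
TOKENS = ["Ο", "Κώστας", "αγόρασε", "πατάτες", "και", "τις",
          "άφησε", "πάνω", "στο", "ψυγείο", "."]
# HEADS[i] is the index of the head of TOKENS[i]; the root is its own head.
HEADS = [1, 2, 2, 2, 6, 6, 2, 6, 9, 6, 2]


def children_of(heads):
    """Map every token index to the indices of its direct dependents."""
    children = {i: [] for i in range(len(heads))}
    for i, head in enumerate(heads):
        if head != i:
            children[head].append(i)
    return children


def find_root(heads):
    """The root is the only token that is its own head."""
    return next(i for i, head in enumerate(heads) if head == i)


def render(index, children, depth=0):
    """Return the subtree under `index` as indented lines."""
    lines = ["  " * depth + TOKENS[index]]
    for child in children[index]:
        lines.extend(render(child, children, depth + 1))
    return lines


children = children_of(HEADS)
root = find_root(HEADS)
print("\n".join(render(root, children)))
```

The indented printout carries the same information as the drawn tree: each level of indentation is one step further from the root verb.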




  10. NER Tagger. greek-spacy-support

    Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

    The Greek language models support the following NER tags: ORG, PERSON, LOC, GPE, EVENT, PRODUCT. With one of the Greek models, you can use the NER tagger:

    Sample input: Η εταιρεία Google έχει τα γραφεία της στην Καλιφόρνια.

    Sample output:
    Entity:Google, Label:ORG
    Entity:Καλιφόρνια, Label:GPE

    Visualization using displaCy:


    For extensive documentation of the NER tagger for the Greek language, check the corresponding wiki page. Submodule link.


  11. Noun chunks. greek-spacy-support

    Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world's largest tech fund".


    Noun chunks for the Greek language are supported as of the latest pull request.

    Sample input:
    Η όμορφη ιδέα του άλλαξε την μίζερη ζωή.

    Sample output:
    όμορφη ιδέα την μίζερη ζωή

    You can view the submodule here.


  12. Sentiment Analyzer. nlp-task

    This submodule gives you a subjectivity score for your text and an emotion analysis.

    Sample input: Έχω μείνει έκπληκτος! Πώς γίνεται αυτό; Η έκπληξη είναι τόσο μεγάλη! Α, τώρα εξηγούνται όλα.

    Sample output:
    Subjectivity: 16.666666666666664%
    Main emotion: surprise. Emotion score: 33.333333333333336%

    Currently available only for the Greek language.

    Submodule link.

  13. Topic classifier. nlp-task

    This submodule is for text classification. It can categorize text into the following categories: Sports, Science, World News, Greek News, Environment, Politics, Art, Health.
    Currently available only for the Greek language.

Future Work

This section lists some suggestions for future work. Each task carries a difficulty label and some guidelines to start with, as well as a label that indicates whether the task concerns the improvement of Greek language support or the addition/improvement of a general NLP task. For more info on contributing, you can always have a look at the contribute page of the project wiki.

People