First steps towards Natural Language Processing with PowerShell

András Fördős
7 min read · Nov 19, 2018

I moved to Madrid (Spain) at the start of this month, and, apart from the usual struggles of changing countries, this introduced a new kind of challenge into my life: the Spanish language.

You see, I speak fluent Hungarian, English, German and some basic French, but quite literally, to get my daily bread in the local alimentación, none of these help, not even in the center of Madrid.

So I thought about challenging myself to learn conversational Spanish in one month, and why not expand my knowledge of the Microsoft stack at the same time? Three new topics (Spanish, NLP, PowerShell), here I come!

Theory

When facing a new language, sooner or later we have to learn a fair amount of words to build the base of our own vocabulary. While I know that my brain works very well with flashcards when I want to memorize something, I still need a source of information for those cards.

Since I have a tight time constraint, I don't want to take the scenic route for now, the way typical Hungarian language classes do: learning words grouped around topics such as foods, then hobbies, then sports, then experimental particle deposition in wind tunnels... wait, what?

There are various studies, but they agree on one thing: roughly 1000–1200 words make up almost all of our daily conversations, so I intend to learn those first. This is where natural language processing fits into the picture, and for that I need to supply a fair amount of Spanish text. Enter my savior: Project Gutenberg.

Preparing the text

After acquiring the data, it has to be loaded into PowerShell. Conveniently, the Get-Content cmdlet is able to read multiple files with just one line of code, so all the documents can be placed into a single folder before starting.
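A minimal sketch of what this load could look like, assuming the downloaded books sit as plain-text files in a hypothetical .\books folder:

```powershell
# Read every .txt file from a single folder into one long array of lines.
# The folder name and the encoding are assumptions made for this sketch.
$rawLines = Get-Content -Path .\books\*.txt -Encoding UTF8

# Join the lines back into one continuous text body for the later regex work.
$rawText = $rawLines -join "`n"
```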

Depending on the source, the raw data can contain a lot of additional noise: HTML tags and timing markers, or just page numbers and empty lines. To get a workable text body, an initial cleaning pass should be done, and the easiest way to do it is with regular expressions.

For the sake of my beginner experiment I am not looking for a generic or dynamic solution; I value my time more right now. I simply apply specific cleaning based on my knowledge of the chosen data source!
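To illustrate what such source-specific cleaning might look like, here is a rough sketch for Project Gutenberg plain-text files, which wrap the actual book between *** START OF ... *** and *** END OF ... *** marker lines; the exact marker text and the regex are my assumptions, not a general recipe:

```powershell
# Keep only what sits between the Gutenberg START and END markers (assumed format).
# (?s) lets . match across newlines; $Matches[1] holds the captured book body.
if ($rawText -match '(?s)\*\*\* START OF.*?\*\*\*(.*)\*\*\* END OF') {
    $bookText = $Matches[1]
}

# Drop empty lines and lines that are only page numbers, then glue the rest together.
$bookText = ($bookText -split "`r?`n" |
    Where-Object { $_.Trim() -and $_ -notmatch '^\s*\d+\s*$' }) -join ' '
```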

Cleaning the text

Now the data has been refined, but that does not mean it is ready for analytics: it still holds punctuation marks and special characters. And tread carefully, as this is a language-dependent topic!

The French language uses special no-break spaces together with several punctuation characters, and guillemets instead of quotation marks. In English, on the other hand, the apostrophe marks the omission of certain letters and sounds, or indicates possession.

I am applying a basic regex to remove almost everything, without exceptions. I know it introduces some inaccuracy, but based on my guesstimate the leftover orphaned word chunks will not influence my top 1000 list, not in Spanish. I may, however, be wrong.
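For reference, such a blunt remove-almost-everything replacement could look roughly like this; the character class is my own assumption and deliberately keeps the Spanish accented letters and ñ:

```powershell
# Replace everything that is not a (Spanish) letter, a digit, a hyphen or whitespace
# with a space. Digits and hyphens survive for now because of names like R2-D2.
$cleanText = $bookText -replace '[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ0-9\s-]', ' '
```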

Word segmentation

After achieving a relatively good quality of continuous text, it has to be separated into words to get all the building blocks. Mind the small steps: not into distinct words, just words, as keeping the statistical information (how often each form appears) is important! Simply splitting the text on white space will do, but only after removing the extra spaces generated by the previous replacements.

To get a different view of where you stand, I recommend trying a sort, so you can quickly validate your current output. Doing this, I discovered that I had forgotten to normalize the words: capital letters! Fix this using the ToLower() method.
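A sketch of that segmentation step, assuming the cleaned text from above; splitting on runs of whitespace also takes care of the extra spaces left behind by the replacements:

```powershell
# Split on runs of whitespace so the earlier replacements do not leave empty "words",
# then normalize casing so "Casa" and "casa" count as the same word.
$words = $cleanText -split '\s+' |
    Where-Object { $_ } |
    ForEach-Object { $_.ToLower() }

# A quick sort makes it easy to eyeball the current state of the output.
$words | Sort-Object | Select-Object -First 20
```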

Analytics

The main goal of this experiment was to determine the most used words, so the next steps are clear:

  • The words have to be grouped by their form
  • The groups have to be sorted by their number of appearances
  • The resulting collection has to be cut to keep the N most used words
  • The generated list should be saved to a file at the end

This could be done with a reporting tool (Excel, Power BI, etc.), but my second goal was to learn the basics of PowerShell, so I keep pushing forward. Validating my results shows that I need another cleaning pass, as I still have spaces and numbers left; the numbers had to stay until now because of names like “K9” or “R2-D2”.
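Put together, the whole analytics step could look something like the sketch below, including the extra cleanup round just mentioned; the cutoff of 1000 and the output file name are simply my own choices:

```powershell
# One more cleanup round: drop leftover blanks and purely numeric "words".
$words = $words | Where-Object { $_ -and $_ -notmatch '^[\d-]+$' }

# Group identical word forms, sort by frequency and keep the most used 1000.
$topWords = $words |
    Group-Object |
    Sort-Object -Property Count -Descending |
    Select-Object -First 1000 -Property Name, Count

# Save the list to a file for flashcard making.
$topWords | Export-Csv -Path .\top-spanish-words.csv -NoTypeInformation -Encoding UTF8
```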

Optimization

Although I found my results appealing, based on a quick smoke test with the Notepad++ search functionality, I had some doubts: it shouldn't be this simple. Processing a full 491-page book confirmed my doubts: my script is indeed terribly slow, and there is a lot to improve!

After sitting down and taking the time to read various blog posts and documentation on the pipeline, script blocks, aliases, foreach vs. ForEach-Object and more, I see my newbie mistakes.

That said, I will leave my script here as it is and challenge YOU to try to optimize it! If you are a beginner just like me, as a first step try to figure out how to measure the pipeline and validate which blocks are causing the traffic jam. Should the pipeline even stay, or does it have to go?
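If you need a starting point for that measurement, Measure-Command is the usual tool; this sketch, using a stand-in transformation, shows how a pipeline step can be timed against a plain foreach loop over the same data:

```powershell
# Time the pipeline version of a step...
$pipelineTime = Measure-Command {
    $viaPipeline = $words | ForEach-Object { $_.ToUpper() }
}

# ...against a classic foreach loop doing the same work.
$loopTime = Measure-Command {
    $viaLoop = foreach ($w in $words) { $w.ToUpper() }
}

"Pipeline: {0:N0} ms, foreach: {1:N0} ms" -f $pipelineTime.TotalMilliseconds, $loopTime.TotalMilliseconds
```

In my experience the plain loop tends to win on large inputs, which is exactly the kind of bottleneck worth hunting for.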

Limitations, Points of improvements

The presented approach has many trade-offs, but still provides a good starting point. Some of my concerns:

  • Stemming: vocabulary size versus word forms. Without reducing words to their roots, I could end up with a list of 1000 inflected forms built from only 200 stems and their variations. A nice touch would also be lemmatisation.
  • Part of speech: because of how the most used words are picked, I could theoretically end up without a single verb. In that case, no matter how many words I gain and how large a share of the language they cover, I would be unable to understand anything correctly, or even order a beer.
  • Domain: the words gained are extremely dependent on the source. I may end up with a list of technical terms or domain-specific words. Although some modern fiction is good for boosting vocabulary, it is not practical for my first 1000 words of a language to revolve around political intrigue (Game of Thrones).
  • Names: depending on the raw data, the text may contain some names too. It can be the name of the protagonist (especially if he or she uses illeism), or the city where the events of the book take place. However, having “Poirot”, “Groot” and “Ankh-Morpork” in your top 50 may push out some important adjectives.

Extras: typo handling during cleanup, losing names through normalization (Apple > apple), translating common abbreviations, memory consumption while processing large data sets, etc. The list can be quite long.

Conclusion

My first observation is that I should have started with PowerShell sooner. It is a very strong automation tool and a definite must in the Microsoft cloud world, and it is truly something I intend to explore further. That said, there is a reason why it is not a commonly used tool for NLP: the community has already created tons of libraries and toolkits for Python, and nobody likes to reinvent the wheel.

As I started my dive into NLP, my mind wandered into new areas with every single article and blog post I read. Without question this is a very big and complex topic, which brought me to my second observation: reinventing the wheel. After I had put together my own list of points of improvement, I read about word embeddings and text vectors, which made me rethink whether I really want to spend time in PowerShell creating a Spanish stemming algorithm. I'm not saying it will never happen, but it is not my next step for now.

And as for my challenge to learn conversational Spanish within a month: I am still optimistic, though I should probably use existing resources instead of creating new ones. This was a very good challenge for understanding the specific characters used in written Spanish, and for learning how a stemming algorithm relates to, and differs from, real Spanish grammar. New knowledge is rarely useless!

Finally, for vocabulary, there are amazing flashcard collections available online for free, though they somehow also differ in their top 1000, so be a little bit picky ;)
