Localization Academy

How To Create An Advanced Filter For Any Text File In memoQ

Is memoQ your favorite CAT tool? In this tutorial, you’ll learn how to create advanced text filters in memoQ with our instructor Carlos.


Carlos García Gómez 

Hi, everyone, and welcome to the second part of getting started with memoQ, which is the first CAT tool that we are seeing in this video series. In this video, you’re going to learn how you can use the advanced text filter in memoQ in order to process any text file, no matter what the structure is. So let’s get started. Alright, so for this video, we’re going to work with this TXT file. As you can see, the file extension is .txt, so it’s parsing.txt, and this is the structure of the file that we’re going to work with. The idea behind this video is to see how we can extract the strings for translation using memoQ.

Now, what we’re going to do first is to go to memoQ, and we’re going to work inside the project that we created in part one. So the project was my project; we can double-click on it. And if we go to Translations, we’re going to see the XML file that we processed in part one. Okay. So what we’re going to do first is to go to memoQ, and then, under Resources, open the Resource Console and go to Filter configurations. This is the place where we’re going to work in terms of parsers, or filters. Remember that these three filters are the ones that we created in part one. So now what we’re going to do is to create our text filter. We click on Create new, and we give it a name, which is going to be very similar to the naming convention that we used before. So we can say NS txt. And now, under Filter, we’re going to specify Regex text filter. We can leave the description empty, as usual, and click on OK. And we will see that we have the NS txt filter, and on the right-hand side we can see that it’s a regex text filter. Now we’re going to right-click and go to Edit, and here we’re going to see all of the settings for this filter.

In the first tab, which is General, we’re going to see a few things about the code page and new lines. Okay, and we’re going to leave this empty; it’s basically about the encoding and the new line separators. Down here, we’re going to see the reference files. And what we’re going to do is to add the file, so we specify the TXT, and later on, under Preview, we’re going to see how we can use this part. Now, if we go to Paragraph, this is going to be the main part where we’re going to work. Alright, so we’re going to see first why we need the paragraph rules. And then down here, we’re going to work with the content group, context, and comments. But for now, let’s go back to Notepad++, and we’re going to dive into the structure of this file. Okay.

So if we were to extract the strings for translation from this file, we would easily see that the strings for translation are not going to be, for example, set; set is not going to be translatable. And the same happens with header, homepageTitle, homepageSubtitle, because, as you can see, we have camel case here, and each of them is a single word. Okay, so this is going to be used as an identifier, which is not going to be translatable. What we find inside the quotes, this part, is going to be the strings for translation. So we want all of these parts, and we’re going to exclude the rest of the content from translation. However, on the right-hand side, we can see developer comment 1, developer comment 2, developer comment 3, 4, 5, 6, etc. And that’s going to be used as a comment. The good thing with memoQ is that when we extract the strings for translation, we’re going to use the content group; the content group is going to be the strings for translation. But we can also specify if we have any context, or if we have any comments. So what we are going to do with this parser is to extract these identifiers as the context. Alright, and we’re going to use these comments at the end of each line as the comments for each string. They’re not going to be included for translation, but they will be used as a reference.

So what we can do here, for paragraph rules, is to add a regular expression which is going to capture all of the lines. Okay, so we need a regular expression capturing this line, and the same regular expression is going to capture this line, this line, this line, etc. But we’re going to use groups. So, with groups, we’re going to specify whether the first group, for example, is the context, the content, or the comment, and the same for the second and third groups. So we’re going to use three groups in total. Now, what we can do is to first try a regular expression using Notepad++. Otherwise, it’s going to be a bit more difficult with memoQ, although, as we will see later on, once we have a regular expression here, under Preview we’re going to load this file that we use for reference, okay, and we will see the same thing. But down here, we’re going to see that right now all of the content is going to be imported for translation.

So all of the lines, as you can see here, are marked as imported with this color, because we haven’t added any regular expressions yet. But the good thing with this Preview tab is that we will see what is imported for translation. And we’re also going to check what the comment is; for now, we don’t have any comments in this preview. And we will see how the context is also imported. So let’s go back to Notepad++. First, with Ctrl+F, make sure that we have the regular expression mode activated, and we’re going to use regex in order to capture each line. Okay. And what we can do is to make it very verbose. So we need to specify first that this is going to be the content for translation, this is going to be the context, and this is going to be the comment. So let’s start with the beginning. All of the lines start with the hash, and then set, and then we have the parentheses, or round brackets. So we can use the hash here, then set; all of this is going to be taken as literal characters, no regex here.

And now we’re going to add the round bracket here, the opening round bracket, but the round bracket is a special character, so we need to escape it, because we need to capture a literal character, as we can see in here, okay. And in order to capture a literal character when it’s a special character, we need to escape it with a backslash before it. The same happens with the dollar symbol. All of the lines have a dollar symbol before the identifier, or the context. So we are going to capture the dollar symbol, but again, it’s a special character, so we’re going to add another backslash before it. So up to this point, we have captured this part. And in fact, we can use Find Next, and we will see what we have captured. Okay, easy for now. Now what we’re going to capture is the identifiers.

And as you can see, all of the identifiers have just letters in here, but we could also have some digits: 1, 2, 3, and 4. So what we can do is to use the shorthand character class, which is backslash w, that means any letter, any digit, or any underscore. And we need to use a quantifier, the plus, which means one or more times. We click on Find Next. And this is the good thing about using Notepad++, because you can build your regular expressions from the ground up, check how your regex is doing, and then continue. Okay, so as we can see, we are capturing all of these parts until the identifier is finished. Next up, we can use a whitespace. Okay, so in here we have a whitespace, then the equal symbol, and then another whitespace. But my idea in general when using whitespace is that we’re not going to force a single space to be in here, because sometimes we might have two spaces, or even none of them.
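
The same build-and-check step can be tried outside Notepad++. What follows is only a minimal sketch in Python: the sample line is an assumption reconstructed from what is described on screen, and Python's re module may differ in small details from the regex engines in Notepad++ and memoQ.

```python
import re

# Hypothetical sample line, reconstructed from the description in the video.
line = '#set($header = "Localization Academy") // developer comment 1'

# Literal "#set", an escaped "(" and "$", then \w+ for the identifier.
partial = re.compile(r'#set\(\$\w+')

match = partial.search(line)
matched_so_far = match.group(0)
print(matched_so_far)  # shows how far the expression reaches at this stage
```

Printing the partial match plays the role of Find Next: it shows where the expression stops before the next piece is added.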

Alright, so what we can do, and let’s actually undo this part, is to use the shorthand character class backslash s for a whitespace. But it’s going to be found zero or more times with this quantifier, which is the asterisk; this means zero or more times. Then what we’re going to see, and we’re going to copy this part, is that we need to have an equal symbol. This is the equal symbol that we have in all of the lines. And I paste it, because it’s the same concept: any whitespace, zero or more times, after the equal symbol for any line. We can click on Find Next, and we can make sure that everything is working fine, as you can see in here. So next, what we’re going to capture is the string for translation. So first we need a quote, and we’re going to end with another quote.
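
To illustrate why zero or more whitespace is more forgiving than a single hard-coded space, here is a small Python sketch. The three spacing variants are made-up test inputs, not lines from the actual file.

```python
import re

# Tolerant version: any amount of whitespace (including none) around "=".
tolerant = re.compile(r'#set\(\$\w+\s*=\s*')
# Strict version: exactly one space on each side of "=".
strict = re.compile(r'#set\(\$\w+ = ')

# Hypothetical spacing variants of the start of a line.
variants = [
    '#set($header = "',
    '#set($header="',
    '#set($header   =  "',
]

tolerant_hits = [tolerant.match(v) is not None for v in variants]
strict_hits = [strict.match(v) is not None for v in variants]
print(tolerant_hits, strict_hits)
```

Only the tolerant pattern accepts all three variants, which is the behavior wanted if the file's spacing is not perfectly consistent.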

And inside we can have any character, okay, because we don’t actually have any restrictions when it comes to the characters. We can have letters, spaces, less-than characters, greater-than characters, curly brackets, periods, etc. So what we can do is to use the period, which in regular expressions means any character, no matter what the character is. We’re going to use a quantifier, the plus symbol, which is one or more times. However, we’re going to also use the question mark in order to make it lazy. Okay, we wouldn’t like to capture any character until the end of the line, so we need to make it not greedy, so that it stops when it finds a quote, which is going to be the end of the string. So this quote, this quote, this quote, etc. But after the quote, we have the closing round bracket, and again, the closing round bracket is a special character, so we need to escape it.
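
The difference between the greedy and the lazy quantifier is easiest to see on a line that has more than one pair of quotes. The sketch below uses a hypothetical line with quotes inside the comment, chosen only to make the effect visible.

```python
import re

# Hypothetical line: the comment also contains quotes.
line = '#set($header = "Localization Academy") // see "header" notes'

# Greedy: .+ runs to the last quote on the line.
greedy = re.search(r'"(.+)"', line).group(1)
# Lazy: .+? stops at the first quote that lets the match succeed.
lazy = re.search(r'"(.+?)"', line).group(1)

print(greedy)
print(lazy)
```

With the greedy form, part of the comment leaks into the captured string; the lazy form stops at the end of the translatable text.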

Okay, so for now, let’s hit Find Next and see that everything is working fine. Okay, we can see that everything is working fine for now. We’re not using groups yet, okay, but we’re going to use groups later on. For now, let’s just capture the whole lines. So now, again, we have captured this part. And now we’re going to work on the comment as such. Right, so we have a space here, which is after the round bracket. The same happens here, here, sorry, here, here, etc. We’re not going to force it to have only one space; what we can do is to use the same concept. So backslash s for any whitespace, zero or more times, and then we’re going to have two forward slashes, as we have in the structure of this file for the comment.

And we’re going to use the same concept, because after the two forward slashes, we have a space, space, space, etc. Okay, so we can have zero or more whitespaces. And actually, let’s hit Find Next to make sure that we’re doing the job correctly. Okay, as you can see, it’s all good. And now we have developer comment 1, 2, 3, 4, 5, etc. But obviously, we’re not going to force it to be developer comment and a number, because we could have a real comment here as such. So what we can do is to use the period, or the dot, one or more times, which means any character one or more times. Here, we don’t actually need to use the question mark to make it lazy, because we want to capture the remainder of the string. Okay, so from this point on, we need anything that is left in this line, and the same for any line. So that’s the reason why we don’t actually need the question mark, because we don’t want to make it lazy. Okay, so if we hit Find Next, we’re going to see that now we’re capturing the full line. Okay, we’re capturing the full line here, as you can see, for all of these lines, so that’s good.
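
Put together, the expression built so far (still without groups) can be checked against a few lines at once. This is a sketch under assumptions: the sample lines are invented to follow the structure described above, and the third one is deliberately different so that it does not match.

```python
import re

# Full-line expression as assembled step by step, no capturing groups yet.
FULL_LINE = re.compile(r'#set\(\$\w+\s*=\s*".+?"\)\s*//\s*.+')

# Hypothetical lines following the described structure.
sample = [
    '#set($header = "Localization Academy") // developer comment 1',
    '#set($bottomRight="For companies")//developer comment 5',
    '## a line with a different structure',
]

# fullmatch requires the expression to cover the whole line.
matched = [FULL_LINE.fullmatch(s) is not None for s in sample]
print(matched)
```

The second line has no spaces around the equal symbol or the slashes and is still accepted, which is the point of the zero-or-more whitespace pieces.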

Now, the next thing, once we have built the regular expression, is to use groups in order to capture the context, which is for reference and not for translation, and where the comment is in each line. So what we’re going to do here is to first capture the context. Okay, so the context is the first part that we can see in each line. So in this first line, the context is header. So what we need to do is to enclose the context in round brackets. Okay, so the context comes just after the dollar symbol, as we can see in all of the lines. So this is the dollar symbol, and after it we need to have a round bracket, an opening round bracket.

And in this case, we’re not going to escape it, because we don’t want to capture a literal round bracket; the round bracket is going to be used for a capturing group. So we have the opening round bracket, and then the closing round bracket should be here, because we have any letter, any digit, or any underscore, one or more times, and this is going to be the group enclosed in round brackets. That’s going to be the first group, and this is going to be the context, because it’s the identifier. Now let’s capture in a group the string for translation, which is found in quotes, as we can see in here, okay. So the quotes are here and here, and anything that is found inside, excluding the quotes, obviously, is going to be the string for translation. Okay, so any character one or more times, and we keep it lazy, just in case it captures the whole line. Okay, we’re excluding the quotes because we don’t want the quotes to be part of the string imported for translation. That’s going to be the second group. And now the third group is going to be the comment. Okay, in this case it says developer comment, but we could have any real comment. That’s the reason why, at the end of this regular expression, we have the third group. So we can have any character one or more times until the end of the line. Okay, so we have one group here, two groups here.

And three groups here. Let’s actually click on Find Next to make sure that we didn’t miss anything. Okay, we didn’t mess up anything when we added the groups, and it’s working fine. Now, let’s copy this regular expression and go back to memoQ. If we go to Paragraph, here in the rule box we’re going to paste the rule, and we’re going to click on Add. We will see that the regular expression has been added here. And now, once we select the paragraph rule, this part is enabled. So for the content group, we have the dollar symbol here, and we need to add a number. Okay, we need to see the number of the group where we have the content. And we said that the second group, which is this one, is going to capture the content, or the strings for translation, so we need to enter number two. The context is going to be number one, and make sure that you add the dollar symbol; that’s the way that we back-reference the groups. Alright, so dollar symbol two is the second group, and dollar symbol one is the first group. That’s the context, because remember that here we have the identifier, or the context, that we need to import for reference, not for translation. And lastly, the comment. The comment is group number three, so dollar symbol three. Here, we could even add the length for this part, but we’re not going to add any kind of length restrictions. Alright, so content group number two, context number one, and comment number three. We need to add it, and we’re going to see it right here. Now, if we click on Preview, we’re going to see some differences from before. Okay, so the first thing that we’re going to see very easily is that the imported text, with this color, is going to be imported nicely. Okay, we excluded the quotes and everything else. And with this color, we’re going to see what exactly is going to be imported for translation.
Now, in italics down here, we’re going to see the context, or the identifier. So we can see header, homepageTitle, homepageSubtitle, bottomLeft, bottomRight, etc. That’s going to be the context.
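
The mapping of groups to context, content, and comment can be sketched in Python as follows. This is an illustration only, not how memoQ works internally: parse_line is a hypothetical helper, the sample line is an assumption based on the strings mentioned in the video, and memoQ's $1, $2, $3 correspond here to group(1), group(2), group(3).

```python
import re

# Same expression as before, now with three capturing groups:
# group 1 = context (identifier), group 2 = content, group 3 = comment.
PARAGRAPH_RULE = re.compile(r'#set\(\$(\w+)\s*=\s*"(.+?)"\)\s*//\s*(.+)')

def parse_line(line):
    """Hypothetical helper: split one line into context, content, comment."""
    m = PARAGRAPH_RULE.fullmatch(line)
    if m is None:
        return None  # the line does not follow the expected structure
    return {
        "context": m.group(1),
        "content": m.group(2),
        "comment": m.group(3),
    }

# Hypothetical sample line.
entry = parse_line('#set($bottomRight = "For companies") // developer comment 5')
print(entry)
```

Only the content value would go to translation; the other two values travel with the segment as reference information.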

And lastly, the comment, here in bold. We can see that the last part of each line is developer comment 1, developer comment 2, developer comment 3, 4, 5, etc. Or actually, any comment, any kind of text that we have right after the two forward slashes, at the very end of each line. We can see very easily how it’s going to be imported before even processing the file as such. And lastly, there is the Include/exclude tab. We’re not going to use this tab, but just for reference, we could have rules to define content to be excluded. So if you have any external tag, that could be used, you know, as an exclusion, and in this case we have, for example, the b tags. But these b tags could actually be inside the string. Okay, so imagine that we have a b tag applied to only the word courses. If we used that functionality in this file, that would mean that we would have one segment, localization, another segment, courses, and then the rest of the segment. So it would mess up the segmentation. That’s only for exclusion tags, and we don’t have any in here. We also have the option to define imported content, so that nothing else is imported.

So we could have any regular expression here, adding, you know, just the content to be imported. But again, we’re not going to use this tab, because we have applied all of the rules using the content, context, and comment groups. So we have the preview here, and we can click on OK. Alright, so we have modified the filter for this text file. Now the next thing that we’re going to do is to modify the regex tagger that we created in the previous part. Alright, so remember that the regex tagger is going to be used to process the embedded content. And here we have some embedded content, for example, the b tag as the opening tag and the b tag as the closing tag. We also have some placeholders, for example, student.level inside the curly brackets here. Okay, and we don’t have any more, but we need to protect that part.

Now, the thing is that we can use the same regex tagger; we don’t actually need to create a new one. So what we’re going to do is to right-click and go to Edit. And here we can see the rules that we created in the first part for memoQ, which was for the XML file. However, in here, what we’re going to do is to add a new rule that is going to capture any of the HTML tags we can see here. Okay, so, for example, what we are going to do is to use these characters, which are going to be taken as literal characters: the less-than and the greater-than. And inside we could actually have any character. So in this case, the b tag, and here the b tag too, but with a forward slash. Something that we can do very easily with regular expressions is to capture any character.

So in here, this is a character class with square brackets, and we could capture any character that is not a greater-than character. Okay, because in the end, we want to capture the less-than character, then any character that is not a greater-than character, one or more times, and then the greater-than character at the very end of the expression. Okay, this could be used, and we could even use the Empty type, which is going to be taken as a placeholder, or, if we needed to, we could even use the Opening type, but then we need the closing tag. Okay, that’s why we’re not going to hardcode it to have the b tag, then, for example, the italic tag, or any kind of HTML tag, one by one. What we can do is just to use the Empty tag, and this regular expression is going to capture all of the HTML tags, no matter which tag we’re capturing, opening tag or closing tag. So this is going to be Empty, and then Required, and we’re going to add it. Just make sure that you click on Add instead of Change.
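
The tag rule described here, a less-than character, then one or more characters that are not a greater-than character, then a greater-than character, can be tried in Python as a quick check. The sample string is an assumption built around the word courses mentioned earlier.

```python
import re

# "<", then one or more characters that are not ">", then ">".
HTML_TAG = re.compile(r'<[^>]+>')

# Hypothetical segment with a bold tag around one word.
segment = 'Localization <b>courses</b> for everyone'

tags = HTML_TAG.findall(segment)
print(tags)
```

A single rule finds both the opening and the closing tag, so there is no need to list tag names one by one.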

Otherwise, you’re going to change the selected regular expression, or rule, here. And lastly, what we’re going to do is to take care of student.level, which is enclosed in curly brackets. And here we have one rule from the XML file for content inside curly brackets. But this is only capturing any letter, any digit, or any underscore, one or more times. Now what we’re going to do is to slightly modify this expression in order to have it like this. So, using square brackets, we’re going to add a character class. And inside this character class, we could have, again, any letter, any digit, or any underscore, so this is not going to change the functionality of what we had before for part one with the XML. But we could also have a period, because in this case we can also have periods inside. That’s the reason why we can have any digits, any letters, any underscores, or even a period. We don’t need to escape the period inside the square brackets; it’s going to be taken as a literal period.
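
The effect of adding the period to the character class can be compared side by side. In this sketch, the exact form of the part-one rule is an assumption inferred from the description (curly brackets around letters, digits, or underscores), and the test string with {name} is made up.

```python
import re

# Assumed form of the part-one rule: only \w characters inside the braces.
old_rule = re.compile(r'\{\w+\}')
# Modified rule: a character class that also allows a literal period.
new_rule = re.compile(r'\{[\w.]+\}')

# Hypothetical segment with one dotted and one plain placeholder.
segment = 'Level {student.level} for {name}'

old_hits = old_rule.findall(segment)
new_hits = new_rule.findall(segment)
print(old_hits, new_hits)
```

The new rule still catches the plain placeholder, so the part-one behavior is kept, and it now also catches the dotted one.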

And what we can do here is to click on Change, because we are changing the previous rule that we created in part one. So we change it here, and we can see that it has changed. Now what we can do is to copy and paste this part in here, and we’re going to see what it is capturing. Alright, and remember that with this kind of functionality, memoQ is adding this tag at the beginning and this tag at the end; okay, this means that it’s going to be captured with these regular expressions. As we can see, it’s capturing the b tag, or the opening tag, the closing b tag, and then this placeholder, student.level. Okay, that’s it. So we can click on OK, and the NS regex tagger has been updated. Now, what we’re going to do lastly for this project, or for this file, let’s say, is to create a cascading filter. So, the same as we did in part one for XML with the XML filter and the regex tagger, we’re going to create a new one.

So we create a new cascading filter, which is going to be called NS txt with regex tagger. And here, for the first filter, we’re going to use the Regex text filter, which should be here; Regex text filter is going to be the type of filter, here on the right-hand side. And then the name is going to be NS txt. So for the first filter configuration, we select NS txt, which is the only one that we have. And that’s the main filter, because that’s the TXT filter specifying the structure of the file. And now, for the second filter, we’re going to specify the Regex tagger, and the filter configuration is going to be the NS regex tagger that we just modified. We can click on OK. And this is the NS txt with regex tagger. We can even click on Edit, and we will see that it’s the regex text filter with the regex tagger that we can see here. Okay, that’s all. So we have the cascading filter for TXT, and we can click on Close. And now we can go back to memoQ. Inside the same project, we’re going to have both the XML that we processed before and a new file for the TXT. We can go to Translations, and here right-click and choose Import with options. We’re going to specify parsing.txt. And here we can specify the filter and configuration.
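
Conceptually, a cascading filter runs two steps in a row: the first extracts the content from each line, and the second marks tags and placeholders inside that content. The Python sketch below only models that idea; the cascade function, the sample line, and the ("text" / "tag") labels are all hypothetical and are not memoQ's actual implementation.

```python
import re

# Step 1: the paragraph rule (group 2 is the content for translation).
LINE_RULE = re.compile(r'#set\(\$(\w+)\s*=\s*"(.+?)"\)\s*//\s*(.+)')
# Step 2: tagger rules for HTML tags and {placeholders}, in one group
# so that re.split keeps the matched pieces in its result.
TAG_RULE = re.compile(r'(<[^>]+>|\{[\w.]+\})')

def cascade(line):
    """Hypothetical model: extract the content, then label its pieces."""
    m = LINE_RULE.fullmatch(line)
    if m is None:
        return None
    content = m.group(2)
    # Drop the empty strings that re.split produces between adjacent tags.
    pieces = [p for p in TAG_RULE.split(content) if p]
    return [("tag" if TAG_RULE.fullmatch(p) else "text", p) for p in pieces]

# Hypothetical sample line.
line = '#set($level = "Your level: <b>{student.level}</b>") // developer comment 3'
tokens = cascade(line)
print(tokens)
```

The order matters: the tagger only ever sees what the first step extracted, so identifiers and comments are never scanned for tags.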

So in particular, we’re going to select the TXT, and we’re going to look for the cascading filter, so NS txt with regex tagger. We specify this filter to make sure that we are processing the TXT file with the parser, or the filter, that we just created. And we can click on OK. The file has been imported and processed. And now let’s just double-check that it’s done the job correctly, as we had in the preview. So this is parsing.txt; we can double-click on the file. And here, as you can see, we haven’t imported all of the content from the TXT, like all of this; we have only extracted the content for translation. As we can see here, we have even processed the b tags, or HTML, and even the placeholder student.level. We can scroll down and make sure that everything is done correctly. Now, in terms of the context and the comment, down here you will see that we have, for example, Localization Academy with the context ID header, because the string Localization Academy has the header context, or the header identifier.

And then the comment is developer comment 1, so down here we can see developer comment 1. If we click on for companies, for example, we can see that the context ID is bottomRight. So for the string for companies, we have bottomRight, and then developer comment 5, which we can see down here. So we have seen how to import the content for translation, how to extract those strings for translation using the advanced text filter, and how we can also use groups with regular expressions in order to capture the context and the comments in memoQ. Alright, so this is the end of part two for memoQ. I hope you have enjoyed this part and learned some new stuff using the advanced text filter with this CAT tool. In the third and final part, you’re going to learn how to set up and run through the translation, and why that’s important when creating a project for translation. So that’s all for now, and I hope to see you in the next video. Bye.
