Data Collection for AI – Raffaele Pascale From Venga Global

by Andrej Zito
on June 30, 2021

What is data collection? And how is it used in localization? Find out in this interview with Raffaele Pascale – the NLP Localization Program Manager at Venga Global.

Andrej Zito

Raff. Welcome to the podcast.

Raffaele Pascale

Thank you.

Andrej Zito

Good to have you here. Attempt number two.

Raffaele Pascale

Yes, attempt number two. But the second will be the best.

Andrej Zito

Yes. How are you doing, man?

Raffaele Pascale

Good, good. I’m doing very good. Thanks for asking.

Andrej Zito

Good, good. You recently moved down to Spain? I’ll start with a very basic question. How did you get into localization?

Raffaele Pascale

Well, I guess I got into localization many years ago, too many years ago. I would say 11 years ago I started my career as a translator. I started with STL as a life science translator, because it was my passion at the time. Well, it’s still my passion, and one of my favorite topics to translate. After some months, there was a client with some project management requirements. I handled them without any particular issue, and they proposed that I move to the management side. So I accepted the proposal, and I started my long career on the management side, no longer in translation. I worked for them for three years, managing everything from life science to IT, because I also wanted to explore as many fields as possible on the management side. Then I came back as a translator, because at that time there weren’t so many, I want to say, agreements that we could solve. I just moved to being a translator again, because I wanted to move back. After some years, I saw that it was again the time to move back, and I started at Venga. While I was a translator, I wanted to explore as many LSPs as possible, because it’s not only a matter of working but also a matter of tools: many companies use many different tools on the translation and quality side. So I touched on those. Once I was sure about my knowledge, I came back as a manager and started my career at Venga, where I currently work as a Program Manager, and more specifically as a data collection project manager. But the passion for languages is something that I’ve always had.

Andrej Zito

So yeah, thank you for the introduction. Spoiler alert: in this episode we will talk about data collection, which to me is sort of a new field within our space. But I’m still interested: you mentioned life sciences and your passion for them. How does one actually develop a passion for life sciences? Is it just that you like to inspect bodies?

Raffaele Pascale

I like reading about medicine, and a good translator is a translator who knows what he’s talking about. So that’s why I was very skilled in this. At the same time, I also had my sister in nursing, so I was, you know, bombarded with medicine from all around. And yes, I always loved this aspect of science: medicine, life science, biological stuff. That’s why I started directly with life science. But I did some tests initially, because when you are a translator you do some tests to see where you are more skilled, and the tests also demonstrated an aptitude for life science. So it’s always been a passion that I’ve cultivated over time.

Andrej Zito

Is there something like data collection for life sciences, or is it mostly suited for IT projects?

Raffaele Pascale

Well, I would say it’s mostly for IT projects. We need to keep in mind that a data collection project is always related to artificial intelligence systems. Unless they are preparing something that operates in a machine way; I mean, with artificial intelligence everything is possible, okay. So data collections are needed to build an artificial intelligence system: it can be a voice recording one, a translation one, or a handwriting recognition one. There are many. But you can definitely prepare, maybe, a TM with machine translation for life science, helping you translate that kind of content when you want to start.

Andrej Zito

So would you say that the purpose of data collection is to train AI?

Raffaele Pascale

Correct. The purpose of data collection is to collect all the data, put it into a machine, and make the machine learn from all that data. And it learns, of course, what you’re teaching. So if you are teaching the system to recognize voices, you will put in recordings of people with all the accents of the language that you need. If you need a machine translation system, because it can be an artificial intelligence that translates for you, then you need to put translations inside, of course always coming from humans, because a machine needs to learn from humans initially. If it’s handwriting, you can teach the machine to recognize the handwriting of different people: you just scan the copies, you put them into the machine, and the machine will recognize it. The same goes even for image translation. For example, if you want the machine to translate images for you in real time, you can train the machine to read the images, have those images translated by humans, and then put that into the machine, and the machine will be built to translate in real time something that appears on the screen. There are some systems already existing that you can use for that.

Andrej Zito

Right. So AI is a fancy word these days; everybody’s jumping on the bandwagon. Maybe for the people who are not familiar with it, try to explain to us why you think we should be training machines. A lot of people, you know, are scared for their jobs.

Raffaele Pascale

Sure, because of course, if you prepare a machine to do the work for you, it can actually be dangerous for the employees. But that is not always true. There is always a limit for a machine on the quality side: we cannot pretend that the machine delivers the hundred percent quality that a human can, because it always needs our review. Maybe we can call it a light review, but a human review is still needed before the content is published on the market or used for a voice recognition system or whatever system. The need that companies have for building an artificial intelligence is surely to be scalable. You build a system that helps you to provide a service and to decrease the turnaround times for those requests. It can be translation. Maybe a company can be at ease with machine translation, because they know that they always translate the same content, always in the same specific field. Once the machine translation has all the glossaries and all the translations for all the languages internally, it’s easier to provide a draft to the reviewer than to provide a source to be translated from scratch. So you see that the times are greatly reduced, even if maybe it’s just a matter of one sentence that you need to publish for a spot. It’s easier for the company to have that. That’s why today there are also, to my surprise, artificial intelligence systems providing copywriting. Yes. It’s unbelievable.

Andrej Zito

I used it very recently.

Raffaele Pascale

Yes. And it’s super interesting, because it seems very human. But it really risks being repetitive; I mean, many repetitions over time, because the computer has a certain limit of sentences that it will provide. So let’s be sure that we’re not always using the machines just because they can give a good draft. I’m definitely always for the human hand, I mean, for a double check before content is published.

Andrej Zito

I already did a few episodes where I try to learn something new, where I don’t talk about the things that I know. And data collection is definitely something that I have almost no idea about. Like I mentioned, I think it’s a new field that’s emerging in our industry. So just like we tried last time, I think it would be great if you could walk us through how a data collection project works, because that’s how you started in data collection, right, as a project manager. Right now I can only draw a parallel to a standard translation or localization project, which you also have experience with: the customer sends you a request, they send you files for translation, you have a TM, you have a glossary, you send it to translators or you do machine translation, then you send it for post-editing and it comes back, then you do a review, and then you deliver. Okay, so let’s say that’s the baseline. Maybe you can guide us slowly, step by step, through the process of data collection. Does it start the same way as a translation project, where the client reaches out to you with a certain request and specifics? What do those specifics look like, or is the request at a general level?

Raffaele Pascale

So basically, the request comes in the same way as for a normal project: with an email, a source, and no references. So it can be an overall project for some of my colleagues. It comes without references because, yeah.

Andrej Zito

This is where I’m going to ask a question, because you mentioned that it comes with a source. Is there actually a source for data collection projects?

Raffaele Pascale

Yes, the source is always random sentences, because you need a base from which to prepare the strings to be translated. Usually English is the first language used, also because on the programming side the machine learning is programmed in English. So it’s more useful to use English as the source language, let’s say as the standard language, to start with the translations. If you are preparing something multilingual, of course, it then depends on the language for which you’re preparing the machine translation, or sorry, the artificial intelligence system. We are not talking only about machine translation systems but about any artificial intelligence, because it’s data collection. So the source depends on the scope of what you need to build. Let’s say you are building a voice recognition system. There will be strings to be recorded, and the strings are random, because the system just needs to hear the voice, the accents and the language. The topic doesn’t matter, because here you’re preparing the system just to recognize the voice. But if you need a translation system, you will need to put translations in. On the basis of what you are preparing, you decide the language, and then you will need to prepare the target in that language. In this sense, you just prepare the machine to translate random sentences into that language. We need to specify that, because things change if you want to be more specific, like terminology consistency within paragraphs, or other specific aspects of translation. For example, there are also genders. I mean, the machine cannot learn genders alone. We can teach the machine to use a specific gender, but only one, not both. And that’s the problem. I’d say that’s the second step.
Because then, depending on the machine that you want to build, you will need to add details. Let’s say we now train the machine to recognize genders. Then you will need a translation plus labeling project, because you need to teach the machine which is the pronoun in English and which is the pronoun in the translation. With the labeling, in the source you mark the pronoun with, let’s say, a P, and in the target you put a P around it in brackets. This labeling teaches the machine that this pronoun is translated into this language in this way, so the machine will also learn the genders. Then you can imagine that you can play with whatever you need to teach the machine: pronouns, plural rules, genders. This is important because, you know, when you build something you want it to be as perfect as possible. Especially in an artificial intelligence system, you need to reproduce those systems mechanically, even if we are talking about language. Language is built on mechanics; you know, there are rules. So hopefully we can teach these. Now, the challenge comes when you need to give context. The machine is not always able to use the same terminology twice across, for example, two passages divided by full stops. Then you can train the machine to do in-context translations. So this is the way a machine actually works. The rules that you need to teach should be reproduced with the help of a human; it’s not possible to avoid this step.
And again, it depends on the machine. For handwriting, for example, if you need the handwriting system to cover multiple languages, you ask people from multiple languages to write sentences for you, always from a source. Those sentences will then be put into the machine, and the machine will simply take pictures of the graphs, because at that moment the machine should learn just the graphs, not the terms or the translation. So the source always depends on the scope. Then, for example, say you want to use a machine to translate immediately for you, let’s say, medical advertisements, medical signs, the signs that are in hospitals. Then you would direct your source entirely towards life science, because what the machine should learn is all the terminology about medicine. In that sense, you will not need to teach the machine financial stuff or economic stuff. You restrict the field to the one that you need most. So, it depends on the scope. Yeah.
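The pronoun labeling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `[P]...[/P]` tag format, the function names and the Italian example sentence are hypothetical, since the interview does not specify the actual labeling scheme. Pairing labels by order of appearance is also a simplification, because word order can differ between languages.

```python
import re

# Hypothetical tag format: [P]...[/P] marks a pronoun in source and target.
# The interview only says a "P" is put around the pronoun in brackets.
TAG = re.compile(r"\[P\](.+?)\[/P\]")


def pronoun_pairs(source: str, target: str) -> list[tuple[str, str]]:
    """Pair labeled pronouns in source and target by order of appearance."""
    src = TAG.findall(source)
    tgt = TAG.findall(target)
    if len(src) != len(tgt):
        # A mismatch usually means a label was forgotten on one side.
        raise ValueError(
            f"label mismatch: {len(src)} in source, {len(tgt)} in target"
        )
    return list(zip(src, tgt))


def strip_labels(text: str) -> str:
    """Return the plain sentence with the labels removed."""
    return TAG.sub(r"\1", text)


# Illustrative labeled pair (English source, Italian target).
example_source = "[P]She[/P] is a doctor."
example_target = "[P]Lei[/P] è una dottoressa."
pairs = pronoun_pairs(example_source, example_target)
print(pairs)
print(strip_labels(example_target))
```

A check like `pronoun_pairs` could run before delivery to catch strings where the label is present on only one side, leaving the linguistic judgment itself to a human reviewer.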

Andrej Zito

Does it mean there always has to be a source? Is it possible that the client just comes to you and says: hey, I want to build this thing, but I just have this idea, can you build everything for me? Or do they always have to provide you with the source?

Raffaele Pascale

This step is not something that we, as an LSP, decide. I mean, we don’t decide which artificial intelligence system our client needs, but we help to build it. For example, a client says: hey, I am building a voice recognition system, what do you recommend we start with? So we can recommend: okay, let’s start collecting the strings in the language that you prefer. The number of strings is important because, of course, you cannot pretend that the machine learns from 50 strings; that’s really tiny. So you decide the volume, and we assist the clients in that sense, but not on the system that they need to build. But if they want our assistance in building scripts or tools for checks, for example, this is something that we do. As a program manager, you also need a little bit of engineering experience to know what is possible and what is not in terms of checks, because there are some things that are not possible to do without a human; a human is still needed here. And it plays a big role that you put the tools together with a human to provide very good quality in the end. So we can build the tools for quality checks, but not for the building of the machine, I mean, of the artificial intelligence system. Sorry, but I’m

Andrej Zito

Sorry, but no.

Raffaele Pascale

Yeah, AI to me is always machine translation. But what I want to say to the people thinking that humans will be replaced: we can really forget that this will happen. Even if machine translation gives a quality where you say, hey, this is super good, and we can see that some of the existing machine translation systems have improved really a lot during the last three years. We can reach a point where machine translation will be good, but a human still needs to add the details, because a machine cannot reach that level of detail. In detail-oriented translations, it’s impossible for it to provide the same as a human can. Even if in a sentence maybe you only need to modify a space, it’s still a space that a human should check, you see what I mean. So the machine cannot replace humans one hundred percent. Maybe one day I can see the figure of the translator changing into an editor. This is something that I see, because translators will be asked more and more to perform post-editing instead of translation from scratch. But then, if the demand for post-editing rises, I think the price will rise as well, because this is how the economy works. Clients cannot expect good quality at a low rate either. So it will be a sort of change. But, you know, it’s like.

Andrej Zito

Let’s go back to our imaginary process. I think last time you were talking about scope. And my question was: for typical translation projects, we talk about word counts. How do you evaluate the scope of a data collection project? Is it based on the number of strings?

Raffaele Pascale

It always depends on the scope, but we can base it on word count, on the number of strings, or on an hourly rate. It depends on the task, because it’s also possible that you are at almost the final stage of your machine translation system, let’s say, and you need to test how it translates. So you want to provide ratings of the machine translation, for example. In this sense, this is an hourly task, because we cannot estimate how much time a translator will spend reviewing on the basis of the word count. In my case, to me, that could be unfair. It seems fair to evaluate this on an hourly basis, because it takes time to check whether a translation is good. So the accounting changes on the basis of the scope as well. That’s why I said I wish it could be like a normal translation project, but it’s not. It starts from the beginning, from the quote, because the quoting step is the one where you define whether this will be word count, hourly rates, or a per-string rate. And you need to be in the quoting phase forever and ever, because the scopes keep changing. I mean, you start with a client at a certain step, but then over time you evolve, and then you have other steps and steps and steps.
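As a small worked example of the three pricing bases mentioned here (per word, per string, per hour), a quote could be totalled as below. This is a sketch only: the `Task` structure, the task names, the quantities and all rates are made-up numbers for illustration, not figures from the interview.

```python
from dataclasses import dataclass

ALLOWED_UNITS = {"word", "string", "hour"}


@dataclass
class Task:
    name: str
    unit: str        # "word", "string", or "hour"
    quantity: float
    rate: float      # price per unit; hypothetical numbers


def quote_total(tasks: list[Task]) -> float:
    """Sum quantity * rate over all tasks, rounded to two decimals."""
    total = 0.0
    for task in tasks:
        if task.unit not in ALLOWED_UNITS:
            raise ValueError(f"unknown pricing unit: {task.unit}")
        total += task.quantity * task.rate
    return round(total, 2)


# One task per pricing basis; every number here is invented.
tasks = [
    Task("translation of random sentences", "word", 5000, 0.10),
    Task("voice recordings", "string", 750, 0.50),
    Task("rating of machine translation output", "hour", 12, 30.0),
]
print(quote_total(tasks))
```

Because the scope keeps changing, a quote built this way can be extended by appending new tasks with their own unit instead of forcing everything into a word count.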

Andrej Zito

Okay, so let’s say we have the scope of the project. Then how do you find the right people for the project?

Raffaele Pascale

In our case, we rely on supply chain, a team that helps us to onboard the linguists. But the supply chain management team always needs as much information as possible about the projects for which they need to onboard the linguists because, of course, linguists will never accept everything with a yes, yes. I wish, again. But hopefully they read the scope and see whether they are skilled for it. It’s possible that a linguist is not skilled for a project, and then the supply chain acts within the entire process. The important thing is to pass on the scope. As we said before, the scope is the most important thing. Once you address the scope, you have in your hands the supply chain part and also the quality strategy part, because knowing the scope, you can put in place a stronger quality strategy that helps you to review and to deliver good quality to the client requiring that project. So the scope is the main point of a data collection project; it is the focus of everything, I always say, because once you have understood the scope, you can put a plan in place. Otherwise, no, because it’s not a normal translation project, even if it looks like one. We have a style guide, as all other translation projects have, but we don’t have a glossary.
As you can imagine, we translate randomly sampled strings, or we record randomly sampled strings, or random sentences in a poster or in an image. They are random, so there is no glossary and no reference at all, but you still need to provide the terminology on the basis of the context of your sentence. This is the difference: while for a normal translation project the translator must follow the context, for a data collection project you should not follow the context at all. You should translate what you see in the string, then you go to the next string, which is something completely new, even if it might be related. Only if we instruct you to do so do you take the previous and next strings into account as context. Otherwise the standard rule is to not take the context into account, because this is the way a machine should be trained initially, just to understand that A is A, B is B, C is C. You know, the basics. Then you go on with advanced steps, on the basis of the machine that you are building. So if you want your machine to move on to detecting context and providing good quality translations, then you go with in-context translations. Then, on the quality side, we always need to double check that the scope was really understood. So usually my main focus is to have a kickoff call with all my linguists, both the linguists and the reviewers working on it, because explaining those scopes within a style guide is really tricky. It could be confusing, or it could be clear for you because you have the knowledge of certain languages, but maybe not for Hindi, maybe not for Japanese. We work with 110 languages all around the world, so we are dealing with 110 grammars. A scope that is clear for one may not be clear for another.
So the kickoff call helps us to reduce time, of course, and to have real-time communication with all the linguists and their queries. Because the second step to take into account is not only the management but the quality. You need to provide good quality: you are training a machine, and the results that the machine gives will be what you are providing now. So we need to be sure that the quality provided is 100% good and that it can be published anywhere in the system.

Andrej Zito

So you mentioned the linguists, but from my very limited experience with data collection, in my last job there was another team working on data collection, and a lot of their data projects were simply crowdsourced. For example, when you mentioned the voice recordings, you don’t need linguists for that, right? You just need people in different languages, with different accents and dialects, speaking that language, and you use that to train the machine. So do you always need linguists? Or is it just for specific projects?

Raffaele Pascale

I would say we always need professionals; that’s even better. I mean, it’s true that you don’t need translators or linguists for recording, but imagine you need to translate 750 strings in 100 languages. You cannot pretend that all the people involved will follow your rules as you instructed. I wish, again. So many wishes, but we cannot live on wishes. We need to be sure that those strings are recorded, and recorded properly, without any whispering and without any noise in the background, because otherwise the machine will have those distortions inside. So you need to prevent them, and that is still human quality, you see, still a human check. Because, you see, the human cannot be replaced. Even if you are crowdsourcing, it’s not always certain that the quality the crowd provides is the one that you need for the machine. That’s why I call them linguists, though they may not be linguists. We usually use linguists for that as well, because we rely on the professionalism that they have. So we can do it in several ways, but definitely not the way you would with a non-professional recording for you. For those projects, I built, together with our engineering team, a specific tool for recording strings. That’s what I mentioned before: we can help to streamline the process and build tools that bring together all the requirements of the client. Say the strings should have a pause one minute before and one minute after. Then you build the system in a way that the pauses are always present, because nobody will always respect this one minute of silence, then speak, then one minute of silence. I mean, it’s also a matter of concentration, I understand that. So you need to think about those kinds of issues that you can face over time.
And you definitely cannot review 110 languages yourself, because you don’t even speak all those languages. So you will need one language lead for each language, at least checking sample sentences for you, and still a professional, because again, you cannot rely on someone who is not used to doing those kinds of jobs to check your quality. We really do not feel confident about crowdsourcing this kind of work. Again, it depends on the scope and on the quality. And then, for the future, the people we see in localization can be linguists, but also computational linguists, who are even more important for us than the linguists themselves, and then all the others that can come.
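The pause requirement described above lends itself to an automated pre-check before a human listens to the recording. The following is a minimal sketch, not the tool mentioned in the interview: it assumes the audio is already available as a list of amplitude values between -1 and 1, and the function names, the amplitude threshold and the sample data are all hypothetical.

```python
def edge_silence(samples, sample_rate, threshold=0.02):
    """Return (leading, trailing) silence in seconds.

    A sample counts as silent when its absolute amplitude is below
    `threshold` (an assumed value; real projects would tune it).
    """
    n = len(samples)
    lead = 0
    while lead < n and abs(samples[lead]) < threshold:
        lead += 1
    trail = 0
    # Stop before overlapping the leading silence already counted.
    while trail < n - lead and abs(samples[n - 1 - trail]) < threshold:
        trail += 1
    return lead / sample_rate, trail / sample_rate


def passes_pause_check(samples, sample_rate, min_pause_s):
    """True when both the pause before and the pause after are long enough."""
    lead, trail = edge_silence(samples, sample_rate)
    return lead >= min_pause_s and trail >= min_pause_s


# Toy recording at 10 samples per second: 2.0 s silence, speech, 0.5 s silence.
demo = [0.0] * 20 + [0.5, -0.4, 0.3] + [0.0] * 5
print(edge_silence(demo, 10))
print(passes_pause_check(demo, 10, min_pause_s=1.0))
```

A check like this only catches missing pauses. Whispering, background noise and pronunciation still need the human language lead described in the answer.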

Andrej Zito

What do computational linguists do?

Raffaele Pascale

Yeah, the computational linguist is the more skilled person, who gives advice on how a system should be built in order to build the artificial intelligence system. Let’s say I’m building a system in Italian, and I need this system to recognize all the masculine and feminine forms. As a computational linguist, I can instruct you: from the linguistic point of view, we should look at this; from the engineering point of view, you should look at that. So it’s one person having two skills together. That’s why computational linguists are super helpful for us at the advanced level where, for example, there are labeling projects for which we need to put tools in place. This is super useful, because the computational linguist can already see where there could be issues from a language perspective. So those are the skills that you need from a computational linguist. And it’s super fun. I mean, I love speaking with them.

Andrej Zito

Are they like scientists?

Raffaele Pascale

I call them, like, linguistic nerds. It’s not just a matter of the language itself. It’s definitely a matter of knowing your language very well, and knowing very well the engineering behind the language: what you can do with those machines from the computational side. Can I get a machine that translates one noun into two genders for me? Yes, it’s possible with this rule. So that’s what a computational linguist can help you with.

Andrej Zito

You started talking about the tools. What I was wondering, when we’re talking about translation, where you collect the translations: are we using the same tools that we use for normal translation projects, like CAT tools or a TMS? Or does it need a completely new set of tools?

Raffaele Pascale

I mean, it depends on the scope, but we built a specific tool for translation in data collection projects. I need to admit, and I can understand the approach, that translators are very tempted to apply machine translation in whatever tool they use. In SDL Trados or memoQ, there are plugins that you can apply: MyMemory, Google Translate, Amazon Translate. We need to avoid that, because if we use the same output that the machine is already using, we are not teaching anything to the existing machine. To train a machine translation system, I would be using what a machine translation system already produces. Google Translate, Amazon Translate, Microsoft Translator: those are all machine translation systems freely available to everyone. But the machine will not learn much from that, because it’s a machine generating the strings. We need a human translating the strings from scratch. So we built what I call a translation tool. It’s not a CAT tool, because it does not add any assistance from any TM; we don’t need it. The strings are randomly selected, and you don’t need to use the same terminology. Pay attention, that’s super important. That’s why the TM is completely out of the discussion. And we definitely disable the copy and paste function. So the linguists can translate, but they cannot copy and paste: Ctrl+C and Ctrl+V are disabled. If they want, they can type it in manually; feel free to. But we need to prevent any use of machine translation. Let’s say I have five strings, okay? And they pre-translated those five strings with the help of Google Translate and edited the strings.
Maybe some never needed editing, because they were so simple that the machine translation translated them perfectly, but still, a human could do better. We know that a human can always do better. Let’s say three were pre-translated with machine translation and two were done manually. The machine translation system will learn only from the two out of the five, because the other three were already generated by a machine. So the machine is not learning anything. That’s why we need to prevent any machine translation usage by the linguists. Otherwise, the machine translation will not learn anything new; they will be putting in strings that the machine translation already contains. You know what I mean?
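One way to support this rule, beyond disabling copy and paste, is to flag submissions that are nearly identical to known machine output so that a reviewer can take a closer look. The sketch below is an assumption-laden illustration, not Venga’s process: it presumes the machine output for each string has already been obtained by some other means and is passed in as a plain dictionary, and the function names, string ids, example sentences and the 0.95 threshold are hypothetical. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def flag_suspected_mt(human: dict, machine: dict, threshold: float = 0.95) -> list:
    """Return ids whose submitted translation nearly equals the machine output."""
    return [
        string_id
        for string_id, text in human.items()
        if string_id in machine and similarity(text, machine[string_id]) >= threshold
    ]


# Hypothetical submissions and machine output, keyed by string id.
human = {
    "s1": "Vivo a Fuengirola.",
    "s2": "Abito in una casa piccola.",
}
machine = {
    "s1": "Vivo a Fuengirola.",
    "s2": "Vivo in una piccola casa.",
}
print(flag_suspected_mt(human, machine))
```

A flag is not proof of machine translation use: as noted in the answer, very simple strings can come out identical when translated by a careful human, so flagged items are candidates for human review rather than automatic rejections.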

Andrej Zito

I know what you mean, I get the idea. Maybe this is a very basic and stupid question for you, but I’m trying to understand: what is the profile of the client that comes to you with a request? You mentioned that we’re not teaching the machine anything new if we’re using Google Translate. So does it mean that when their request comes in, they already have, let’s say, Google Translate as a basis, and they’re trying to teach it something new?

Raffaele Pascale

Let’s say that Google Translate is, in any case, a system, an artificial intelligence system, okay? I mention Google Translate because it’s the most famous, but it’s still a machine translation system. Any machine translation system has a base, let’s say, of sentences and translations inside, because it’s easy to put in subject, verb, object, and this is the translation of this, this and this. So any basic machine translation system has simple translations inside, because it’s only a matter of the algorithms behind it. If you understand the grammar of English compared to the grammar of a European language, you can train the machine by translating a very basic sentence: I am Raffaele and I live in Fuengirola. This sentence is super easy for nearly any machine translation system. And this is what Google Translate already has. So those are the kinds of translations you are putting in. Don’t imagine super complex strings: they can be strings of four words, even two words, or ten words, but with the easiest words you can imagine. The simplest is the best, because then the machine will learn as much as possible from simple constructions, but from as many constructions as possible. So it’s quantity and not quality in this sense, because one day the machine translation will reach that quality. So initially, when building something for machine translation, we look at the quantity. And on the quantity side, we don’t want the linguists using any existing machine translation, because we know that the basics are already in the machine translation of any client.

Andrej Zito

Okay, so to extend on the question where I’m still confused: why do clients even come to you? What are they trying to accomplish? Why do they need their own thing? Do they need their own machine translation engine? Or...

Raffaele Pascale

Yeah, correct, because they improve the existing machine translation engine, or they improve the existing voice recognition system, maybe to recognize the voices of people with impairments, for example. The basis of the artificial intelligence is there, but then it needs updates, you know. To reach real intelligence, to be really intelligent, it takes time. So clients come to us to improve the machine translation that they built over time: first with no specific language, then with specific languages, then expanding to whatever they need. Maybe they see that they are getting a lot of requests for, let’s say, translation from Arabic into Chinese, and they ask us to prepare and translate sentences from Arabic to Chinese, and then they will prepare the system to translate those strings better. So they come to us to improve the systems that they have at the moment, because we can help to prepare and manage all the content that needs to be inserted into the machine. It can also be e-commerce sites, for example, where we are translating four or two words, because they are preparing a terminology base for their e-commerce platform, and there will always be the same products: bags, mobile phones, agendas, mice, you know. Once you put the terminology base into an artificial intelligence system, it can then manage it wherever it needs to: in the translation of a description or the translation of a title, depending on what it needs. And we help the machines to translate this better. With the human translation that we provide, the machine will learn from those human translations and will teach itself to translate better. So if you saw improvements, for example, in Google Translate from English to Spanish in the last years, it means that in the last years a number of words have been put into the machine translation that helped the machine to translate better, but it still needed the human part.
So humans still play a role.

Andrej Zito

Are the data collection projects sort of like an ongoing program? Let’s say you have a client and you help them improve the machine. Is it a process that is never ending? Or is it a process that actually stops because the machine has learned, let’s say, enough?

Raffaele Pascale

Well, I can tell you, if a person can never learn enough, imagine a machine. So yeah, no, I mean, this is never ending, because it can be as specific as you want to go. Maybe you finish preparing, let’s say, your machine translation, and it translates English into Spanish so well because you focused only on that language. But maybe you still have this machine not translating your descriptions of hotels as a human would. So you can move on to teach the machine to translate as a transcreator would do, you know what I mean? So putting in transcreation, to give even better human translations to the machine. There are as many steps as you want to improve the machine so that it gives, let’s say, the best results, the closest to the human, because this is your goal: you want to completely replace a human with an artificial intelligence, at least in the service of a human. If we’re talking about translation, you can do this in many places. I mean, once you start, you can then improve it further, as much as you want, because that’s the point: there is no limit to what the machine can learn. Once you start with the translation, you can decide to merge this translation step into recording, so you can use the same strings that the machine is learning to translate, to be heard. You will then start recording them and build a voice recognition system able to recognize the voice and even translate for you in real time. And there are so many things that you can do, beyond our imagination. It’s a shame. It’s a shame.

Andrej Zito

You know, when we talk about machine translation, if we apply machine translation to a standard translation project, it’s like AI is helping the human. Is there something like AI helping a data collection project?

Raffaele Pascale

No, because you can generate an AI through data collection, but you cannot have an artificial intelligence system generating the data collection for you. The machine doesn’t even have those data; you need to put them inside. The data collection is always human. We need to understand that, even when data collection is merely a collection of data. It depends: if you have a formula extracting strings randomly from wherever, you can do that automatically. Maybe with another artificial intelligence system you can collect the data, one that helps you to collect the data. Then you will build your machine, your artificial intelligence system, for whatever you want from the sources you collected. It can be images, voices; there is even music. I mean, our clients are asking us to double check whether the machine is really able to read the lyrics properly from the song. In this sense, you can put on Spotify, or whatever system, and have the lyrics right away without anyone putting in the text and everything. And once you have trained on songs, you can imagine that one day it can extract from videos, from whatever video you want, because if a system can recognize from songs, you can imagine from a normal video. So artificial intelligence needs data collection. I mean, data need a data collection process to be built, and data collection needs humans to be managed. In any case, if it’s translation, if it’s collection, if it’s whatever, the data need to be done by humans.
And collecting strings, for example, depends on the scope. Maybe you are collecting strings only from a specific field, let’s say, so it’s still human only. This is called curation, because a client can also ask for that: “We need to train a new system that is able to recognize strings on a given subject, for example talk about politics in TED conferences these days.” So you will need a linguist collecting those strings for you from whatever video, whatever blog, whatever, copying and pasting, but still manually. With a machine you can, you know, write a script: “extract strings for me from a path”. But we know that there are many texts, so what you collect can be rubbish, and you spend maybe more time cleaning than collecting. That’s why humans are always needed.
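
Editor’s note: the cleaning step mentioned above (automatic extraction can return “rubbish” that takes longer to clean than to collect) could look roughly like the sketch below. It is only an illustration, not Venga’s tooling. The function name, the word-count limits (loosely based on the “two to ten words” mentioned earlier in the interview) and the markup/URL filter are all assumptions.

```python
import re


def clean_collected_strings(strings, min_words=2, max_words=10):
    """Hypothetical filter for automatically extracted strings.

    Drops empty strings, strings containing markup or URLs, strings outside
    the assumed word-count range, and case-insensitive duplicates.
    """
    seen = set()
    kept = []
    for raw in strings:
        # Collapse runs of whitespace and trim the ends.
        text = re.sub(r"\s+", " ", raw).strip()
        if not text:
            continue
        # Skip obvious scraping debris: HTML-like tags and URLs.
        if re.search(r"<[^>]+>|https?://", text):
            continue
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            continue
        key = text.casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```

A filter like this only removes mechanical noise. Deciding whether a string actually belongs to the requested subject still needs a linguist, which is the point made in the answer above.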

Andrej Zito

Yeah. So speaking of humans, you mentioned that one of the key values of people is in quality management. So how is the quality check performed? You mentioned that you have your own tools for translations. Is that where they also do the review? And what do the reviewers actually look at? Is it whether the string is translated correctly, or do they look at some different specifics for data collection?

Raffaele Pascale

Our reviewers will have exactly the same style guides and references as the translators, because they need to check whether the translation has been performed as per client requirements. What they check specifically is that the translators didn’t use any terminology consistently. So, differently from a normal project... Yeah, this is amazing. Oh, god, yes, yes. The terminology should not be consistent. They see that the pronouns and genders are not consistent across strings, but only within the sentences provided. So if in the next sentence, for example, you have a source without any reference to any gender, but in your target you still need to use a gender, then the reviewer will double check that the translator didn’t use the same gender as in the previous string, but the gender that the client provided. Now we could open another big discussion on the inclusivity of the gender chosen. But maybe we can have this as another talk, because there are clients using the masculine or using the feminine, and we are excluding one or the other in both cases. So let’s pay attention when we prepare artificial intelligence systems with those, because we need to be inclusive also in this sense.

Andrej Zito

Yes. So I don’t want to open the can of worms, because you would probably go on for an hour. But if you can try, in, I don’t know, a few minutes: what would be your recommendation to the client when it comes to inclusivity? Do you try to tell your clients, okay, let us translate, I don’t know, let’s say “doctor”, so in one string it gets translated as masculine and in another it’s translated as feminine? Or how would you advise the clients to train their machine to be more exclusive? Inclusive, sorry.

Raffaele Pascale

Well, we really saw a big difference, for example, recently with Google Translate providing both translations, masculine and feminine. This is very good, but we are still excluding, in any case, non-binary people. I mean, it’s true that this should be taken into account, because artificial intelligence systems are also used to build, for example, bots, chatbots, survey chatbots. And chatbots are used mostly for, let’s say, HR systems or HR companies, because it’s easier to go through the initial steps with them, you know, with the questions about nationality and so on. It’s easier to go through, but there is a risk that the chatbot will refer to that person as masculine when actually this person is not masculine, or vice versa. So we need to put the attention on trying to teach the machine to identify the genders. At least let’s put them in; then we can work together on how the machine can identify them, with the labeling, with everything that can be done mechanically. But let’s start by considering them first. Because there is a bad attitude to consider, something that has been there for ages, and I talked about this in my last webinar, about the risks of mislocalization. Microsoft, for example, taught us for so long to use the masculine where we were not able to provide a gender. This has been a bad attitude that we had for so many years, but we are realizing it now, so we can change things. And my suggestion is: let’s start at least putting in different genders.
You know, there is the neutral gender, for example, in English that can be used for now. Let’s maybe start with this in translation, as a first step, because I think that from the translation, well, from the localization point of view, we play a big role, at least in digital content, because we know that 90% of digital content is translated from English. So all the content we see is translations. So the translations can still be inclusive. So please, let’s do that.

Andrej Zito

Okay, but then would you prefer to have it all, let’s say, neutral? Or is there a way to, let’s say, translate once masculine, once feminine and once non-binary?

Raffaele Pascale

At the moment, there are existing style guides that are called diversity and inclusion style guides, and they depend on the language, because you cannot decide one for all, of course. So, yeah, for each language there are diversity and inclusion style guides that instruct you how to proceed when, for example, the sentence has no specific pronoun, so you cannot know the gender. They tell you which is, maybe, the neutral pronoun that is being discussed in the country at the moment, because we know that there are no official choices. But there won’t be anything official until we start using it. We know that language becomes a rule when you start using it. Many times we saw terms used in the spoken language that never existed in a dictionary, but then they were present in a dictionary because of the usage of that word. That’s the same thing that should happen, I think, with inclusive pronouns and inclusive translations. So if the clients are interested in this, there are definitely solutions that are already in place at the moment. Many vendors, well, many clients are already using those diversity and inclusion style guides. Because people matter, but the language that we use with people matters even more. I mean, more than whatever means of communication we can think of, because language is the first bridge we have with a person. So the way that we refer to that person is important, in whatever way, humanly or mechanically. I understand that, you know, companies need to use chatbots, I’m not against that, but let’s at least refer to people with the gender that they are. And this can be done only if the machine knows those pronouns and all of that.

Andrej Zito

Right, well, where can people find these style guides? Is it something that you created internally? Or is it something public?

Raffaele Pascale

No, it’s something public. I can pass you the references if you want. At the moment I have style guides for English, French, German, Italian and Spanish. So the most used European languages. But those are just the ones we have worked on at the moment; we can move on to whatever. I know that for Hebrew there will soon be a new one, because they reached an agreement on a new pronoun in the written and the spoken language. You know that the Hebrew language has a very, let’s say, very curious way of writing. I love it. You don’t have all the vowels in writing as you do in the spoken language. So for sure there will be a style guide for Hebrew as well, and there will be more and more, I mean, for each language. And studies show that at the moment in the world we have more languages based on gender than languages that are neutral, because 55% of languages have a gender at the moment while 44% have no gender. This means that most people are used to thinking in two genders. This is something we could open another discussion about, in another talk, but it’s definitely something to take into account. If we talk about translations, if we talk about language, then we need to take into account that it’s true that the machine can go wrong with that, but let’s start from the beginning: who is teaching the machine what? If we teach the machine to use only one gender, don’t expect the machine to use both genders, because the machine doesn’t have them. If we are not putting it in front of the machine, the machine will not learn it alone. It’s still a machine. So that’s the point. There are existing ways of doing inclusive translations. And there could be inclusive ways, I mean, to provide more and more inclusive language and translations.

Andrej Zito

All right. Going back to our imaginary general process, the last thing we talked about was quality management. Is that where the project is sort of finished for you? Or is there something else happening after quality?

Raffaele Pascale

Well, we have several steps for quality as well, because, as I mentioned before, we do not only use external experts, I mean linguists checking those, but we also develop tools internally to double check that. So there are QA checks, and internal checks that we do after the QA, which allow us to exclude whatever the human eye cannot see, because machines are still helpful in any case, we need to admit it. So spaces, commas in the wrong places, you know, those tiny things that are difficult for the human eye to detect, because of concentration or for some other reason. So we can say that the quality step is finished when on our side everything is on green lights. If there is even one thing that we are in doubt about, we still have both these quality steps to double check with the linguist, with the language lead. And thanks to the data collection teams, we learned a lot of specific grammar for each language, things that you cannot even imagine you would go that deep into in your life. Because, wait, who would have told me that in Hindi there are, you know, auxiliary verbs that are not present in English? This comes up only when you have, for example, a labeling step, when you are teaching the machine that this “is” is the verb in English, and you see that in Hindi there is more than one. So, like, hey, what’s happening here? No, no, don’t worry, it’s normal behavior. This is to say that the quality step is checking even those in-depth things, to be sure that the machine is learning properly what it should learn in whatever language is needed. And once everything is done on our side, from the quality point of view, we can consider it, yes, finished.
I mean, finished from the management side, because we deliver to the client. Then there is, you know, any client feedback that you need to implement. In the best case scenario, yes, you deliver, the client is happy and so is everyone around.

Andrej Zito

That’s what I wanted to ask, because a few times during the chat you mentioned that the idea of the whole data collection project is to make the machine learn. So let’s say you complete the project: how does the client, or how do you, evaluate how much the machine has learned?

Raffaele Pascale

On the basis of the strings. As I told you before, for translations for example, the machine gives back the number of strings learned. So “I learned one hundred strings today”, and if you put in one hundred strings, okay, that is the length of the entire file. That’s also why the client is constantly doing this kind of checking to see if you are using any machine translation, because the machine tells you whether it learned or not. Then you need to test the machine. And that’s why it’s never ending, because once you put in those strings, you can then have a second project, called an evaluation project. Now let’s evaluate what we did so far. Let’s see if the machine learned properly. If not, we will come back to the initial one and maybe provide what the machine didn’t learn. For example, let’s explain this better: we have an English into Spanish machine translation system that we are training. We put the famous five strings inside, then we will test with a Spanish linguist. We will pre-translate, but this is a second project, if the client needs it. So if the client wants to test it, the client extracts the strings from wherever, but not the same strings.
So generate five new strings, five new sources, and pre-translate them with the machine translation that we prepared with those five, let’s say, previously. Pre-translate the file, then we perform an evaluation step: we see whether the machine translation provided good translations or not on the basis of marks, from one to six, from one to ten, as you define. And we then identify where the machine, for example, was inconsistent in the same sentence. It happens a lot of times: when there is a sentence divided by a comma, for example, there is an inconsistency. Or we say that, for example, the machine didn’t follow the fluency. Then those results will be used to do a third project and teach the machine what it didn’t learn previously, but in a specific way. So the machine didn’t learn how to manage the context of the same term in the same sentence: let’s translate a hundred sentences with that same term, let’s use examples. We need to teach the machine how to translate properly in that sense, because the other stuff we know that the machine learned. So each project will give us data to improve the machine in the next step.

Andrej Zito

Right. Okay.

Raffaele Pascale

But I wish I could build the machine, I mean, the artificial intelligence system, myself. I know just the management. Yes, I just manage the work, the entire work behind it. There could be many possibilities of merging what the machine can do.
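
Editor’s note: the evaluation project described above (reviewers mark fresh machine output on a scale, flag issues such as inconsistency or fluency, and the results decide what the next training round focuses on) can be sketched as follows. This is a minimal illustration under stated assumptions: the 1 to 6 scale is one of the two scales mentioned, while the `Evaluation` class, the `pass_mark` threshold and the issue labels are hypothetical names, not part of any real tool.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Evaluation:
    source: str
    machine_output: str
    score: int                   # assumed scale: 1 (worst) to 6 (best)
    issue: Optional[str] = None  # e.g. "inconsistency", "fluency"


def summarize(evaluations, pass_mark=5):
    """Aggregate reviewer marks and rank the issues seen in failing strings."""
    scores = [e.score for e in evaluations]
    failed = [e for e in evaluations if e.score < pass_mark]
    issues = Counter(e.issue for e in failed if e.issue)
    return {
        "mean_score": mean(scores),
        "failed": len(failed),
        # Most frequent issue first: a candidate focus for the next project.
        "issues": issues.most_common(),
    }
```

With five evaluated strings, a summary like this would point at the most frequent issue (for instance inconsistency within a sentence) as the subject of the follow-up data collection round.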

Andrej Zito

Okay, that’s a little bit sci-fi. One thing that I wanted to quickly ask: I don’t know what your previous experience with translation or localization projects is, but I would say in general they’re very time sensitive; many times the project managers feel they’re under pressure. Is it the same for data collection projects? Or not, because it’s a new field and there are, let’s say, no established turnaround times or productivities?

Raffaele Pascale

I need to say that in this sense it depends on the client, because there are clients with specific planning: they need it to be ready by a certain time, so let’s put in place a plan so that we will be ready by then. And there are clients sending you the files today and saying that they needed them yesterday.

Andrej Zito

So nothing changes, right? The clients are the same.

Raffaele Pascale

Well, going from one to the other, I mean, that’s what I want to specify. When I started at Venga as project manager, I was managing normal localization projects, with the same clients, and over time I saw that those clients followed me. And I’m happy about that, because I was able to manage the same clients over time. But before, I used to manage standard localization projects for automotive, life sciences, IT. Let’s say I was a standard translation manager. But now I’m more on the artificial intelligence management side, which is even more, yeah, I mean, challenging. It’s my passion as well. So I like them or…

Andrej Zito

That’s a good thing that you mentioned, because I was wondering, for the people who, let’s say, want to make the transition from localization or translation into data collection: what was the most challenging thing for you when you transitioned?

Raffaele Pascale

I can say that the most challenging part is the multitasking part, because data collection projects often come in multiple languages, and when I say multiple, it is more than ten. And when you have more than one client with more than one project, it’s really easy to mix up the scopes, especially if they are super similar. You really risk mixing them up. So multitasking, first of all, because of the number and the volume of things that you manage at the same time. Then there is the knowledge. I should admit that, linguistically speaking, if a program manager is skilled, very skilled, also linguistically, that is a plus. Because in this sense you can understand the scope and manage the project even better than just managing what should be done around the project with the understanding that you get from another person, where you understand more or less what it is, but not in the specifics. Because the more specifics you know, the more you can help your linguists during kickoff calls and queries, and weigh less on the client side. The client, of course, helps you to go through all the doubts, all the queries, but the program manager should be able at least to minimize the issues as much as possible, and to minimize issues you need to know the scope very well. If you are not skilled in that, I can maybe suggest asking for help from a language lead, a linguist who can help linguistically and even with the management. Normally I rely on them. I mean, I always say that my language leads are my saviours a lot of the time, because I have so many jobs, especially in the delivery phase. The language leads always help you to solve the linguistic issues, because those are where you will need the most assistance.
And again, since the scope is the core, let’s say, of the project, you need to understand it very well to build the plan and the strategy. Otherwise, you can ask for help. If someone wants to change from, let’s say, standard translation projects to data collection projects, what I can really suggest is to start getting an idea of what NLP, natural language processing, is, because in this sense we can understand how the machine works, and then how we can better help our clients, at least from the management point of view. Then linguistically you will always have a language lead. But the management is the core, it is the scope again of the project. Yeah. So if the management goes well, the project goes well.
