Registered Member
Hello fellow KDE users and forum members (and hopefully a Simon dev?),
For three weeks now I have been trying, nearly daily, to get Simon (http://www.simon-listens.org/) to work, with no success. The main problem is that Simon does not recognize anything I'm saying (I did over 150 trainings of the simple word "firefox" and it did nothing). I hope this will be my last forum post asking for help with this, so I will be extra detailed.

Problem: Simon does not work; more specifically, it does not recognize/trigger commands when I say the trained word. It is not even possible to tell whether Simon is listening (apart from the activated button): no sign of partial detection or activity is shown. There are no errors in the process of model creation.

My system: Gentoo Linux on amd64 with KDE 4.11.2. If more info is needed, please say so.

Audio conditions: The mic is built into a webcam and works flawlessly with Skype/Audacity. The sound server is pulseaudio. Low background noise is present from the PC itself (7 % of mic volume). The speaker (my voice) volume is around 50-70 % normally.

Installed software: Simon 0.4.1, Julius 4.2.2, sphinx3 0.8, pocketsphinx 0.8, SphinxTrain 1.0.8, htk-3.4.1 with HDecode.

Desired use case: I want to control my PC with voice commands.

Language: I'm from Germany and would like to do the above in German. I want to use the new German acoustic model as a base if possible; if that is not possible for any reason, I would build my own acoustic model.

Speech model: I want to use the Sphinx backend (selected in ksimond). I also want to use the data from Voxforge for the German base speech model.

Simon settings: Defaults, except that the SNR is set to 150/500/200 (tried all three, but with no success) so that I can record with my mic without warnings about bad samples. For the cutoff level I tried 2000 and 1000, but there is no explanation of what these numbers mean or stand for! I also used sox in the post-processing commands to filter out my background noise, and it worked really well, but I removed it again because I think that when I speak "live" it could confuse a Simon that was trained on clean samples.

What I found out:
- The old Simon model from Voxforge is for HTK, not Sphinx (and no longer works with the new Simon 0.4.1). I tried it anyway (again with some conversion by hand) -> result: not working.
- The new model is not compatible out of the box with Simon.
- There seems to be no working German base model out there that is usable out of the box with Simon.
- The base model does not contain a shadow dict.
- There is no howto available on setting up the new Simon 0.4.1 correctly that also has a troubleshooting section or anything like it. I will write one up if we succeed here.

What I did so far (basic approach): Nearly everything. First I downloaded the new German model from http://goofy.zamia.org/voxforge/de/ (voxforge-de-r20140311.tgz). Then I extracted the parts necessary to build a .sbm archive and wrote a little XML file for it. The files are: feat.params, feature_transform, mdef, means, metadata.xml, mixture_weights, noisedict, transition_matrices, variances. I tarred all this up (sketched below) and loaded it into Simon. This is now also available here: http://kde-files.org/content/show.php/% ... 604f1dd260

As far as I understand, Simon only works if the phonemes used to make the base model are also used in the shadow dict and the vocabulary, so I imported the voxforge.dict that was in the /voxforge-de-r20140311/etc/ folder. I used the "Sphinx" format to import it; I hope this is the right one. So now I have the new German Voxforge Sphinx model and the dict, right out of the archive.
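(Roughly, the packaging step looked like this. This is reconstructed from memory; the exact source directory name, and whether Simon wants the archive gzipped or as a plain tarball, are assumptions on my part:)

[code]
# Sketch of how I packaged the Sphinx model files plus my metadata.xml
# into a .sbm archive for Simon. The model directory path is a placeholder.
cd voxforge-de-r20140311/model_parameters/voxforge_de   # assumed path
tar -czf german-voxforge.sbm metadata.xml feat.params feature_transform \
    mdef means mixture_weights noisedict transition_matrices variances
[/code]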
I set up a new scenario and trained the word "firefox" (150 times); the word was already in the dict with the "right" pronunciation. Good news: Sphinx compiles the new model with the training data and there is no error. All looks good and set up. I try to speak "firefox" into the mic: nothing happens!

Second approach: I purged all Simon configs and set up the generated base model again, then downloaded the scenario "Fensterverwaltung" (window management) and tried to train and use this instead. Also no success; nothing is recognized, but the model builds fine... Additionally, I get some red vocables in the vocabulary window, which means (I read this in a German howto) that the pronunciation is not the right one for my base model. This should go away with training the word; well, I trained it 34 times and nothing changed, so I gave up.

Third approach: I purged all Simon configs and then tried to train Simon without a base model, again only with the word "firefox". As said before, this did not work. Here I used the same voxforge.dict from the German base model, but did not import the speech model into Simon.

Some hopeless/out-of-despair approaches:
- Tried different dictionaries (Ralf Herzog's, the Julius one from the new German model), and tried importing them in different formats.
- Tried different scenarios.
- Tried the English HTK model; also no success.
- Mixed some approaches and tried them countless times; trained over 100 MB of data...
- Installed HTK and Julius as a last-resort measure, but that did not work either.

My questions:
- Biggest question of all: HOW DO I GET IT TO WORK (with the new German model from above, if possible)?
Some other things that interest me:
- I read that HTK is needed for adapted models and for working without a base model, but doesn't Sphinx provide this functionality if I use Sphinx as the training backend?
- What about the noise filtering with sox: is it better to train with it or not? Is this post-processing command also applied to "live" speaking (when Simon is activated)?
- How can I find out whether I am using the right dict for the base model?
- Why did it not work even when I used no base model?
- Do HTK/Julius and Sphinx collide if both are installed?
- As Simon is a "frontend" for Sphinx, is it possible to use Sphinx directly to see whether Sphinx recognizes anything at all? I could only find documentation on how to train it, not how to use it.

Other means of help: If someone really knows their way around Simon and has an idea why it is not working here, or just wants to check my system and specs etc., I would consider a TeamViewer session and allow access to my PC (monitored by me).

Notes: I did set up Simon's grammar and program tabs and all that, so it should work in theory! There was one time when there was a recognition in the "Erkennungskontrolle", but I could never repeat this success; even after I trained 120 more times it did not work. This happened only once in my countless tries so far. For clarity I did not use screenshots; if needed, I will be happy to provide them. If you need more information, please don't hesitate to ask me for it.

Where I also discussed this problem (maybe there is some specific info in these threads that is not in here, but I doubt it):
http://voxforge.org/home/forums/other-l ... 6j9-SV5Blw
http://voxforge.org/home/forums/message ... Je6AgjmM5Q

THANKS A 1000 TIMES
Registered Member
Hey guys
I played around some more and found a few things out: I trained the words with Simon, exported the model, and tried it out with pocketsphinx (sketched below). Result: pocketsphinx has a nearly 100 % recognition rate. For me this means the following:
- Training with Simon works, and the changes are actually reflected in the acoustic model.
- Only the actual recognition does not work (maybe more, but there is no "last recognized" line... so I guess this is the core of the not-working part).
- So a misconfigured Simon is the more likely reason. (Maybe?)
- Would it also mean that Simon uses sphinx, and not pocketsphinx, to recognize my speech?
- I also found out that if I add a new word (not in the dict and not in the language model I adapt), it is somehow processed, but not added to the .lm file, so pocketsphinx can't recognize it. I think it has something to do with the senddump file, but I can't make sense of it.
- It also means that if Simon does not "want" my speech, I could use some other program to control my PC. It looks like a bug in Simon or something... But I don't know any others except Blather, and I don't think that would work with the German model. Do you know any such programs? Or why/how I may have misconfigured Simon?

Note: I also saw that in ksimond some programs are missing in the Sphinx backend, e.g. bw, map_adapt and some more, but I don't know whether that is a problem or not.

Thanks
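PS: For reference, my pocketsphinx test looked roughly like this. The paths and file names are placeholders for whatever Simon's model export produces; the flags are the usual pocketsphinx 0.8 ones:

[code]
# Decode a test recording against the model exported from Simon.
# File names below are placeholders, not Simon's actual export names.
pocketsphinx_continuous \
    -hmm exported_model \
    -lm exported_model/model.lm \
    -dict exported_model/model.dic \
    -infile test_firefox.wav
# Without -infile, pocketsphinx_continuous decodes from the microphone.
[/code]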
Moderator
Hey there,
Simon developer here. Don't worry, we'll get it to work.

Okay, first of all: there is a new German Voxforge model? That is awesome! I hadn't even known about this... Great job to whoever did this!

So now to your problem(s). It appears that there is something fundamental wrong with your installation (which means it should also be quite trivial to debug, so don't worry). But because you tried so many different things to get it working, I am a bit confused about your current configuration. Before I go into details, a few general things:

1. You are right when you say that words marked in red are disabled because of incompatible transcriptions. However, you can *not* train those through model adaption. What really happens is that your base model does not have phonetic coverage of the transcription of these words (likely because of a different transcription style), but adaption can only *adapt* existing models (triphone representations), not add new models (new triphones). In essence that means you need to change the transcription to align with your base model, or use a wholly user-generated model to support such words. The new model looks quite cool, but it still has a massive speaker bias because almost all the data comes from only a handful of speakers. If you do adaption or use a user-generated model, you will probably be better off. Regardless, this really doesn't have anything to do with the problem that Simon doesn't appear to recognize anything; I just thought I would point it out.

2. The HTK is needed to adapt HTK/Julius models, not for adapting SPHINX base models. If you remember where you read something to the contrary, please point it out; I'd like to fix that.

3. Noise filtering with sox is, as you rightly thought yourself, mostly counterproductive - *especially* if you don't do the same for the decoding (where setting it up is slightly trickier because it needs to be plugged into the audio stream). I would strongly recommend against it. A better approach is to do this at the pulseaudio level if noise turns out to be a problem. There are a few plugins for noise reduction, the most popular of which (which you probably even have installed already) is their slightly confusingly named "echo cancellation" plugin. If you want, you can try it out (a quick loading sketch follows right after this list): http://userbase.kde.org/Simon/Tips,_Tri ... ncellation If you do, I would suggest that you use a custom model (no base model) and re-record all training data through the filter. Again, I would only do that after you have a running system and are experiencing problems.

4. The question of having the "right" dictionary for the base model is not very straightforward to answer, as the information basically disappears during training (it's not really possible to reconstruct the dictionary used to compile the hmm definitions). For this reason, the documentation (description) of the base model usually includes that information / the name. The new Voxforge model appears to be using the Voxforge German dictionary, but if you want to be absolutely sure, I recommend getting in touch with the guys that built it.

5. HTK/Julius and SPHINX do not collide; they can safely be installed in parallel.

6. Simon uses pocketsphinx (or rather libpocketsphinx) to decode audio when using the SPHINX backend.
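To try the echo cancellation plugin quickly, loading the module at runtime is enough. A minimal sketch, assuming pulseaudio is running; the source name is just a label made up for this example:

[code]
# Load pulseaudio's echo cancellation module for the current session;
# add the line to /etc/pulse/default.pa to make it permanent.
pactl load-module module-echo-cancel source_name=ec_mic
# Then record through the new source, e.g. by making it the default:
pactl set-default-source ec_mic
[/code]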
This has me a bit worried: are you ultimately trying to use an N-Gram for decoding? That isn't really supported right now.

So after we have gotten all this out of the way, let's look at the problem. I want you to turn on "Keep recognition samples" in the Simond settings (KDE's System Settings > Accessibility > Simond). Then restart Simon if it was running. Now say a couple of commands and quit Simon. Next, open ~/.kde4/share/apps/simond/models/default/recognitionsamples. Are there recordings in there? If yes, would you mind mailing them to me? (peter ate grasch ` net)

We'll go from there.

Best regards,
Peter
Registered Member
Hello Peter
Really, really nice to hear from you. Thanks for answering my questions in such detail.

First the mega good news: Simon works now! But I'm a bit scared that I can break it again. The bad news: this kind of happened accidentally and I can't reproduce how I got it to work. I don't know why it works; it simply started to work when I tried it out again. As always, I was trying out many things:

- I installed cmuclmtk-0.7, but this was not a requirement, so I don't know if it fixed anything. It certainly did not work right after the install, because I checked then. But maybe after a restart.
- I was googling how to use pocketsphinx directly, exported the trained language model from Simon, and used that data directly with pocketsphinx (pocketsphinx_continuous).
- I also started sam and tried to get some information about the language model, specifically which dict it would use. However, sam is not built for providing this info; I played around a bit, but without a specific goal and, as far as I can tell, without an immediate result.
- I synced with the server (Simond); I had never done that before.
- I changed the lowercase trigger word to uppercase like it was in the vocabulary, so these now match 100 %. (The vocabulary was in uppercase before, from the dict import, and the trigger words were in lowercase.)
- I reduced my mic input volume to 95 % with pavucontrol.
- I added more scenarios, more specifically the "Erkennungskontrolle" (recognition control) scenario. Here I have a question: what does it do? This is not really well documented. I noticed that the GUI changed after that; there was a submenu "Filter" with a checkbox beside it somewhere (I noticed this the day before yesterday). The Erkennungskontrolle seems to do something with the filter, but what exactly? I removed the scenario again, and I have the impression that the GUI changed back, because I can't find the Filter submenu anymore. Also, when I first had the "Erkennungskontrolle" added, I think that was the only time it recognized "something", but no commands.

So far I can still use Simon... I just can't pinpoint what made it work. I hope it stays that way!

My other questions:
- What is the "cutoff" level in voice activity detection? In what unit is it measured?
- How can I train/adapt the pocketsphinx model with an adapted model from Simon for use on my Raspberry Pi? Is that even possible, as they are two different formats, I think?
- Does pocketsphinx support something like your awesome minimum SNR level?
- There are some Sphinx programs that are not found: bw, map_adapt, mllr_solve. Are they needed?

Side notes: Peter, I really appreciate the work you do, for the disabled community but also in general, as not many people seem to be interested in working on Simon or on accessibility programs for Linux. This is really important work! I hope you will continue to work on Simon. Also, thank you very much for the good support here in the forum! I would like to gift you something via Paypal/Amazon gift card; do you have a special address for this? I am also helping with the German model: I'm reading a book from Gutenberg :-D

HTK confusion: I found the passage here: "If you want to train or adapt HTK acoustic models, you also need the HTK. If you don't know what that means, you can safely skip this whole step." It was not clear to me that Sphinx would not be used to train an HTK model, as somewhere else it is stated that both can be used if I make a custom model.
So for me this read like: "If you want to train or adapt TECHNICAL_TERM_THAT_DOES_NOT_MATTER_BECAUSE_BOTH_CAN_BE_USED acoustic models, you also need the HTK." So I installed it, as I thought Sphinx would maybe also use the hidden Markov model to make sense of my speech. In fact this seems logical, but it seems they get the model from somewhere else or use another one. Anyway, here is the link: http://userbase.kde.org/Simon/Installation

Thank you very much for Simon and your support!
Moderator
Hi there,
hehe, don't worry so much about breaking it. Simon actually includes a rollback feature: if you mess something up, you can go to Settings > Configure Simon > Synchronization and select an earlier system state to roll back to.

Again, I'm gonna address your points individually:

* cmuclmtk has nothing to do with your sudden success.
* I am pretty sure you *did* in fact synchronize with Simond before; it happens automatically after every change in the model input data per default.
* The casing is actually important. If that didn't match, no commands will ever be executed, as the command system is case sensitive. However, you should have still seen the command in the Simon main window (like: "FOO BAR (no associated command)").
* Reducing the microphone volume is also a good idea - around 7 % background noise is actually *a lot*. You may have a better time with only a few percent (like 1 to 2) and a lower "signal" (like 20 %). If your background noise is too loud, it will confuse the voice activity detection, which will result in major problems during decoding (if Simon cannot identify where your command ends, no decoding will ever take place, as Simon will keep recording, waiting for you to "stop talking").
* The "Erkennungskontrolle" scenario adds "Erkennung pausieren" (iirc), which activates the filter, and an equivalent command for deactivating it. Please refer to the documentation of the filter command for details. Basically, if this is the first scenario in your list, you can use it to activate/deactivate Simon with voice commands.
* The cutoff level is the level below which Simon considers the recording to be "silent". This is important for the voice activity detection, as that is based around distinguishing noise from spoken commands. There is no dimension; this really is an absolute maximum value that Simon is looking for in the audio frame. Think of it as the highest point in the audio wave for every frame; the value range is that of a 16 bit signed integer. If you have a lot of background noise, it may make sense to increase it, but normally you can just leave it alone.
* If you load your pocketsphinx model as a base model, check the "adaption" box and train some samples, Simon will adapt it for you. The current adaption procedure for SPHINX is not 100 % perfect, though. If you want to get your hands dirty, you can possibly squeeze some more accuracy out of your data set if you adapt it manually; it's not that hard (see the sketch further down in this post). Or you could help us improve the automatic adaption procedure.
* The pocketsphinx command line tool has its own SNR calculation, yes. But I'm not quite sure what exactly you are referring to here. What are you looking for exactly?
* You will need those (bw, map_adapt, mllr_solve) if you want to do SPHINX model adaption. For that you will need to install a current version of SphinxTrain.

Regarding the HTK confusion: well, this is a problem then. Yes, SPHINX also uses HMM models, but the HTK is just one software suite to deal with HMMs; SPHINX implements comparable functionality (in SphinxTrain, SphinxBase). The actual fact is that you'll need the HTK to adapt HMMs in the HTK format - in other words, "HTK acoustic models". If you have a better phrasing to make this clearer, please let me know, or better yet: edit the wiki article. Thanks!

About the side notes: thank you very much for your kind words, I appreciate it. If you do want to help the Simon effort, you are doing the most important thing already, and that is contributing to better speech models.
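To pick up the manual adaption point from above, here is the rough shape of a MAP adaption run with the SphinxTrain tools. Treat it as a sketch: the directory layout, the feature type (-feat) and the senone flag (-ts2cbfn) are assumptions and have to match your base model.

[code]
# Hedged sketch of manual MAP adaption; all paths and the -feat/-ts2cbfn
# values below are assumptions that must match the base model.

# 1. Extract feature files for the adaption recordings
sphinx_fe -argfile model/feat.params -samprate 16000 \
    -c adapt.fileids -di wav -do mfc -ei wav -eo mfc

# 2. Accumulate observation counts against the base model
bw -hmmdir model -moddeffn model/mdef -ts2cbfn .cont. \
    -feat 1s_c_d_dd -cmn current -agc none \
    -dictfn voxforge.dict -ctlfn adapt.fileids \
    -lsnfn adapt.transcription -cepdir mfc -accumdir accum

# 3. Interpolate the counts into an adapted copy of the model
map_adapt -moddeffn model/mdef -ts2cbfn .cont. \
    -meanfn model/means -varfn model/variances \
    -mixwfn model/mixture_weights -tmatfn model/transition_matrices \
    -accumdir accum \
    -mapmeanfn adapted/means -mapvarfn adapted/variances \
    -mapmixwfn adapted/mixture_weights -maptmatfn adapted/transition_matrices
[/code]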
But if you do also want to contribute some money, please consider "joining the game": as I'm sure you know, Simon is a KDE project, and a lot of it would not be possible without the continued support of the wider KDE community. By joining the game, you can become a supporting member of the KDE e.V. and thereby contribute to the budget that allows the KDE e.V. to organize the yearly Akademy conference and a wide variety of development sprints. You can find more information about this here: http://jointhegame.kde.org

Best regards,
Peter
Registered Member
Hello Peter,
Thanks for the quick reply.

Regarding "However, you should have still seen the command in the Simon main window (like: 'FOO BAR (no associated command)')": this I did not see.

I also made no changes to my audio setup. The only thing I did change was that I removed the tsched=0 option from the pulseaudio config, so that Audacity works perfectly, but Skype does not anymore (that is another story). This could maybe be it, but on the other hand I had this removed for some time and it still did not work. For me this is a bit of a mystery.

I think it would be a good idea to make the dict always lowercase, or to match the vocabulary and the trigger words to the dict's case style. As it is, they are imported uppercase, while people normally write lowercase or mixed case, but not all uppercase, so this will hinder a proper Simon setup right from the start.

With the pocketsphinx training I mean the following: as written before, I added a word (TUX) that was not in the dict and whose pronunciation was not in the language model of the new German model. Simon was able to detect the new word without problems. I then exported the adapted model and extracted it to use it with pocketsphinx on my Raspberry Pi (no GUI installed, so I can't use Simon there). There, pocketsphinx was not able to recognize TUX; however, words that are in the dict did profit from the training with Simon. So my question is: how do I train completely new words with Simon and make them available in a model/form that pocketsphinx can understand, as the exported model seems not to be enough? (Wild guess: the new words are in the senddump file, which Simon incorporates live, while for pocketsphinx it must be precompiled in?)

My idea with the SNR was: I could use Simon to control my media player while music is playing. The music is "noise" somewhere around level 1-2; when I then speak at level 6 and set the SNR to 200, Simon should pick up my voice and ignore the noisy music in the background, so I would be able to control the music player while it is playing. As this whole setup should go to my Pi someday, it is important for me to know how to set the SNR directly with pocketsphinx so I can use it on the Pi.

I also noticed that the command I'm using at the moment (Computer Firefox) has some unintended side effects. The grammar defines "T P"; Computer is set as T and Firefox as P. From what I read, this should behave like: "Computer Firefox" -> FF starts; "Computer" -> nothing; "Firefox" -> nothing (as the trigger/program word is not present). But:
1. Firefox starts even when I just say "Firefox" without "Computer" in front of it.
2. It even starts when I cough in a certain way.
3. Sometimes it starts just from the sound of keystrokes.
I have to say I use the German model as base and have not done much training yet - only 28 times - so I guess 2 and 3 will get better with more training, but number 1 troubles me the most.

Also, can I use Simon in this manner: "Computer Firefox", so that Firefox is labeled as a program, and then later in the same scenario also as a trigger word, e.g. "Firefox new Tab", where Firefox is now the trigger? Or does this confuse Simon?

Another question: is the recorded volume of any importance? As you suggested, I could turn my mic down, but then the recorded speech volume is quite low. Does this influence decoding badly?

Do you think it is worth trying out the svn version of Simon? Are there any new killer features?
(I read already that with the dictation feature Simon will record anything; this is also something I am looking for, for things like: google "HERE IS SOME TEXT".) Does this actually exist? Like giving the recorded speech to a preset command? I could not find this in the actual dialog settings.

Found a bug: I get this behavior when I activate power training. In my new scenario, Mypaint, I created a training text: "Computer Mypaint". When I now click next in the training field, it should record until I press next, and then record the next sentence, etc. But it does not: it starts recording and after some seconds it switches directly to the next recording without waiting. This happens so fast that I can't speak the sentence in time, and it keeps recording bad samples; it looks like it is in a hurry. For better reproducibility, here is my config and everything; just run the training and activate power training. Sometimes it works nearly through the whole training, so maybe more than one try is needed: http://www.sendspace.com/file/okeu21

Also, it would be good if it were possible to bind some grammar structure to another condition, like the active window. For example: I have the structure "Objekt Aktion", and when it recognizes "Ebene löschen" ("delete layer") it executes a shortcut. However, it will execute the shortcut every time it recognizes these words, even when the Mypaint window is not active and I just say these words to someone else. I saw the "Context" tab and could add the condition, but then the whole scenario would only work on the Mypaint window. So I would need another scenario just to start Mypaint with my voice, as "Start mypaint" is not recognized because there is no Mypaint window yet. Or maybe add an exception to the context window so some commands can be executed without the context condition being met.

BTW: I will think about "the game", but for now, as a student, such costs are a bit too high for me.

Thanks a lot
Manu
Moderator
For the future: the recognition results are displayed in the lower right of the four boxes. Honestly, I also have no idea what else could have been the problem, but if your cases mismatched, that certainly is a problem that would absolutely prevent Simon from working.
The dictionary is not uppercased during import. But most dictionaries are all uppercase by nature; if you get one with proper casing, it will be properly cased in Simon as well. In any case, writing scenarios is something that I would consider power-user territory (by definition), so matching case is something I think we can expect people to do. I did try to make problems with this apparent when I added the label to the commands view showing what would need to be said to trigger the command, and when I added a list of the exact sentences that Simon is able to recognize (in the Examples tab in the Grammar section). If Simon starts to apply casing automatically, the system will become much less predictable (there is definitely value in keeping the casing of the dictionary), and I do not necessarily want to make all commands case insensitive. Maybe an indicator in the command section would be nice for when we realize that a command could not possibly be recognized, but even that is tricky, because for example hierarchical scenarios have no idea what grammar structures their parents will provide, so they can't tell whether something can be recognized or not.

Regarding the "pocketsphinx training": if you build your own user-generated model, you will have no problem decoding "TUX". If you use an adapted model, it will not and can never work. You ran into what is probably a bit of a confusing feature of Simon when you are trying to see what works and what doesn't: Simon will try to provide the user with a working decoding system whenever possible. This is motivated by the fact that we eventually expect people that are dependent on voice input to use it, for whom a broken system is a big problem. Because of that, Simon keeps working models around while it *tries* to apply new changes. If something fails or goes wrong, it falls back on the old (existing) model. This still produces an error message if something actually *goes wrong*. However, there is a preprocessing step that can lead to subtle, silent "failures". We call this preprocessing step the "model adapter". This is where we try to catch (and potentially correct) common problems with your input data. One such problem is insufficient data to build a model. For example: you set up Simon to use a static base model and do not have any training data. Then you switch to a user-generated model. Obviously, there is no model to build, so Simon will silently keep using the base model until you provide it with training samples. Again, we do this to ensure continuity of service. Of course, in this case it appears obvious that this is what would happen. However, when you, for example, have words whose transcription is not phonetically covered by the base model you are trying to adapt, then these words (and their associated samples) are *removed* from the training corpus before adaption begins. If this is all you have, you suddenly have no training data, and Simon falls back to the static base model, as there is nothing else it can do. I think this is what might have happened for you.

The model that you export from Simon's configuration screen is the current active model without context modifications applied (i.e. the full, unrestricted domain that results from combining all selected scenarios, ignoring all restrictive context conditions).

About controlling Simon while music is playing: you will almost certainly need an external microphone for this to work.
Even then, I strongly recommend using echo cancellation *and* training, or at least adapting, your own acoustic model to account for the echo cancellation filter. There really is no SNR to set here; pocketsphinx calibrates automatically.

That you get unintentional invocations is pretty much expected if you have a very small domain. A slightly unconventional but quite effective "cure" is to add words that do nothing, to give the decoder something to "recognize" in these cases. A proper solution will require some techniques during acoustic modeling (generating the base model). Adding "Firefox new Tab" etc. is not a problem.

The recorded volume is of course important, but unless it is extremely low, you'll be fine. You should optimize your volume in such a way that you *never* get any clipping (even when you raise your voice, for example). If you set it up like that (which will likely mean that when talking normally you'll hit around 20 - 25 % in Simon's display) and still have too much background noise, then please seriously consider getting a better mic.

You can of course always try Simon's git version, but at this point there is little upside compared to 0.4.1. The dictation feature is still under development.

What you describe here is not a bug; this is what the power training is supposed to do. It uses Simon's voice activity detection to detect periods of "silence". It works the same as decoding: it is not time triggered, but rather advances when it thinks you are done speaking ("long" pause detected). If it cuts you off mid-command, that means you are making a longer pause than Simon expects, and this will be a problem during decoding itself too (as the decoder will get your sentence split into two parts because of the "long" pause in between). I suggest you raise the length of what Simon considers a "long" pause. By default this is 350 ms. To change it, open Simon's configuration, go to Recordings > Voice Activity Detection and change "Tail margin". Try, for example, 800 ms or even a full second. Do try it out; this is actually a feature I really love: it makes training extremely quick, and, as you never have to touch mouse or keyboard throughout, the recordings won't have any background noise of you clicking buttons.

About binding grammar structures to other conditions: this is why scenarios have hierarchy. Context conditions are transitive. If you want to do it very cleanly, you could think about this:
* MyPaint
  - MyPaintStart (contains Start; Context = not(running(mypaint)))
  - MyPaintControl (contains some general commands for Mypaint; Context = running(mypaint))
  - ToolOptionA (contains some specific stuff for when the tool options for tool A are currently open; Context = isActiveWindow(mypaint - tool option A))
Of course, you could even think of writing a context plugin that tells you more about Mypaint's internal state.

If you think the Join the Game initiative is too expensive (I get it), you can still donate however much you feel appropriate on KDE's general donation page: http://www.kde.org/community/donations/ Thank you in advance.

Best regards,
Peter
Registered Member
Hello Peter,
Thank you very much for all of this information; your support is really great! Unfortunately, I don't understand all of it fully.
Regarding "That you get unintentional invocations is pretty much expected if you have a very small domain": what do you mean by that? Do you mean a small count of samples to learn from? Could this be reduced if I record more samples?
Regarding "add words that do nothing, to give the decoder something to recognize": what do you mean exactly? Should I:
- ONLY say something like "Do something now computer firefox", where only "Computer" and "firefox" are in the grammar/vocabulary structure and would start Firefox?
- Or also add "Do", "something", "now" to the vocabulary? Does this also mean it must be reflected in the grammar structure, like: Nonsense Nonsense Trigger Program? Do I need to train them?
Regarding "A proper solution will require some techniques during acoustic modeling (generating the base model)": how would this work? I'm in contact with the guy who builds the German model; I could give him the info if you want.

Another part I did not understand: why does "Mypaint" work and start Mypaint when the grammar structure clearly states that only "Computer Mypaint" should work? How do I get it to behave like that? If you already explained it, I'm really sorry, I totally did not understand :-(

Tip: I think it should be stated here http://docs.kde.org/development/en/extr ... ration_vad that this only applies to power training, as I thought this was the general way of sample recognition.

Thanks again for such good support and such a good program
Moderator
Hi,
what I mean is this: the decoding can be thought of as selecting the statistically most likely hypothesis based on the present evidence (the features extracted from your recording). If you have a very limited space of candidates (because your application grammar is so small), the decoder will be tempted to be "certain" about its results even if the evidence is quite lacking. For example: say you have a setup that ought to only recognize "Computer Internet". Then you say *something* (which may be something entirely different from "Computer Internet") and pass that to the decoder. The search space for the recognition now is: <Silence> (obviously wrong) and "Computer Internet". Which one of these is more likely? Well, it's not hard to figure out what your recognition result will be. (There is always the option that the model probabilities are all so low that the search fails, yielding no result, but this is quite rare if the decoder finds even just a little bit of evidence in favor of one of the options.)

So what you usually do is add models of the stuff that you are expressly not interested in. For example, better acoustic models have their own noise markers for coughs, flicks, etc., which is great for systems where you are basically interested in everything that is being said except for obvious noise (dictation, for example). But for pure command & control systems, it often makes sense to add a "garbage" model reflecting "average" speech. The idea is that the decoder *has* to associate a hypothesis with a recording, so you give it a model that is more likely than the stuff you are looking for whenever something other than your commands is being said. Often this is done by simply replacing every n-th word in the training corpus with a new word called "+GARBAGE+" (or similar) that is transcribed as a single "GARBAGE" phoneme (see the sketch at the end of this post). The resulting garbage model then naturally reflects "average" speech (like an average of all the sounds you'd expect from a person talking) and is a good alternative to "Computer Internet" when, for example, you say "The weather is nice".

What I meant with adding more words is to just add some random other sentence to your vocabulary/grammar that the decoder can choose when you are not saying "Computer Internet". The idea is the same, but it works without modifying the underlying model; it just won't work quite as well.

The question why "Mypaint" works is likely answered the same way: even as you say just "Mypaint", the decoder will recognize "Computer Mypaint", as that is the only thing your grammar allows.

About your tip: it's not only for power training. Power training and the normal recognition share the same configuration settings.

And you're welcome.

Best regards,
Peter
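PS: If you want to experiment with the garbage-word trick, mangling the corpus is a one-liner. A minimal sketch, assuming a plain-text corpus; the file names and the every-10th-word interval are arbitrary choices for illustration:

[code]
# Replace every 10th word of the training corpus with a +GARBAGE+ marker:
awk '{ for (i = 1; i <= NF; i++) { if (++n % 10 == 0) $i = "+GARBAGE+" } print }' \
    corpus.txt > corpus_garbage.txt

# ...and map the marker to a single garbage phone in the dictionary
# (the GARBAGE phone then gets trained like any other phone):
echo '+GARBAGE+ GARBAGE' >> voxforge.dict
[/code]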
Registered Member
Hello Peter,
Thank you for this really helpful info. I just did a quick test with these settings, and it works quite well. Could I do something better?

I have one last question about echo cancellation: the wiki page says that the pulseaudio module only takes the sound coming from the speakers and subtracts it from the mic input. This could work in favor of my music idea. However, I have background noise that is NOT coming through my speakers (it is the sound of the PC coolers). Does the module have any effect then? Could I somehow supply my sox noise profile into the audio stream? (Not only for the training samples, but for the "live" feed; the offline workflow I used is sketched below.)

Thanks for your patience
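PS: For reference, this is roughly the offline sox workflow I meant; the file names are placeholders, and 0.21 is just the reduction amount I happened to use:

[code]
# Build a noise profile from a few seconds of cooler-only "silence",
# then filter a recording with it:
sox silence.wav -n noiseprof coolers.prof
sox recording.wav recording_clean.wav noisered coolers.prof 0.21
[/code]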
Moderator
No, that looks quite good within the boundaries of what's easily doable.
Yes, echo cancellation cancels noise from a known noise source, but the pulseaudio plugin actually does a bit more as far as I am aware (or at least it did when it was initially introduced); it is just called echo cancellation because that is its primary objective. It's nothing too major, though, so if the noise from your coolers is causing problems, you'll need to try something else.

So, first question: are your coolers really causing problems? If so: what kind of cooling do you have there? oO

In any case, you'll want to do this in pulseaudio. I'd recommend you look around; maybe you can find an implementation of a noise cancellation plugin for it. Julius, btw, does have a spectral subtraction plugin in its front-end. But your first choice here should really be to do this in pulseaudio, in my opinion.

Best regards,
Peter
Registered Member
Hello Peter,
Thank you for answering all my questions. I hope Simon will now continue to work; if not, I guess I will write here again.

Another tip: it would be good if there were a category "Simon Dictionaries" on kde-files.org. I would upload the German dict that the newest model was built with, and I don't want to upload it under scenarios or something; that way people would get the right one. I don't know whether you are the maintainer of the Simon category on KDE-Files, though.

Thanks again for all your kind words, and keep up the good work!
Manuel
Moderator
You're welcome.
Having a dictionary category on f.k.o seems like a good idea! The way this works is to simply request the category over at http://opendesktop.org/feedback (say where you want it to be placed, what it will be used for, etc.). You can mention my name if you think you need extra credibility, but it's not like I or anyone else has any special rights to the "Simon" category on kde-files.org (a subsidiary of opendesktop.org).

Best regards,
Peter
Registered Member
Hi Peter,
I wrote to opendesktop.org; if it gets accepted, I will upload the corresponding dict.

Another question popped into my mind: I saw a guy on YouTube who said: "Answer: how far from mars to earth". "Answer" was the trigger word for a Google search for the rest of the sentence; Google then opened with that search string and the results, and you could even have them read aloud to you... Could this be done with Simon? Like having the dialog plugin put the recognized words into a script or text file? This could also have some other uses, I guess; the text file/script could then make use of the words.

I know that for this to work we would need a good German model (but it is actually not that bad!) and many words in the active dictionary. For this, I'm thinking it would be good to have an extra active dictionary that is only available to the dictation plugin, so other scenarios don't get messed up when you want to do something and it sounds too similar to other words, etc. This big dict would only be available to the decoder when you dictate something, and the results should be written to a file or given to a program or something.

How doable is this? Could it be made available in the next release?

Thanks a lot
Moderator
What you are talking about here is called dictation. We are currently working on integrating this into Simon but this is still in progress. To answer your question: Not now, but in the future.
It's not as easy as putting more words in the active dictionary, btw; it needs a slightly different approach than the grammar setup we have in Simon now. I worked on a prototype last summer; please see the posts starting with "Open Source Dictation" on my blog for more details: http://grasch.net

Best regards,
Peter