[Previous: Implement a Genie Skill Backend] [Next: Deploy Your Genie Assistant]
Once a skill is created and tested, you can start synthesizing data and training a model for it.
If you have loaded the example skills and do not intend to synthesize your own dataset and train your own models, you can skip this page and try deploying the agent.
Parameter datasets provide realistic real-world values for Genie to use when synthesizing natural sentences. We have prepared a list of parameter datasets commonly used in natural language, such as ISO 4217 currency codes and ISO 639-1 language codes, as well as randomly sampled short and long strings as fallback datasets for parameters with no dedicated data.
From the workdir, run the following to download the prepared datasets:

```bash
make parameter-datasets.tsv
```
If your skill defines new entities, or you have parameters of string type with domain-specific values, you can append your own parameter datasets to the index `parameter-datasets.tsv`. It is a tab-separated file in which the columns specify, respectively: the parameter type (either `string` or `entity`), the language code, the entity name or string dataset name, and the path to the parameter dataset file. The entity name should be prefixed with your skill name followed by `:`, whereas the string dataset name should be exactly the same as what you put in the `string_values` annotation, which is introduced below.
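For illustration, appending index entries for a hypothetical `com.yelp:cuisine` entity and an OpenStreetMap restaurant string dataset might look like this (the file paths here are assumptions for the sketch, not paths fixed by Genie):

```shell
# Append two hypothetical entries to the index.
# Columns (tab-separated): type, language code, dataset name, file path.
printf 'entity\ten-US\tcom.yelp:cuisine\t./parameter-datasets/en-US/com.yelp:cuisine.json\n' \
  >> parameter-datasets.tsv
printf 'string\ten-US\torg.openstreetmap:restaurant\t./parameter-datasets/en-US/org.openstreetmap:restaurant.tsv\n' \
  >> parameter-datasets.tsv
```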
For parameters of Entity type, where an entity id is needed to process the parameter, developers need to supply a `json` file containing all values, each with its value in code and its string in natural language. An example of an entity parameter dataset with one value for `cuisine` in the Yelp skill is shown below. For each value, `value` and `name` contain the value in code and the string in natural language, respectively; the `canonical` field contains the tokenized name.
```json
{
  "result": "ok",
  "data": [
    {
      "value": "hotpot",
      "name": "Hot Pot",
      "canonical": "hot pot"
    }
  ]
}
```
For parameters of both String type and Entity type, developers can specify string values with an annotation. The syntax is `#[string_values=<dataset-name>]`.
For example, in the Yelp skill, we collect restaurant names from OpenStreetMap and declare the `id` parameter with a `string_values` annotation as follows:

```
out id: Entity(com.yelp:restaurant)
#[string_values="org.openstreetmap:restaurant"]
```
The string parameter dataset is in `tsv` format, where the first column contains the example values and the optional second column contains the weight of the corresponding value. You can choose any of the existing datasets from the String Datasets page, or use your own, tailored to your device. If your device must understand values from an open ontology (that is, you expect users to try values not seen at training time), it is recommended to include at least 10,000 to 100,000 training examples. Note that example values are necessary for both input and output parameters, since an output parameter can also be used in the command as a filter.
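As a sketch, the first few lines of such a string dataset might look like this (the restaurant names and weights are invented for illustration):

```shell
# Write a hypothetical string dataset: one value<TAB>weight per line
# (the weight column is optional).
printf '%s\t%s\n' \
  'olive garden' 3.0 \
  "joe's diner" 1.0 \
  'the corner bistro' 1.5 \
  > org.openstreetmap:restaurant.tsv
```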
First, make sure you have activated the installed Python virtual environment. If not, run:

```bash
source ../.virtualenv/genie/bin/activate
```
Generating a standard-sized dataset takes about 2 hours on a machine with at least 8 cores and at least 30 GB of RAM.
To test your setup, you can generate a small dataset with:

```bash
make subdatasets=1 target_pruning_size=25 max_turns=2 debug_level=2 datadir
```
When you are ready to generate the full dataset, run:

```bash
make datadir
```
The generated dataset will be in the `datadir` directory.
Within the Python virtual environment, run the following command to train a model:

```bash
make model=$YOUR_MODEL_NAME train-user
```
Set `model` to a unique identifier of the model. By default, the model is called "1".
Training takes about 7 hours on a single V100 GPU.
The model is saved in `everything/models/$YOUR_MODEL_NAME`.