Deep Text Corrector
Deep Text Corrector is a TensorFlow project by Alex Paino for correcting grammatical errors in short sentences. For example, typical autocorrection systems would leave the message "I'm going to store" unchanged, although the user most likely intended to write "I'm going to the store".
In this guide we will train a TensorFlow model for correcting sentences and use it to evaluate input sentences. Finally, we will deploy the trained model as a REST endpoint that can evaluate input sentences in real time.
Project setup¶
The code for this project is available on Floyd's GitHub page. Clone the project and initialize a Floyd project.
$ git clone https://github.com/floydhub/deep-text-corrector
$ cd deep-text-corrector
$ floyd init deep-text-corrector
Training¶
Dataset¶
For this project we will use the Cornell Movie-Dialogs Corpus for training and testing. The dataset should be preprocessed and split into three sets: 80% for training, and 10% each for validation and testing. This preprocessed dataset is publicly available on FloydHub.
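The preprocessed dataset on FloydHub is already split, so the following is only an illustrative sketch of what an 80/10/10 split could look like if you prepare your own data. The function name `split_dataset`, the shuffling step, and the fixed seed are assumptions for illustration, not the preprocessing used to build the published dataset.

```python
import random


def split_dataset(lines, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle lines and split them into train, validation and test lists.

    Hypothetical helper: the test set receives whatever remains after the
    training and validation portions have been taken.
    """
    shuffled = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    sentences = ["sentence number %d" % i for i in range(100)]
    train, val, test = split_dataset(sentences)
    print(len(train), len(val), len(test))
```

Each resulting list could then be written to its own text file, one sentence per line.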
Training¶
You can train the deep corrector model by running the correct_text.py script with the required parameters. Below is the command to start a training job on Floyd:
$ floyd run --gpu --env tensorflow-0.12:py2 --data floydhub/datasets/deep-text-corrector/1:input "python correct_text.py --num_steps 1000 --train_path /input/data/movie_dialog_train.txt --val_path /input/data/movie_dialog_val.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --output_path /output"
Notes:
- The input dataset is passed using the --data parameter. This mounts the pre-processed Cornell Movie-Dialogs dataset at the /input path. You will notice that other parameters use files mounted in this path.
- The data name floydhub/datasets/deep-text-corrector/1 points to the pre-processed dataset on FloydHub.
- The job runs on a GPU instance (because of the --gpu flag).
- This project uses TensorFlow 0.12 installed on Python 2 (see the --env flag).
This job takes about 10 minutes to run and generates a model. You can follow its progress with the logs command.
$ floyd logs <JOB_NAME> -t
Floyd saves any content stored in the /output directory after the job is finished. This output can be used as a datasource in the next project. To get the name of the output generated by your job, use the info command.
$ floyd info <JOB_NAME>
Evaluating¶
To evaluate your model you can run the correct_text.py script with the --decode flag.
You need a file containing short messages for evaluation. The test.txt file already has some inputs. You can update or add more strings to this file, one per line. You also need to use the output from the training step above as the datasource in this step.
floyd run --env tensorflow-0.12:py2 --data <REPLACE_WITH_JOB_OUTPUT_NAME>:input "python correct_text.py --train_path /input/data/movie_dialog_train.txt --test_path test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --input_path /input --decode"
You can track the status of the run with the status or logs command. The logs should print the corrected messages from the test.txt file.
$ floyd status <JOB_NAME>
$ floyd logs <JOB_NAME> -t
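If you would rather generate test.txt from a script than edit it by hand, a minimal sketch could look like the following. It only relies on the one-message-per-line layout described above; the helper name `write_test_file` and the whitespace clean-up are assumptions for illustration and are not part of the project.

```python
def write_test_file(sentences, path="test.txt"):
    """Write one message per line to `path` and return how many were written.

    Hypothetical helper: internal whitespace (including stray newlines) is
    collapsed to single spaces so each message stays on one line, and empty
    messages are skipped.
    """
    cleaned = [" ".join(s.split()) for s in sentences]
    cleaned = [s for s in cleaned if s]
    with open(path, "w") as f:
        for sentence in cleaned:
            f.write(sentence + "\n")
    return len(cleaned)
```

For example, `write_test_file(["I'm going to store", "I see it tomorrow"])` would overwrite test.txt in the current directory with those two messages.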
Improving your model¶
You may notice that the output does not look great. In fact, the model probably added more mistakes to the sentences than it corrected. That is because we ran the training for only a small number of iterations. To train a fully working model, run the training step again with the --num_steps flag set to a larger value. In general, about 20000 steps are necessary to produce a working corrector model. (Note: this takes a few hours to run on the GPU instance.)
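If you plan to experiment with several step counts, it can help to assemble the training command in a script. The sketch below rebuilds the same floyd run invocation shown in the Training section with a configurable --num_steps value. The helper name `build_train_command` is hypothetical; every flag and path is copied from the command above.

```python
import shlex


def build_train_command(num_steps, gpu=True):
    """Return the `floyd run` training command as a list of arguments."""
    script = (
        "python correct_text.py"
        " --num_steps {steps}"
        " --train_path /input/data/movie_dialog_train.txt"
        " --val_path /input/data/movie_dialog_val.txt"
        " --config DefaultMovieDialogConfig"
        " --data_reader_type MovieDialogReader"
        " --output_path /output"
    ).format(steps=int(num_steps))
    args = ["floyd", "run"]
    if gpu:
        args.append("--gpu")
    args += [
        "--env", "tensorflow-0.12:py2",
        "--data", "floydhub/datasets/deep-text-corrector/1:input",
        script,  # the whole Python command is passed to floyd as one argument
    ]
    return args


if __name__ == "__main__":
    # Print a shell-quoted version that can be pasted into a terminal.
    print(" ".join(shlex.quote(a) for a in build_train_command(20000)))
```

The returned list can also be passed directly to `subprocess.call` from inside the project directory, assuming the floyd CLI is installed and you are logged in.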
Evaluate pre-trained models¶
If you want to try out a pre-trained model, FloydHub has a public job output for this. You can mount it using the job output name floydhub/deep-text-corrector/23/output.
floyd run --env tensorflow-0.12:py2 --data floydhub/deep-text-corrector/23/output:input "python correct_text.py --train_path /input/data/movie_dialog_train.txt --test_path test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --input_path /input --decode"
This model should perform better on the given inputs compared to the previous one.
Serve model through REST API¶
FloydHub supports a serving mode for demo and testing purposes. If you run a job with the --mode serve flag, FloydHub will run the app.py file in your project and attach it to a dynamic service endpoint:
floyd run --mode serve --env tensorflow-0.12:py2 --data floydhub/deep-text-corrector/23/output:input
The above command will print out a service endpoint for this job in your terminal console.
The service endpoint will take a couple of minutes to become ready. Once it is up, you can interact with the model by sending the text you want to correct:
curl -X POST -d 'I see it tomorrow' <REPLACE_WITH_YOUR_SERVICE_ENDPOINT>
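The same request can be sent from Python using only the standard library. This is a rough sketch that mirrors the curl call above by posting the raw text as the request body. The function names `build_request` and `correct` are hypothetical, and because the response format of the endpoint is not documented here, the body is returned as plain text without further parsing.

```python
import urllib.request


def build_request(text, endpoint):
    """Build a POST request whose body is the raw UTF-8 text, like `curl -d`."""
    return urllib.request.Request(
        endpoint,
        data=text.encode("utf-8"),
        # curl -d uses this content type by default, so we send the same one.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def correct(text, endpoint, timeout=30):
    """Send `text` to the service endpoint and return the response body.

    Assumption: the service replies with UTF-8 text.
    """
    request = build_request(text, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `correct("I see it tomorrow", "<REPLACE_WITH_YOUR_SERVICE_ENDPOINT>")` with your real endpoint URL would then return whatever the service sends back.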
Any job running in serving mode will stay up until it reaches its maximum runtime. Once you are done testing, remember to shut down the job.
Note that this feature is in preview mode and is not production ready yet.
What Next?¶
The model was trained on movie dialogues, which are not the best source of grammatically correct sentences. One improvement would be to use other datasources such as Project Gutenberg. This project was also discussed on Hacker News, where you can find lots of interesting alternatives.
Help make this document better¶
This guide, like the rest of our docs, is open source and available on GitHub. We welcome your contributions.
- Suggest an edit to this page (by clicking the edit icon at the top next to the title).
- Open an issue about this page to report a problem.