
Deep Text Corrector

Deep Text Corrector is a TensorFlow project by Alex Paino for correcting grammatical errors in short sentences. For example, the message "I'm going to store" would be unaffected by typical autocorrection systems, even though the user most likely intended to write "I'm going to the store".

In this guide we will train a TensorFlow model for correcting sentences and use it to evaluate input sentences. Finally, we will deploy the trained model as a REST endpoint that can be used to correct input sequences in real time.

Project setup

The code for this project is available on FloydHub's GitHub page. Clone the project and initialize a Floyd project:

$ git clone https://github.com/floydhub/deep-text-corrector
$ cd deep-text-corrector
$ floyd init deep-text-corrector

Training

Dataset

For this project we will use the Cornell Movie-Dialogs Corpus for training and testing. The dataset needs to be preprocessed and split into 3 sets: 80% for training and 10% each for validation and testing. This preprocessed dataset is publicly available on FloydHub.
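
The preprocessed dataset is ready to use as-is. If you want to prepare your own corpus instead, a minimal sketch of the 80/10/10 split could look like this (the input file name is illustrative, not part of the project):

# Sketch: split a one-sentence-per-line corpus into 80% train, 10% val, 10% test.
# The FloydHub dataset is already split this way; file names here are illustrative.
import random

with open("movie_lines_cleaned.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

random.seed(42)
random.shuffle(lines)

n = len(lines)
splits = {
    "train": lines[:int(0.8 * n)],
    "val": lines[int(0.8 * n):int(0.9 * n)],
    "test": lines[int(0.9 * n):],
}

for name, split_lines in splits.items():
    with open("movie_dialog_%s.txt" % name, "w") as out:
        out.write("\n".join(split_lines) + "\n")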

Training

You can train the deep corrector model by running the correct_text.py script with the required parameters. Below is the command to start a training job on Floyd:

$ floyd run --gpu --env tensorflow-0.12:py2 --data floydhub/datasets/deep-text-corrector/1:input "python correct_text.py --num_steps 1000 --train_path /input/data/movie_dialog_train.txt --val_path /input/data/movie_dialog_val.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --output_path /output"

Notes:

  • The input dataset is passed using the --data parameter. This mounts the pre-processed Cornell Movie Dialog dataset at the /input path; you will notice that other parameters reference files mounted at this path (see the example after this list).
  • The data name floydhub/datasets/deep-text-corrector/1 points to the pre-processed dataset on FloydHub.
  • The job runs on a GPU instance because of the --gpu flag.
  • This project uses TensorFlow 0.12 installed on Python 2 (see the --env flag).
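
For example, you can quickly verify what the mounted dataset looks like by running a short job that lists the mount path (the exact listing depends on the dataset contents):

floyd run --env tensorflow-0.12:py2 --data floydhub/datasets/deep-text-corrector/1:input "ls /input/data"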

This job takes about 10 minutes to run and generates a model. You can follow the progress using the logs command.

$ floyd logs <JOB_NAME> -t

Floyd saves any content stored in the /output directory after the job finishes. This output can be used as a data source in subsequent jobs. To get the name of the output generated by your job, use the info command.

$ floyd info <JOB_NAME>

Evaluating

To evaluate your model, run the correct_text.py script with the --decode flag. You need a file containing short messages for evaluation. The test.txt file already has some inputs; you can update it or add more strings to this file, one per line. You also need to use the output from the training step above as the data source for this step.
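
For example, test.txt could contain lines like these (illustrative, not the exact contents of the repository's file):

I'm going to store
I see it tomorrow
She did not says anything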

floyd run --env tensorflow-0.12:py2 --data <REPLACE_WITH_JOB_OUTPUT_NAME>:input "python correct_text.py --train_path /input/data/movie_dialog_train.txt --test_path test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --input_path /input --decode"

You can track the status of the run with the status or logs command. The logs should print the corrected messages from the test.txt file.

$ floyd status <JOB_NAME>
$ floyd logs <JOB_NAME> -t

Improving your model

You may notice that the output does not look great; in fact, the model may have introduced more mistakes into the sentences than it corrected. That is because we ran the training for only a small number of iterations. To train a fully working model, try the training step again, this time setting the num_steps flag to a larger value. In general, about 20,000 steps are needed to produce a working corrector model. (Note: this takes a few hours to run on a GPU instance.)
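
For example, here is the same training command with a larger step count:

floyd run --gpu --env tensorflow-0.12:py2 --data floydhub/datasets/deep-text-corrector/1:input "python correct_text.py --num_steps 20000 --train_path /input/data/movie_dialog_train.txt --val_path /input/data/movie_dialog_val.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --output_path /output"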

Evaluate pre-trained models

If you want to try out a pre-trained model, FloydHub has a public job output for this. You can mount it with the job output name floydhub/deep-text-corrector/23/output.

floyd run --env tensorflow-0.12:py2 --data floydhub/deep-text-corrector/23/output:input "python correct_text.py --train_path /input/data/movie_dialog_train.txt --test_path test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --input_path /input --decode"

This model should perform better on the given inputs compared to the previous one.

Serve model through REST API

FloydHub supports serving mode for demo and testing purposes. If you run a job with the --mode serve flag, FloydHub will run the app.py file in your project and attach it to a dynamic service endpoint:

floyd run --mode serve --env tensorflow-0.12:py2 --data floydhub/deep-text-corrector/23/output:input

The above command will print out a service endpoint for this job in your terminal console.

The service endpoint will take a couple of minutes to become ready. Once it is up, you can interact with the model by sending the text you want to correct:

curl -X POST -d 'I see it tomorrow' <REPLACE_WITH_YOUR_SERVICE_ENDPOINT>
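
The serving job runs the app.py bundled with the project. As a rough sketch of how such an endpoint can be wired up, here is an illustrative Flask handler with a hypothetical correct() helper; the project's actual app.py loads the trained model from /input and will differ in its details:

# Illustrative only: a minimal Flask app that accepts raw POST text and
# returns a corrected sentence.
from flask import Flask, request

app = Flask(__name__)

def correct(sentence):
    # Placeholder: in the real project this would run the TensorFlow
    # decode path on the model mounted at /input.
    return sentence

@app.route("/", methods=["POST"])
def handle():
    text = request.get_data(as_text=True)
    return correct(text)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)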

Any job running in serving mode will stay up until it reaches its maximum runtime. So once you are done testing, remember to shut down the job.
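
If you are using the FloydHub CLI, the job can be stopped with the stop command:

$ floyd stop <JOB_NAME>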

Note that this feature is in preview mode and is not production-ready yet.

What Next?

The model was trained using movie dialogues, which are not the greatest source of grammatically correct sentences. An improvement to this approach would be to use other data sources like Project Gutenberg. This project was also discussed on Hacker News, and you can find lots of interesting alternatives there.


Help make this document better

This guide, as well as the rest of our docs, is open-source and available on GitHub. We welcome your contributions.