Data annotation

This tutorial is based on the roles defined by DecentralML:

  • Model creator

  • Data annotator

  • Model engineer

  • Model contributor

and the corresponding tasks explained in the Decentralised Machine Learning documentation. Here's a summary:

  • Data annotation

  • Model definition and restructuring

  • Model training

For each of these tasks we present a tutorial covering all the roles involved and the corresponding functions that need to be executed. These functions are part of the Python substrate-client.

All the tasks involve the model creator, who creates the task along with the required files and assets, and validates the results.

This tutorial relies on separate scripts and files, provided as assets, to complete the machine learning tasks created by the model creator. The structure of the assets is indicated for each machine learning task.
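
Most of the client functions below take a substrate connection and account credentials as parameters. As a minimal sketch of how these could be constructed, assuming the client is built on py-substrate-interface and a local development node (the endpoint URL, account URI, and passphrase are placeholders, and the exact types expected by the client should be checked against its documentation):

# Hypothetical setup; endpoint and account are placeholders
from substrateinterface import SubstrateInterface, Keypair

substrate = SubstrateInterface(url="ws://127.0.0.1:9944")  # local node endpoint
sudoaccount = Keypair.create_from_uri("//Alice")           # development account
passphrase = "my-passphrase"                               # exact usage depends on the client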

Data annotation

This is the task of annotating data that will be used for training machine learning models. In this example, we describe a data annotation task for object detection.

Asset files

Asset files are required for this task. You can find example assets at the path substrate-client-decentralml/assets/data_annotator.

├── annotation_files
│   ├── file_1.jpg
│   └── file_2.jpg
├── annotation_samples
│   ├── sample_1.jpg
│   └── sample_2.jpg
└── start_task.sh
  • annotation_files are the files that must be annotated.

  • annotation_samples are the sample images that the annotator has to find and label in the annotation_files.

  • start_task.sh is a script that the model_creator must create and that the data_annotator executes to actually perform the task.

Procedure

  1. The model_creator creates a task for data annotators using the function:

    # decentralml/create_task.py
    def create_task_data_annotator(
        expiration_block, substrate, sudoaccount, passphrase,
        task_type, question, pays_amount, max_assignments,
        validation_strategy, annotation_type, annotation_media_samples,
        annotation_files, annotation_class_labels,
        annotation_class_coordinates, annotation_json,
        annotation_files_storage_type, annotation_files_storage_credentials)

    In which:

    • annotation_type specifies the kind of annotation, in this case object_detection.

    • annotation_media_samples is the list of samples for the annotators to detect in the images.

    • annotation_files is the list of files to be annotated by finding the samples in them.

    • annotation_class_labels is the list of classes to be annotated in the files. At least one annotation_media_sample must be provided for each label.

    • annotation_json can include additional info about the annotation, or be used to specify the task execution script (i.e. start_task.sh from the assets)

    For additional information on the Substrate parameters (i.e. expiration_block, substrate, etc.), consult the documentation of the Python client or view the example (https://github.com/livetreetech/DecentralML/blob/main/substrate-client-decentralml/src/decentralml/create_task.py).
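
    As an illustrative sketch, a call for this object detection task could look as follows. All values are placeholders, and the exact accepted values for task_type, validation_strategy, and the storage parameters are assumptions to be checked against the client documentation:

    # Hypothetical example call; all values are placeholders
    from decentralml.create_task import create_task_data_annotator

    create_task_data_annotator(
        expiration_block=100_000,
        substrate=substrate,
        sudoaccount=sudoaccount,
        passphrase=passphrase,
        task_type="DataAnnotators",                  # assumed value
        question="Find and label the sample objects in each image",
        pays_amount=1_000,
        max_assignments=10,
        validation_strategy="ManualAccept",
        annotation_type="object_detection",
        annotation_media_samples=["annotation_samples/sample_1.jpg",
                                  "annotation_samples/sample_2.jpg"],
        annotation_files=["annotation_files/file_1.jpg",
                          "annotation_files/file_2.jpg"],
        annotation_class_labels=["sample_1", "sample_2"],
        annotation_class_coordinates=None,           # not needed for this example
        annotation_json='{"task_script": "start_task.sh"}',
        annotation_files_storage_type="local",       # assumed value
        annotation_files_storage_credentials=None,
    )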

  2. The data_annotator can then list tasks with list_task (see Listing objects) and accept a task with:

    # decentralml/assign_task.py
    def assign_task(substrate, sudoaccount, passphrase, task_id)

    by specifying the task_id. Assigning a task downloads the corresponding assets for the data annotation task.
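
    For example (a sketch reusing the connection objects from the setup above; the task_id is a placeholder taken from the list_task output):

    # Hypothetical example; task_id comes from list_task
    from decentralml.assign_task import assign_task

    assign_task(
        substrate=substrate,
        sudoaccount=sudoaccount,
        passphrase=passphrase,
        task_id=1,
    )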

  3. The data_annotator can then start the task by executing the script created by the model_creator:

    ./start_task.sh

    The model_creator is responsible for defining the data annotation procedure. The outputs of the annotation must be saved in a separate folder, for example by creating an outputs folder inside the same assets folder:

    ├── annotation_files
    │   ├── file_1.jpg
    │   └── file_2.jpg
    ├── annotation_samples
    │   ├── sample_1.jpg
    │   └── sample_2.jpg
    ├── outputs
    │   ├── file_1_annotations.json
    │   └── file_2_annotations.json
    └── start_task.sh
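
    The layout of the output files is defined by the model_creator's procedure, not by DecentralML. As a purely illustrative sketch, assuming a simple bounding-box JSON format, file_1_annotations.json could be produced like this:

    # Hypothetical output writer; the JSON layout is an assumed example format
    import json
    from pathlib import Path

    annotations = {
        "file": "annotation_files/file_1.jpg",
        "objects": [
            # one entry per detected sample: class label and bounding box (x, y, w, h)
            {"label": "sample_1", "bbox": [34, 50, 120, 80]},
            {"label": "sample_2", "bbox": [210, 95, 60, 64]},
        ],
    }

    Path("outputs").mkdir(exist_ok=True)
    with open("outputs/file_1_annotations.json", "w") as f:
        json.dump(annotations, f, indent=2)
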
  4. Once the data annotation task is completed, the data_annotator can send the results using:

    # decentralml/send_task_result.py
    def send_task_result(substrate, keypair, submission_id, result, result_path, result_storage_type, result_storage_credentials)

    This function accepts a result_path parameter, which must be set to the output folder of the annotation task (i.e. outputs from the assets folder).
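
    A sketched call (the submission_id, result, and storage values are placeholder assumptions; keypair is the data_annotator's own keypair):

    # Hypothetical example; all values are placeholders
    from decentralml.send_task_result import send_task_result

    annotator_keypair = Keypair.create_from_uri("//Bob")  # placeholder annotator account

    send_task_result(
        substrate=substrate,
        keypair=annotator_keypair,
        submission_id=1,
        result="annotation completed",               # assumed value
        result_path="outputs",
        result_storage_type="local",                 # assumed value
        result_storage_credentials=None,
    )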

  5. The model_creator can list the available results for each task using list_task_results (see Listing objects).

  6. Once a result is available, the model_creator can start validating it using validate_task_results. The validation of the results can be performed according to three policies:

    • AutoAccept: the results are automatically accepted

    • ManualAccept: the model_creator manually accepts each task result

    • CustomAccept: the model_creator can implement custom methods for automatically validating the results.

  7. Once the results are validated, the model_creator or the automatic validation procedure can either accept or reject them, using accept_task_results() or reject_task_results() respectively.

    Accepting the results issues the payment to the contributor.
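
    As a final sketch of the validation flow: the signatures below are assumptions (this tutorial only names these functions), following the same substrate/sudoaccount/passphrase pattern as the calls above, and should be verified against the client source:

    # Hypothetical validation flow; assumed signatures and result fields
    results = list_task_results(substrate, task_id=1)

    for result in results:
        # result_is_correct is a placeholder for a custom check (CustomAccept policy)
        if result_is_correct(result):
            accept_task_results(substrate, sudoaccount, passphrase, result["result_id"])
        else:
            reject_task_results(substrate, sudoaccount, passphrase, result["result_id"])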
