Deploy to AWS with Ray's autoscaler

Prepare your AWS CLI login credentials

Log in to the AWS Classroom as introduced in the serverless lab.

Click the "AWS Details" button, then click the "Show" button next to "AWS CLI". You will see a block of text containing your account credentials.

Follow the instructions and copy the credentials into ~/.aws/credentials in your Docker container.
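
Optionally, you can verify that the credentials were copied correctly before moving on. The snippet below is a minimal sanity-check sketch that asks AWS who you are; it assumes boto3 is installed in the container (Ray's AWS autoscaler needs it anyway) and is not a required part of the lab.

import boto3

# This call fails if the credentials in ~/.aws/credentials are missing or expired.
sts = boto3.client("sts", region_name="us-east-1")
print("Authenticated as:", sts.get_caller_identity()["Arn"])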

Because AWS education accounts have a session time limit, you need to redo the steps above if you cannot finish your work within 6 hours.

Prepare the Ray configuration file for deployment to AWS

First, you need to get the Instance Profile ARN (used for the IamInstanceProfile field) from the AWS Console.

Step 1: Open the AWS Console and search for IAM.

Step 2: Click "Roles" in the left navigation, then click "LabRole".

Step 3: On the "LabRole" page, copy the Instance Profile ARN. We will use it to configure our Ray cluster.

Step 4: Go back to your Docker container. In the project folder ~/ray-lab, create a new file aws.yaml with the following content. Remember to fill in the Instance Profile ARN you copied in the previous step at the correct locations (line 80 and line 99, the two IamInstanceProfile fields under head_node and worker_nodes); a small helper for filling it in automatically is sketched after the file content below.

# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 1

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 10

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
# docker: {}
docker:
    image: "rayproject/ray:latest"
    container_name: "ray_container"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-east-1a,us-east-1b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: False # If not present, the default is True.
    
    # Open ports so that the Ray dashboard can be accessed externally
    security_group:
        GroupName: ray_security_group
        IpPermissions:
            - FromPort: 443
              ToPort: 443
              IpProtocol: TCP
              IpRanges:
                  - CidrIp: 0.0.0.0/0
            - FromPort: 8265
              ToPort: 8265
              IpProtocol: TCP
              IpRanges:
                  - CidrIp: 0.0.0.0/0

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: m5.large
    ImageId: ami-073d2c3aa43ed04b4 # Deep Learning AMI

    # You can provision additional disk space with a conf as follows
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 100

    # Additional options in the boto docs.
    IamInstanceProfile:
        Arn: {Your-Instance-Profile-Arn}

# Provider-specific config for worker nodes, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as SubnetId and KeyName.
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
worker_nodes:
    InstanceType: m5.large
    ImageId: ami-073d2c3aa43ed04b4 # Deep Learning AMI

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot
        # Additional options can be found in the boto docs, e.g.
        #   SpotOptions:
        #       MaxPrice: MAX_HOURLY_PRICE

    # Additional options in the boto docs.
    IamInstanceProfile:
        Arn: {Your-Instance-Profile-Arn}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # Uncomment the following lines to install Ray and its dependencies on the nodes manually:
#    - sudo apt update -y
#    - sudo apt install -y python3-pip
#    - sudo pip install ray boto3
#    - sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 10

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
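
If you would rather not edit aws.yaml by hand, the small helper below (run from ~/ray-lab) substitutes your ARN into both IamInstanceProfile fields at once. It is only a convenience sketch: the ARN shown is a placeholder, and editing the file directly works just as well.

import pathlib

# Placeholder ARN: replace it with the Instance Profile ARN copied from the LabRole page.
ARN = "arn:aws:iam::123456789012:instance-profile/LabInstanceProfile"

path = pathlib.Path("aws.yaml")
# Replace both occurrences of the placeholder (head node and worker nodes).
path.write_text(path.read_text().replace("{Your-Instance-Profile-Arn}", ARN))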

Launch a cluster of Ubuntu instances (with Ray installed) on AWS EC2

To start the Ray cluster on AWS, execute the following command in your Docker container:

ray up -y aws.yaml

After 3-5 minutes, you should see output like the following, indicating that the cluster on EC2 is ready:

Shared connection to 3.94.118.22 closed.
  [7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 3.94.118.22 closed.
Local node IP: 172.31.34.140
2021-03-01 15:24:24,518 INFO services.py:1172 -- View the Ray dashboard at http://172.31.34.140:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='172.31.34.140:6379' --redis-password='5241590000000000'
  
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop
Shared connection to 3.94.118.22 closed.
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /Users/cyliu/Documents/git/cslab/aws.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /Users/cyliu/Documents/git/cslab/aws.yaml
  Get a remote shell to the cluster manually:
    ssh -o IdentitiesOnly=yes -i /Users/cyliu/.ssh/ray-autoscaler_us-east-1.pem ubuntu@3.94.118.22

Submit your Ray distributed application from your Docker container to the EC2 cluster

Before executing the program, open main.py, comment out line 79, and uncomment line 81.
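
The exact content of those two lines depends on main.py, but the switch typically changes the Ray initialization from a local, in-container instance to the EC2 cluster, roughly like the sketch below (treat the code as illustrative, not the lab's actual main.py):

import ray

# Local mode, used earlier in the lab: start Ray inside the Docker container.
# ray.init()

# Cluster mode: connect to the running Ray cluster that `ray up` started.
ray.init(address="auto")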

Submit the program to AWS:

ray submit aws.yaml main.py    

The output should be the same as when you run the program in your Docker container.

Trigger Action 2 to increase the number of chatrooms.

The program will then become blocked because there are not enough CPU cores to handle the increased workload, and the following warning message will be printed:

2021-03-01 15:26:12,652 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff42867781e3b6e074ed26130701000000 cannot be scheduled right now. It requires {CPU: 0.500000} for placement, but this node only has remaining {0.000000/2.000000 CPU, 5.126953 GiB/5.126953 GiB memory, 1.513672 GiB/1.513672 GiB object_store_memory, 1.000000/1.000000 node:172.31.13.93}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

The program will remain blocked until a new AWS EC2 instance is spawned (about 3-5 minutes).
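
The warning arises because each actor reserves a fixed CPU fraction (0.5 CPU in the message above), and the existing nodes have no CPU left for the new actors, so the autoscaler requests another worker. The sketch below illustrates the pattern with a hypothetical actor class, not the lab's actual main.py:

import ray

ray.init(address="auto")       # connect to the running cluster

@ray.remote(num_cpus=0.5)      # each instance of this actor reserves half a CPU core
class ChatRoomActor:           # hypothetical class name, for illustration only
    def broadcast(self, msg):
        return msg

# On a node with 2 CPUs, only 4 such actors fit; the 5th stays pending,
# which makes Ray's autoscaler launch an additional EC2 worker node.
rooms = [ChatRoomActor.remote() for _ in range(5)]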

Then you can play with the program again. For example, trigger Action 1 and select user4 to send a chat:

(pid=982, ip=172.31.47.122) I am user4. Now in Ray worker node `ip-172-31-47-122`, sending message: 'Hello world'
(pid=954, ip=172.31.47.122) Receive chat from Actor(UserActor,7dbf7d92a564823f475aa2be02000000), msg: Hello world forwarding to all group members
(pid=1781) I am user3. Now in Ray worker node `ip-172-31-10-238`, receive message: 'Hello world'
(pid=983, ip=172.31.47.122) I am user5. Now in Ray worker node `ip-172-31-47-122`, receive message: 'Hello world'

You can see that user4 and user5 are on the same Ray worker node (ip-172-31-47-122), while user3 is on a different Ray worker node (ip-172-31-10-238).
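
To confirm from code which nodes are now part of the cluster (for example, that the new worker has joined), you can query Ray from any driver connected to the cluster. This is an optional check, not a required lab step:

import ray

ray.init(address="auto")
for node in ray.nodes():                       # one entry per node known to the cluster
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"].get("CPU", 0), "CPUs")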

Monitor your Ray cluster through the Ray dashboard

To build the dashboard URL, you need the public IP of the head node. Look at the last lines of the output of ray up -y aws.yaml from the step above:

Get a remote shell to the cluster manually: ssh -o IdentitiesOnly=yes 
-i /Users/cyliu/.ssh/ray-autoscaler_us-east-1.pem ubuntu@3.94.118.22

In this example, the public IP of the Ray dashboard is 3.94.118.22.
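
If you would rather not scroll back through the ray up output, you can also look the public IP up with boto3. The snippet assumes the autoscaler tags the head instance with ray-node-type = head (an assumption about Ray's AWS provider internals), so treat it as an optional sketch:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:ray-node-type", "Values": ["head"]},          # assumed autoscaler tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance.get("PublicIpAddress"))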

Open a browser and go to http://{public_ip}:8265. A list of servers (AWS EC2 instances) will be displayed. Click the '+' button beside a server.

All the allocated actors will be displayed.

Remove the EC2 instances after the lab

To save the AWS credit of your account, you should remove all EC2 instances. Execute this command in your Docker container:

cd ~/ray-lab; ray down -y aws.yaml
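
After ray down completes, you can optionally double-check that no instances are still running (and still consuming credit). The boto3 query below lists whatever is left in the region; it is a sanity-check sketch, not a required lab step:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["pending", "running"]}]
)
leftover = [
    (inst["InstanceId"], inst["InstanceType"])
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]
print("Still running:", leftover or "nothing")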
