Trouble with predictive services

User 481 | 8/11/2015, 8:37:23 PM

We keep running into an issue where we create models and deploy them just fine, and the Predictive Service claims they exist in its stored state in S3, but requests to those models return errors.

Curl Request:

curl -X POST -d '{"api_key": <api-key-string>, "data": {"data": {"users": ["a", "b", "c"], "n": 20}, "method": "recommend_n"}}' http://mj-ps-261054367.us-west-2.elb.amazonaws.com/data/tenant1-task-recommender

Response: {"info": "No predictive object has been registered with the name 'tenant1-task-recommender'", "message": "UnknownURI"}

Python List of Predictive Objects:

>>> ps = gl.deploy.predictive_service.load("s3://.../mj-ps")
>>> ps.deployed_predictive_objects.keys()
['tenant1-task-recommender']

State.INI:

[Service Info]
Name = mj-ps
Description = Primary deployment
API Key = <api-key>
CORS Origin = 
Global Cache State = enabled

[Environment Info]
log_path = s3://.../mj-ps_logs
environment type = Ec2PredictiveServiceEnvironment
region = us-west-2
load_balancer_dns_name = mj-ps.us-west-2.elb.amazonaws.com
certificate_is_self_signed
admin_key = <admin-key>
certificate_name

[Predictive Objects Service Versions]
tenant1-task-recommender = {"version": 2, "description": "tenant1", "schema_version": 3, "cache_state": "enabled"}

[Predictive Objects Docstrings]
tenant1-task-recommender = -- no docstring found in query function --

[Meta]
Revision Number = 5
Schema Version = 3

[System]
Cache TTL on update (sec.)
Cache max. memory (MB)

Additionally, when attempting to query the predictive object in Python, this happens:

>>> ps.query("tenant1-task-recommender", data={
    "n": 20,
    "users": ["a", "b", "c"]
})

Predictive Object 'tenant1-task-recommender' can not be found. If you just deployed the Predictive Object, it may take a short while for all Predictive Service nodes to be up-to-date. Please use get_predictive_objects_status() to get most current state.

>>> ps.get_predictive_objects_status()


Cannot get node status from i-a91fc76f, error: Cannot get status for host ec12-152-26-6-151.us-west-2.compute.amazonaws.com, error: (<requests.packages.urllib3.connectionpool.HTTPConnectionPool object at 0x10872cc90>, 'Connection to ec12-152-26-6-151.us-west-2.compute.amazonaws.com timed out. (connect timeout=10)')

We're a little stumped on this one.

Comments

User 1394 | 8/12/2015, 12:29:05 AM

Hey MSH -

Most likely there is a problem loading the model / Predictive Service object. What is the output of ps.get_status()?

You can test whether the Predictive Service will be able to load the model successfully by trying the PredictiveService.test_query() API. The documentation for that API is here: https://dato.com/products/create/docs/generated/graphlab.deploy.PredictiveService.test_query.html#graphlab.deploy.PredictiveService.test_query
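For example, something along these lines should exercise the model without going through the deployed nodes (a minimal sketch that assumes test_query accepts the same name and keyword arguments you pass to query; see the linked docs for the exact signature):

>>> ps = gl.deploy.predictive_service.load("s3://.../mj-ps")
>>> ps.test_query("tenant1-task-recommender",
...               data={"users": ["a", "b", "c"], "n": 20})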

Is there a mismatch between the version of Dato Predictive Services and GraphLab Create (are you running 1.5.2 for both)?
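For reference, the client-side GLC version can be checked directly in Python (a quick sketch; if I remember correctly, graphlab.version is a plain string attribute):

>>> import graphlab
>>> graphlab.version   # should report '1.5.2' if the client matches the service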

Thanks,

Rajat


User 1174 | 8/12/2015, 12:39:03 AM

Hi,

Can you please check if the AWS instance (i-a91fc76f) is still running? Also, did you launch this Predictive Service with GraphLab Create 1.5.2? Was the model/function that you deployed built with GLC 1.5.2?
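One quick way to verify the instance state is the EC2 console, or from Python with boto3 (a hedged sketch; boto3 is not part of GLC, and the AWS CLI would work just as well):

from __future__ import print_function
import boto3

# Describe the node backing the Predictive Service and print its current state
ec2 = boto3.client("ec2", region_name="us-west-2")
resp = ec2.describe_instances(InstanceIds=["i-a91fc76f"])
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])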

It seems like your model/function was uploaded to S3 successfully; however, it did not load successfully into the Predictive Service on EC2. Can you also upload the log files (*server.log, *graphlab_service.log) from the Predictive Service so we can help debug further? The log files are located on S3 at 's3://.../mj-ps_logs'.

Thanks.


User 481 | 8/13/2015, 9:44:14 PM

The problem still persists. Querying the Predictive Service returns a model-not-found error like the one posted above. Testing with the test_query function returns recommendations. This seems to mean the Predictive Service node cannot retrieve the object from S3, where it does exist.

Posting the server log:

2015-08-13T21:13:04,{"INFO": "Created new log file", "logfile": "/tmp/2015-08-13T21-13-04.ip-172-31-38-34server.log"}
2015-08-13T21:13:04,{"fromfile": "/tmp/2015-08-13T20-58-04.ip-172-31-38-34server.log", "to": "s3://mjdatasvc/predictionsvc/deployments/mjdata-ps/logs/2015-08-13T20-58-04.ip-172-31-38-34server.log", "INFO": "Shipping log file"}
2015-08-13T21:13:04,{"INFO": "Rotating log file", "logfile": "/tmp/2015-08-13T20-58-04.ip-172-31-38-34query.log"}
2015-08-13T21:13:04,{"INFO": "Created new log file", "logfile": "/tmp/2015-08-13T21-13-04.ip-172-31-38-34query.log"}
2015-08-13T21:13:04,{"fromfile": "/tmp/2015-08-13T20-58-04.ip-172-31-38-34query.log", "to": "s3://mjdatasvc/predictionsvc/deployments/mjdata-ps/logs/2015-08-13T20-58-04.ip-172-31-38-34query.log", "INFO": "Shipping log file"}
2015-08-13T21:13:04,{"INFO": "Rotating log file", "logfile": "/tmp/2015-08-13T20-58-04.ip-172-31-38-34feedback.log"}
2015-08-13T21:13:04,{"INFO": "Created new log file", "logfile": "/tmp/2015-08-13T21-13-04.ip-172-31-38-34feedback.log"}
2015-08-13T21:13:04,{"fromfile": "/tmp/2015-08-13T20-58-04.ip-172-31-38-34feedback.log", "to": "s3://mjdatasvc/predictionsvc/deployments/mjdata-ps/logs/2015-08-13T20-58-04.ip-172-31-38-34feedback.log", "INFO": "Shipping log file"}
2015-08-13T21:13:04,{"INFO": "Rotating log file", "logfile": "/tmp/2015-08-13T20-58-04.ip-172-31-38-34result.log"}
2015-08-13T21:13:04,{"INFO": "Created new log file", "logfile": "/tmp/2015-08-13T21-13-04.ip-172-31-38-34result.log"}
2015-08-13T21:13:04,{"fromfile": "/tmp/2015-08-13T20-58-04.ip-172-31-38-34result.log", "to": "s3://mjdatasvc/predictionsvc/deployments/mjdata-ps/logs/2015-08-13T20-58-04.ip-172-31-38-34result.log", "INFO": "Shipping log file"}
2015-08-13T21:13:04,{"INFO": "Finished rotating and shipping logs"}
2015-08-13T21:13:04,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:13:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:14:03,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:14:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:14:33,{"INFO": "Updating Predictive Service"}
2015-08-13T21:15:03,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:15:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:16:03,{"INFO": "Updating Predictive Service"}
2015-08-13T21:16:04,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:16:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:17:03,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:17:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:17:33,{"INFO": "Updating Predictive Service"}
2015-08-13T21:18:03,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:18:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:19:03,{"INFO": "Updating Predictive Service"}
2015-08-13T21:19:04,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:19:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:20:03,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:20:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:20:33,{"INFO": "Updating Predictive Service"}
2015-08-13T21:21:03,{"INFO": "Publishing metrics to CloudWatch"}
2015-08-13T21:21:04,{"INFO": "Metrics submitted to CloudWatch"}
2015-08-13T21:21:31,{"po_name": "crowdcast-dev-tenant1-task-recommender", "INFO": "Data plane called with POST"}
2015-08-13T21:21:31,{"WARNING": "No cache; calling GraphLabService directly"}
2015-08-13T21:21:31,{"INFO": "Querying GraphLabService", "query": {"uri": "crowdcast-dev-tenant1-task-recommender", "params": {"data": {"users": [1, 2, 3], "n": 20}}}}
2015-08-13T21:21:31,{"INFO": "Submitting query to Graph

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Date: Thu, 21 Jul 2016 23:13:36 GMT
Server: Warp/3.2.6
Content-Type: application/json

016A
["37zyefqi2sweveyp","42fn7zeo6v5ui427","66pt5sk2wz2jrbzu","awoljknjigytdyls","cj2lanoogknwopto","cnm3adnh35xmsx3f","ebxs4t2y6xr5izzy","eg5zus2pz72mr7xb","exshwew2w2jv3n7r","hxrxgzvgms3incmf","hymu5oh2f5ctk5jr","jkisbjnul226jria","lag7djeljbjng6bu","o3l65o4qzcxs327j","qsk2jzo2zh523r24","t7k6g7fkndoggutd","xfllvjyax4inadxh","ygtjzi2wkfonj3z7","yycjajwpguyno4je"]
0


User 1174 | 8/13/2015, 10:31:58 PM

Hi MSH,

Can you also provide the graphlab_service.log? And can you paste the response when you call ps.get_status()?


User 481 | 8/13/2015, 11:14:05 PM

2015-08-13T21:13:03,{"INFO": "Created new log file", "logfile": "/tmp/2015-08-13T21-13-03.ip-172-31-38-34graphlabservice.log"}
2015-08-13T21:13:03,{"fromfile": "/tmp/2015-08-13T20-58-03.ip-172-31-38-34graphlabservice.log", "to": "s3://mjdatasvc/predictionsvc/deployments/mjdata-ps/logs/2015-08-13T20-58-03.ip-172-31-38-34graphlabservice.log", "INFO": "Shipping log file"}
2015-08-13T21:13:04,{"DEBUG": "Received request", "request": {"type": "CountObjects"}}
2015-08-13T21:13:04,{"DEBUG": "Successfully sent response"}
2015-08-13T21:13:04,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:13:04,{"DEBUG": "Successfully sent response"}
2015-08-13T21:13:15,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:13:15,{"DEBUG": "Successfully sent response"}
2015-08-13T21:13:24,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:13:24,{"DEBUG": "Successfully sent response"}
2015-08-13T21:13:35,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:13:35,{"DEBUG": "Successfully sent response"}
2015-08-13T21:13:44,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:13:44,{"DEBUG": "Successfully sent response"}
2015-08-13T21:13:55,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:13:55,{"DEBUG": "Successfully sent response"}
...
2015-08-13T21:27:04,{"DEBUG": "Received request", "request": {"type": "CountObjects"}}
2015-08-13T21:27:04,{"DEBUG": "Successfully sent response"}
2015-08-13T21:27:04,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:27:04,{"DEBUG": "Successfully sent response"}
2015-08-13T21:27:15,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:27:15,{"DEBUG": "Successfully sent response"}
2015-08-13T21:27:24,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:27:24,{"DEBUG": "Successfully sent response"}
2015-08-13T21:27:35,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:27:35,{"DEBUG": "Successfully sent response"}
2015-08-13T21:27:44,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:27:44,{"DEBUG": "Successfully sent response"}
2015-08-13T21:27:55,{"DEBUG": "Received request", "request": {"type": "CheckHealth"}}
2015-08-13T21:27:55,{"DEBUG": "Successfully sent response"}
2015-08-13T21:28:03,{"INFO": "Rotating log file", "logfile": "/tmp/2015-08-13T21-13-03.ip-172-31-38-34graphlab_service.log"}

Response of ps.get_status()

[ERROR] Cannot get node status from i-6eed4ca8, error: Cannot get status for host ec2-52-27-168-36.us-west-2.compute.amazonaws.com, error: (<requests.packages.urllib3.connectionpool.HTTPConnectionPool object at 0x1106bebd0>, 'Connection to ec2-52-27-168-36.us-west-2.compute.amazonaws.com timed out. (connect timeout=10)')
[{'models': None, 'dns_name': u'ec2-52-27-168-36.us-west-2.compute.amazonaws.com', 'cache': None, 'state': u'InService', 'reason': u'N/A', 'id': u'i-6eed4ca8'}]


User 481 | 8/14/2015, 4:36:01 PM

Also,

>>> ps.get_predictive_objects_status()
Columns:
    name                str
    expected version    int
    node.i-6eed4ca8     str

Rows: 2

Data:
+-------------------------------+------------------+--------------------+
|              name             | expected version |  node.i-6eed4ca8   |
+-------------------------------+------------------+--------------------+
| crowdcast-dev-tenant1-idea... |        5         | 0 (Failed to load) |
| crowdcast-dev-tenant1-task... |        11        | 0 (Failed to load) |
+-------------------------------+------------------+--------------------+
[2 rows x 3 columns]

Predictive Objects are failing to load on the Predictive Service. Any idea why?


User 1178 | 8/14/2015, 5:05:10 PM

Hi,

Normally a load failure is caused by one of the following:

  1. Your custom Predictive Object depends on other packages that are not available in your Predictive Service.
  2. Your custom Predictive Object depends on other files (say, your own *.py files that your function imports).
  3. Your custom Predictive Object requires a lot of memory and uses up all the memory available on the system.

We have improvements in the 1.5.2 Predictive Service that will help you diagnose load failures better. Would you mind upgrading GLC to 1.5.2 and redeploying your Predictive Service? That way we will have a better idea of what goes wrong when the Predictive Objects are loaded. If you decide to do so, please do not forget to terminate your current Predictive Service first (EC2 instances cost money!).
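For what it's worth, the teardown step would look roughly like this (a sketch only; I believe the call is terminate_service(), but please confirm against the docs before running it, since it shuts down the EC2 nodes behind the deployment):

>>> ps = gl.deploy.predictive_service.load("s3://.../mj-ps")
>>> ps.terminate_service()
>>> # then re-create the deployment with GLC 1.5.2 (e.g. via gl.deploy.predictive_service.create(...))
>>> # and re-deploy your models, as described in the user guide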

Sorry for all the inconvenience; we are working hard to make Predictive Service better!

Ping


User 481 | 8/17/2015, 7:12:40 PM

Hi Ping, thanks for the reply. Our custom Predictive Object does use some other packages which were written by us. We are already using GLC 1.5.2. Can you tell us how we can install Python packages on the Predictive Service EC2 instance once it is created? Thanks


User 1178 | 8/20/2015, 5:38:11 PM

Hi,

You may use the required_packages decorator for your Predictive Service. Check out the user guide here.

@graphlab.deploy.required_packages(['package-name1==package-version1', 'package-name2==package-version2'])
def my_custom_func(args):
  ...

It uses the pip package naming convention.

If, on the other hand, you are using other files, you may use the required_files decorator:

@graphlab.deploy.required_files(['file1.py', 'file2.py'])
def my_custom_func(args):
  ...
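Putting both decorators together, a redeploy might look roughly like this (a sketch; the package name, helper file, and the update()/apply_changes() calls are illustrative from memory, so double-check the exact deploy methods for your GLC version in the user guide):

import graphlab

# hypothetical pip dependency and local helper module used by the query function
@graphlab.deploy.required_packages(['some-package==1.0.0'])
@graphlab.deploy.required_files(['my_helpers.py'])
def my_custom_func(data):
    import my_helpers
    return my_helpers.recommend(data['users'], data['n'])

ps = graphlab.deploy.predictive_service.load("s3://.../mj-ps")
ps.update('tenant1-task-recommender', my_custom_func)   # or ps.add(...) for a brand-new object
ps.apply_changes()                                       # pushes the new version out to the EC2 nodes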

Let me know how it goes.

Thanks! Ping


User 1262 | 8/21/2015, 2:50:29 PM

Hi Ping, you mentioned that by using GLC 1.5.2 we can get a better idea of what goes wrong when loading the Predictive Object. I'm already using 1.5.2. Where can I find the object-loading logs so I can assess the problem?

Thanks!


User 1262 | 8/21/2015, 4:17:06 PM

Found the error in the server log. The problem seems to be a lack of space in the /tmp folder. Is there any way I can clear some space and/or understand why so much space is being used? (Are previous versions being stored there? Can they be deleted?)


User 15 | 8/21/2015, 6:15:26 PM

Hi @akrumholz,

Could you start a new thread for this topic so we can track it better? Thank you.