In this video, I'll show how to start the SAS Developer trial
and launch a Jupyter notebook.
In the notebook, I'll use Python to invoke SAS data management
and analytics capabilities on SAS Viya.
As a Python developer, this environment
should be very comfortable to you.
To begin, from the Getting Started:
SAS Viya Developer page, click the Get
Started with Python, R, and SAS API via Jupyter Notebook link.
On the next page, click Start My Trial Now.
The folders shown here contain program examples broken out
by language: SAS, Python, and R. The examples
are here to help you start exploring this environment.
Although uploading your own data is not
enabled in this trial environment,
you have access to sample data for a variety of examples,
including banking, sales, and movie ratings.
Note that many of these examples are available on the SAS GitHub
page.
A link to these examples is on the main SAS Developer Trial
page, as well as here in the notebook.
Let's open a notebook and look at an example that
uses the hmeq data set.
This is a banking example in which
you will use SAS to determine which cases are bad credit risks.
In this program, we import the Python packages,
create a session with the SAS Cloud Analytics Services
(or CAS) server, and then load the data and explore it.
To prepare the data, we impute the missing values
and partition the data into training and validation data
sets.
Then we build several models, assess them,
and compare the results.
Notice there's a handy link to the documentation.
Refer to the documentation to help you
understand the SAS Python APIs for the CAS actions.
A CAS action is the smallest unit
of work for the CAS server.
CAS actions are analogous to Python functions.
CAS actions are organized into groups
called action sets, which are analogous to Python packages.
Let's review and submit each code block
as we go through the program.
In the first code block, we load the Python packages
that are needed, as well as assign variables
that we need for our modeling.
Next, we start a CAS session and load
the action sets that we use in this program.
You need to load an action set before you can call the CAS
actions contained within it.
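The session steps just described can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: it assumes the SWAT package (`swat`), the host name and port are placeholders, authentication arguments are omitted because they depend on your environment, and the `cas_session` helper and its `connect` parameter are hypothetical names introduced only for illustration.

```python
from contextlib import contextmanager


@contextmanager
def cas_session(hostname, port, action_sets, connect=None):
    """Open a CAS session, load the requested action sets, and always close it.

    `connect` is a hypothetical hook for supplying a stand-in connection class;
    when it is omitted, the real swat.CAS class is used.
    """
    if connect is None:
        import swat  # imported lazily so the helper can be defined without SWAT installed
        connect = swat.CAS
    conn = connect(hostname, port)
    try:
        # An action set must be loaded before its actions can be called.
        for name in action_sets:
            conn.loadactionset(name)
        yield conn
    finally:
        # Closing the session releases server resources.
        conn.close()
```

With a reachable server, usage might look like `with cas_session("cas.example.com", 5570, ["sampling", "decisionTree"]) as conn: ...`, where the host and port are placeholders and the action set names are examples.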
Here, we load the data into CAS.
Now let's explore the data.
We have 11 numeric variables, and our target variable
is a binary variable BAD, which indicates
whether a loan is good or bad.
Here we'll look at the descriptive statistics
of the numeric variables.
Next, we look at the cardinality,
or the number of distinct values,
and build a graph of the missingness.
The graph shows that we have missing values
for every variable.
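To make the two terms concrete, here is a small pure-Python sketch of cardinality and missingness. It does not call CAS and is not the notebook's code. BAD is the target named in the video; the helper name `summarize_columns`, the other column names, and every value are assumptions made up for illustration.

```python
def summarize_columns(rows):
    """Return {column: (n_distinct, pct_missing)} for a list of dict rows."""
    summary = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        n_distinct = len(set(present))  # cardinality ignores missing values
        pct_missing = 100.0 * (len(values) - len(present)) / len(values)
        summary[col] = (n_distinct, pct_missing)
    return summary


# Made-up rows; None marks a missing value.
rows = [
    {"BAD": 1, "JOB": "Other", "DEBTINC": None},
    {"BAD": 0, "JOB": None, "DEBTINC": 34.8},
    {"BAD": 0, "JOB": "Office", "DEBTINC": 34.8},
    {"BAD": 1, "JOB": "Other", "DEBTINC": None},
]
summary = summarize_columns(rows)
for col, (n_distinct, pct_missing) in summary.items():
    print(f"{col}: {n_distinct} distinct, {pct_missing:.0f}% missing")
```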
To use the data effectively with some of our algorithms,
we impute the missing values,
using different imputation methods
for different variables.
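The video does not say which imputation method goes with which variable, so the following is only a sketch under an assumption: median for a numeric column and mode for a categorical one. It is plain Python rather than a CAS action, and the `impute` helper and the sample values are hypothetical.

```python
from statistics import median, mode


def impute(values, method):
    """Replace None entries with the median or mode of the non-missing values."""
    present = [v for v in values if v is not None]
    if method == "median":
        fill = median(present)
    elif method == "mode":
        fill = mode(present)
    else:
        raise ValueError(f"unsupported method: {method}")
    return [fill if v is None else v for v in values]


# Made-up columns with gaps.
debt_ratio = [30.0, None, 40.0, 35.0, None]
job = ["Other", None, "Office", "Other"]

debt_ratio_filled = impute(debt_ratio, "median")
job_filled = impute(job, "mode")
```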
Now that we have a complete data set with no missing values,
we partition the data into training and validation data
sets using stratified sampling with respect
to our target variable.
In this table, we can see that we've
divided the data into a 70/30 split
within each level of the target variable BAD.
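Stratified sampling means the 70/30 split is applied separately inside each level of the target, so both partitions keep the same proportion of bad loans. A minimal pure-Python sketch of that idea follows; it is not the CAS sampling action, and the `stratified_partition` helper, the seed, and the 20/80 class mix are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_partition(rows, target, train_fraction=0.7, seed=0):
    """Return a list of flags (1 = training, 0 = validation), split per target level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for index, row in enumerate(rows):
        by_level[row[target]].append(index)

    flags = [0] * len(rows)
    for indices in by_level.values():
        rng.shuffle(indices)  # random order within this level only
        n_train = round(train_fraction * len(indices))
        for index in indices[:n_train]:
            flags[index] = 1
    return flags


# Made-up data: 20 bad loans and 80 good loans.
rows = [{"BAD": 1}] * 20 + [{"BAD": 0}] * 80
flags = stratified_partition(rows, "BAD")

train_bad = sum(f for f, r in zip(flags, rows) if r["BAD"] == 1)
train_good = sum(f for f, r in zip(flags, rows) if r["BAD"] == 0)
```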
Next, we build a decision tree model.
We also score the current data now using this model
by calling the score action for the decision tree.
Because we've saved the model as "tree model" here,
we could score new observations at a later point
using this model as well.
We also run a SAS program to add columns to the data table
for predicted probabilities of each event, which
will be used for comparing with other models
later in this program.
We submit similar code for forest,
gradient boosting machine, and neural network models.
Note that you can edit the code to configure the modeling
algorithm options as desired.
For example, let's change the number of hidden neurons
here and run this code block.
Now we assess the models using the assess action.
We also define a Python function to make it easier
to assess all four of the models with the same code.
Having assessed all four models, we
can use the assessment statistics
to create a receiver operating characteristic, or ROC,
plot and a Lift plot.
These are standard graphs for visually depicting the accuracy
and effectiveness of a model.
First, we print the area under the ROC curve for each model.
The higher the area under the curve the better,
so we can see that the Forest model outperformed
the other three.
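For readers who want to see what the area under the ROC curve measures, here is a small sketch that computes it directly as the probability that a randomly chosen event is scored higher than a randomly chosen non-event, with ties counted as one half. This is not the CAS assess action; the `auc` helper, the model names, and the scores are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of events and non-events."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 1, 0]
# Made-up predicted probabilities for two hypothetical models.
model_scores = {
    "model_a": [0.9, 0.2, 0.8, 0.4, 0.7, 0.1],
    "model_b": [0.9, 0.8, 0.3, 0.4, 0.7, 0.1],
}

# The same function is applied to every model, then the highest AUC is picked.
results = {name: auc(labels, scores) for name, scores in model_scores.items()}
best = max(results, key=results.get)
```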
And now, let's draw the ROC curves and Lift plots.
The ROC plot also reflects that the Forest model is the best
with the highest curve.
On the Lift chart, we see that the Forest model
has a lift as high as 5, and at the first decile
the cumulative lift is just over 3.5.
It is also fairly consistent across
the different algorithms.
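Cumulative lift at a given depth is the event rate among the top-scored cases divided by the event rate in the whole data set. The sketch below shows that arithmetic in plain Python; the `cumulative_lift` helper and the 20 made-up cases are assumptions for illustration, and the numbers are unrelated to the results in the video.

```python
def cumulative_lift(labels, scores, fraction):
    """Event rate in the top `fraction` of cases by score, relative to the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate


# 20 made-up cases with 4 events, already listed from highest to lowest score.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [1 - i / 20 for i in range(20)]

lift_top_10_pct = cumulative_lift(labels, scores, 0.10)
lift_top_20_pct = cumulative_lift(labels, scores, 0.20)
```

A lift of 1 means the model does no better than picking cases at random, so values well above 1 in the first deciles indicate that the model concentrates events near the top of the ranking.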
When you're done making calls to CAS actions in your session,
it's a best practice to close your session.
In a non-trial environment, closing the session
releases resources for others to use.
Here in the trial environment, you have access to the system
only for the duration of your session
and no information is saved on the system.
When you come back, you start with a fresh session.
You do have the option to download notebooks for use
at a later time.
In this video, we performed some basic modeling tasks
using Python.
We encourage you to expand upon the examples that
are provided in the trial and explore the environment.
For more information, please visit us at developer.sas.com.