Java + Cloud + AI

Monday, September 14, 2020

AWS useful articles

Lambda with RDS

https://www.jeremydaly.com/reuse-database-connections-aws-lambda/

https://www.jeremydaly.com/manage-rds-connections-aws-lambda/

Good part:

database connection management recommendation

Added Value:

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-proxy.html

(RDS Proxy appears to act as a managed database connection pool.)

Wednesday, July 22, 2020

This is not about the detailed comparison between SQL and NOSQL. There are a lot of articles online already regarding this topic.
This is about my experience with SQL and NOSQL.
I primarily used Oracle, MySQL and some DB2 in the past. Nowadays I mainly use the Aurora SQL database and the DynamoDB NoSQL database.
What I learned:
Even though a lot of online talks recommend the single-table approach with DynamoDB for microservices, it is not easy to do, especially if a microservice has a relatively complex domain and goes through a lot of changes.

Case-insensitive search in DynamoDB is not straightforward; it takes extra effort compared to SQL. When data needs to be both encrypted and searchable, plan the data management carefully (for example, store a lower-cased copy or a hash of the value).
DynamoDB throttling can be a pain, depending on how you manage it and on whether your company is willing to pay more to avoid throttling.
DynamoDB global secondary indexes can become expensive if you have too many.
Some DynamoDB tricks mentioned online, for example using a fake hash key to enforce unique key constraints, or using the same column to store many different types of data, can become a nightmare for application maintenance. Not many developers will be able to understand the code easily.
The DynamoDB transaction APIs can provide great value to developers who come from the SQL world. But the transaction APIs are more expensive and relatively slow.
Even with DynamoDB, you may still need to manage some kind of relationships between entities, which is basically similar to what you would do with a SQL DB.
DynamoDB is really a dumb map. Some things you normally get from a SQL DB, for example a creation timestamp or a SEQUENCE, become a burden in application code.
DynamoDB Streams is actually a great feature. But when integrating with Lambda functions, the same record can be processed more than once in some cases.
A SQL database is not bad when it comes to handling large volumes. Facebook's use of MySQL is a good example. It takes good design, tuning, and maybe even customization.
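
To make the case-insensitive search point concrete, here is a minimal sketch of deriving a search attribute from a value by lower-casing it and then hashing it. The class and method names (SearchKeys, normalize, searchHash) are made up for illustration, and the idea of storing the result as an extra attribute next to the encrypted value is an assumption, not something prescribed by DynamoDB.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: derives a case-insensitive, exact-match search key.
class SearchKeys {

    // Trim and lower-case so "Alice@Example.com" and "alice@example.com" normalize to the same text.
    static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    // Hex-encoded SHA-256 of the normalized value. Supports exact-match lookups only,
    // not prefix or substring search.
    static String searchHash(String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalize(value).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Both spellings produce the same search key.
        System.out.println(searchHash("Alice@Example.com").equals(searchHash("alice@example.com")));
    }
}
```

The application would compute the same hash at write time and at query time. A plain hash of low-entropy values can be guessed by brute force, so a keyed hash may be a better fit for sensitive data.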

Friday, December 1, 2017

Machine Learning

Purpose
Machine Learning is to generalize.

Classic Problem
Normal Programming: "Hello world"
Machine Learning: MNIST

Approach
Problems --> Tools --> Metrics (apply to all problems?)
Data to generalize --> Use different algorithms --> Monitor performance of algorithms and adjust

Key Words
Classification
Discrete output
Regression
Continuous numeric output
Clustering

Gradient descent, Backpropagation, Cost function,
Cross-entropy
Any loss consisting of a negative log-likelihood between the empirical distribution
defined by the training set and the probability distribution defined by model. For example,
Mean Squared Error: cross-entropy between empirical distribution and a Gaussian model

Activation function
Step function
discrete 0, 1
Sigmoid function

Tanh function

Rectified Linear function (ReLU)

Exponential linear unit (ELU)

Training data set
Train parameter
Validation data set
Tune Hyperparameter
Test data set

Bias, Variance
Linked to capacity, underfitting, overfitting

Closed-form solution

Parameter
Learned
Weight, Bias
HyperParameter
Tuned
Learning rate
number of layers
number of neurons in each layer
number of iterations

Accuracy
Sensitivity
Specificity
F1-score
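
As a rough illustration of the four metrics above, the following sketch computes them from confusion-matrix counts for a binary classifier. The class name and the example counts are made up; the formulas are the standard definitions (sensitivity is the true positive rate, specificity is the true negative rate).

```java
// Hypothetical helper computing binary classification metrics from confusion-matrix counts.
class ClassificationMetrics {

    // Fraction of all predictions that were correct.
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    // True positive rate (also called recall): TP / (TP + FN).
    static double sensitivity(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // True negative rate: TN / (TN + FP).
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    // Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN).
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        int tp = 40, tn = 45, fp = 5, fn = 10; // made-up counts
        System.out.println("accuracy    = " + accuracy(tp, tn, fp, fn));
        System.out.println("sensitivity = " + sensitivity(tp, fn));
        System.out.println("specificity = " + specificity(tn, fp));
        System.out.println("f1          = " + f1(tp, fp, fn));
    }
}
```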

Kernel trick
Maximum likelihood estimation
Point estimate of variables

Bayesian estimation
Full distribution of variables

Optimization
Hill Climbing
One step along axis one time
Achieve Optimal solution for Convex problem
Problems: local maxima, ridges and alleys, plateau
Good for function complex and/or not differentiable

Gradient Descent
Vanishing/exploding gradients problems
approaches to solve: He initialization, Batch Normalization

Momentum

AdaGrad

RMSProp

Adam

Regularization
Modification to ML algorithms, intending to reduce generalization error, not training error
Example: weight decay for linear regression
Early stopping, L1, L2, Dropout, Max-Norm, Data Augmentation

Generalize
To have small gap between training error and test error
Supervised Learning
features + labels
Nonprobabilistic SL
K-Nearest Neighbor
Decision Tree
Unsupervised Learning
features without labels
Reinforcement Learning
Learning by getting feedback from the environment

Transformer:
Modify or filter data before feeding it to learning algorithms
Preprocessing
Feature selection
Feature extraction
Dimension reduction (PCA, manifold learning)
Kernel approximation

Cross-validation schemes
K-fold
Stratified K-fold
Leave-one-out (small amount of data)
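
A minimal sketch of how K-fold splitting assigns samples to folds, assuming samples are identified by index and are not shuffled first (real libraries usually offer shuffling and stratification). The class and method names are made up. Leave-one-out is the special case where k equals the number of samples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits sample indices 0..n-1 into k test folds.
class KFold {

    // Each index appears in exactly one fold. For fold f, the test set is folds.get(f)
    // and the training set is every other index.
    static List<List<Integer>> testFolds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("k must be between 2 and n");
        }
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(i); // round-robin assignment, no shuffling
        }
        return folds;
    }

    public static void main(String[] args) {
        System.out.println(testFolds(10, 3)); // 3-fold split of 10 samples
        System.out.println(testFolds(4, 4));  // leave-one-out on 4 samples
    }
}
```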

Dimension Reduction
PCA
KPCA
LLE

Math behind ML
$z = wx + b$
$\sigma(z) = \frac{1}{1 + e^{-z}}$
...
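
The linear combination z = wx + b and the sigmoid can be sketched in a few lines of Java, together with the other activation functions listed under Key Words. The class name is made up, the step function's value at exactly 0 is a convention chosen here, and the ELU alpha parameter is passed in explicitly.

```java
// Hypothetical helper with the activation functions named in this post.
class Activations {

    // Discrete output: 1 for z >= 0, else 0 (the value at z == 0 is a convention).
    static double step(double z) {
        return z >= 0 ? 1.0 : 0.0;
    }

    // sigma(z) = 1 / (1 + e^(-z)), output in (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Hyperbolic tangent, output in (-1, 1).
    static double tanh(double z) {
        return Math.tanh(z);
    }

    // Rectified linear unit: max(0, z).
    static double relu(double z) {
        return Math.max(0.0, z);
    }

    // Exponential linear unit: z for z >= 0, alpha * (e^z - 1) otherwise.
    static double elu(double z, double alpha) {
        return z >= 0 ? z : alpha * (Math.exp(z) - 1.0);
    }

    // A single neuron with one input: sigmoid(w * x + b).
    static double neuron(double w, double x, double b) {
        return sigmoid(w * x + b);
    }

    public static void main(String[] args) {
        System.out.println(neuron(2.0, 1.0, -2.0)); // z = 0, so the sigmoid is at its midpoint
    }
}
```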

Concepts
Model--Train--Evaluate--Predict
Classification, Regression, Clustering, Dimension reduction

Algorithms
Linear Regression
Find optimal weights by solving normal equations
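
For a single feature plus an intercept, solving the normal equations reduces to a well-known closed form, sketched below. The class name and the sample data are made up; with several features you would solve the matrix form with a linear algebra library instead.

```java
// Hypothetical helper: one-feature least squares, y ~ slope * x + intercept.
class SimpleLinearRegression {

    // Returns {slope, intercept}:
    //   slope = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
    //   intercept = meanY - slope * meanX
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx; // undefined if all x values are identical
        return new double[] {slope, meanY - slope * meanX};
    }

    public static void main(String[] args) {
        // Made-up data lying exactly on y = 2x + 1.
        double[] params = fit(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9});
        System.out.println("slope=" + params[0] + ", intercept=" + params[1]);
    }
}
```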

Logistic Regression
No closed-form solution. Maximizing the log-likelihood, or minimizing the negative log-likelihood using gradient descent.
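
A minimal sketch of minimizing the negative log-likelihood with batch gradient descent, for one feature. The class name, learning rate, iteration count and data are all made up for illustration. The gradient used is the standard one for the mean negative log-likelihood: the average of (sigmoid(w*x + b) - y) times x for w, and the average of (sigmoid(w*x + b) - y) for b.

```java
// Hypothetical one-feature logistic regression trained by batch gradient descent.
class LogisticRegression1D {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Returns {w, b} after a fixed number of gradient descent steps.
    static double[] fit(double[] x, int[] y, double learningRate, int iterations) {
        double w = 0.0, b = 0.0;
        int n = x.length;
        for (int it = 0; it < iterations; it++) {
            double gradW = 0.0, gradB = 0.0;
            for (int i = 0; i < n; i++) {
                double error = sigmoid(w * x[i] + b) - y[i]; // prediction minus label
                gradW += error * x[i];
                gradB += error;
            }
            w -= learningRate * gradW / n;
            b -= learningRate * gradB / n;
        }
        return new double[] {w, b};
    }

    // Predicted probability of class 1.
    static double predict(double[] params, double x) {
        return sigmoid(params[0] * x + params[1]);
    }

    public static void main(String[] args) {
        double[] x = {-2, -1, 1, 2};
        int[] y = {0, 0, 1, 1};
        double[] params = fit(x, y, 0.5, 1000);
        System.out.println("P(y=1 | x=2)  = " + predict(params, 2));
        System.out.println("P(y=1 | x=-2) = " + predict(params, -2));
    }
}
```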

Neural Network
RNN (Recurrent Neural Network)
CNN (Convolutional Neural Network)

Decision Tree

Identification Tree

Naive Bayes
Features independent of each other
Conditional Probability Model
Highly scalable, only requires small amount of training data
Linear Performance Time
Generally outperformed by other algorithms, SVM...

Support Vector Machines
For both classification and regression
Widest street to separate instances of different classes

Random Forest
Decision Tree ensemble

Test Methodologies
Leave one out LOO
for small amount of data

Data split (80/20)

Practical Guidelines for DNN
Initialization He
Activation ELU
Normalization Batch Normalization
Regularization Dropout
Optimizer Adam
Learning Rate Schedule None

Software
Tensorflow, Scikit-learn
Spark MLlib, Spark ML, Weka

Use cases
Linear Regression
House size---> House price in a community

Naive Bayes
Document classification: separate legitimate emails from spam emails
For example, based on key words: cheap, free

Questions
When to use which algorithm(s)?

Classic Applications
AlphaGo vs Lee Sedol
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Autonomous Car

Netflix movie recommendations

Image recognition

Natural language processing

Summary
No ML algorithm is universally better than any other algorithm.
Understand data distribution, and pick proper algorithm(s).

References

Books
(One of my favorite books, highly recommended)
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

  Algorithms for Reinforcement Learning

Deep Learning (Adaptive Computation and Machine Learning series)

http://neuralnetworksanddeeplearning.com/

  TensorFlow

  scikit-learn

  https://www.kaggle.com/

  Reinforcement Learning - David Silver

http://www.wildml.com/

  https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/

Machine learning series from Luis Serrano (Very good explanations for beginners)
  https://www.youtube.com/watch?v=aDW44NPhNw0
  https://www.youtube.com/watch?v=BR9h47Jtqyw&t=24s
  https://www.youtube.com/watch?v=2-Ol7ZB0MmU&t=7s
  https://www.youtube.com/watch?v=IpGxLWOIZy4

  http://scikit-learn.org/stable/tutorial/machine_learning_map/

  https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf

  https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf

(AWS machine learning service)
  https://aws.amazon.com/blogs/aws/sagemaker/

(Spark MLlib example)
https://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf

  https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/

https://biomedical-engineering-online.biomedcentral.com/articles/10.1186/s12938-017-0378-z

  https://iknowfirst.com/rsar-machine-learning-trading-stock-market-and-chaos

  Mastering the game of Go without human knowledge

Tuesday, January 31, 2017

String valueOf() pitfalls

What will this program print to the console?

public class TestStringValueOf {

    public static void main(String[] args) {
        testStringValueOfChar();
    }

    // Compares the results of repeated String.valueOf() calls by reference (==).
    public static void testStringValueOfChar() {
        char a = 'a';
        String str1 = String.valueOf(a);
        String str2 = String.valueOf(a);
        System.out.println("char comparison:" + (str1 == str2));

        double d = 12.3d;
        String str3 = String.valueOf(d);
        String str4 = String.valueOf(d);
        System.out.println("double comparison:" + (str3 == str4));

        boolean b = false;
        String str5 = String.valueOf(b);
        String str6 = String.valueOf(b);
        System.out.println("boolean comparison:" + (str5 == str6));

        Object o = null;
        String str7 = String.valueOf(o);
        String str8 = String.valueOf(o);
        System.out.println("Object null comparison:" + (str7 == str8));

        Object notNull = new Object();
        String str9 = String.valueOf(notNull);
        String str10 = String.valueOf(notNull);
        System.out.println("Object Not null comparison:" + (str9 == str10));
    }
}

See the end of this article for the output.

Overall, string comparison should use 'equals' no matter how the String objects were created. The '==' operator only checks whether two references point to the same object.

-------console output----------

char comparison:false
double comparison:false
boolean comparison:true
Object null comparison:true
Object Not null comparison:false
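
The boolean and null cases come out as true because String.valueOf returns the string literals "true", "false" and "null" for them, and literals are shared. The other calls build a new String each time. A minimal sketch of the safe comparison follows; the class and method names are made up.

```java
import java.util.Objects;

// Hypothetical demo: compare String content, not references.
class StringEqualsDemo {

    // Null-safe content comparison.
    static boolean sameContent(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String s1 = String.valueOf(12.3d);
        String s2 = String.valueOf(12.3d);
        // equals() compares the characters, so this is true even though
        // s1 and s2 may be different objects.
        System.out.println("equals: " + s1.equals(s2));
        System.out.println("null-safe: " + sameContent(null, s1));
    }
}
```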

Monday, July 18, 2016

Spring MVC UTF-8

Key points

web.xml:
<filter>
    <filter-name>encodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>encodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

Maven pom.xml:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
...
</properties>

JSP:
<%@ page language="java" pageEncoding="UTF-8"%>
<%@ page contentType="text/html;charset=UTF-8" %>
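
To see why every layer has to agree on UTF-8, here is a small standalone sketch (plain Java, no Spring) of what happens when text is encoded as UTF-8 but decoded as ISO-8859-1, which is the kind of mismatch the settings above are meant to prevent. The class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of an encoding mismatch between writer and reader.
class EncodingMismatchDemo {

    // Encodes the text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an accented e
        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // UTF-8 bytes read as ISO-8859-1: the accented character turns into two wrong characters.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```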

Friday, June 17, 2016

Compile xsl files and store in cache to improve XSLT performance

Here is the code commonly found online for doing an XSLT transformation (non-essential pieces removed for brevity).

------------------

TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(new StreamSource(new File(xsltPath)));
transformer.transform(new StreamSource(new File(sourceFilePath)), new StreamResult(new File(resultPath)));

----------------

The code works. But if an XSLT file is relatively big and needs to be used over and over again to transform a lot of files, for example in batch mode, it may not perform well.

The following shows a way to cache the compiled version of an xsl file, which is a 'Templates' object. This object is thread safe.

Code snippet to cache the 'Templates' object.

// Compiled stylesheets, keyed by xsl file path.
static final Map<String, Templates> cacheTemplates = new ConcurrentHashMap<String, Templates>();

static TransformerFactory transformFactory = null;

static {
    init();
}

private static void init() {
    try {
        transformFactory = TransformerFactory.newInstance();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

public static void cacheCompiled(String xsl) {
    try {
        StreamSource source = new StreamSource(new File(xsl));
        // Create this once per file and save it in the cache.
        Templates templates = transformFactory.newTemplates(source);
        cacheTemplates.put(xsl, templates);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

The above 'templates' object is basically a compiled version of the original xsl file. If the original file is relatively big, for example 20KB, it takes more than 2 seconds on my local machine to transform a small file. Without caching the templates, it takes more than 2 seconds every time. With caching, it takes about 0.1 seconds for every transformation after the first one.

The basic code is like this:

//get the Templates object from cache based on the xsl file name, then get a Transformer object

Transformer transformer = templates.newTransformer();

transformer.transform(new StreamSource(new File(sourceFilePath)),
new StreamResult(new File(resultPath)));

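The 'transformer' object mentioned above is not thread safe, so create a new one from the cached 'Templates' object for each transformation instead of sharing it between threads.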
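
Putting the pieces together, here is a self-contained sketch of the caching idea. To keep it runnable without files, it compiles the stylesheet from a String rather than from a File, and the class name, method names and the tiny demo stylesheet are made up. Access to the factory is synchronized because TransformerFactory is not documented as thread safe.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: compile each stylesheet once, create a Transformer per call.
class CachedXslt {

    static final String DEMO_XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/greeting\">Hello, <xsl:value-of select=\"@name\"/>!</xsl:template>"
        + "</xsl:stylesheet>";

    private static final TransformerFactory FACTORY = TransformerFactory.newInstance();
    private static final Map<String, Templates> CACHE = new ConcurrentHashMap<>();

    // Returns the cached Templates for the key, compiling the stylesheet on first use.
    static Templates compiled(String key, String xslText) {
        return CACHE.computeIfAbsent(key, k -> {
            try {
                synchronized (FACTORY) {
                    return FACTORY.newTemplates(new StreamSource(new StringReader(xslText)));
                }
            } catch (TransformerException e) {
                throw new IllegalStateException("Cannot compile stylesheet " + k, e);
            }
        });
    }

    // Creates a fresh Transformer for every call because Transformer is not thread safe.
    static String transform(String key, String xslText, String xml) {
        try {
            Transformer transformer = compiled(key, xslText).newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("demo", DEMO_XSL, "<greeting name=\"World\"/>"));
    }
}
```

Only the first call for a given key pays the compilation cost; later calls reuse the cached 'Templates' object.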

The Saxon XSLT processor seems to be becoming more popular, while Xalan seems to be fading away.

The Home Edition of Saxon, which is free, may be good enough for a lot of applications.

Friday, November 15, 2013

First impressions on open source ESBs

I used commercial ESBs and BPMs for a couple of years, and recently had a chance to evaluate some open source ESBs.

WSO2: not easy to use; I had difficulty even getting the sample projects to work. No DataMapper tool, which is a big no-no for my projects.

Mulesoft ESB: nice documentation, instructions that are easy to follow, sample projects that can be built and run in a couple of minutes, and a nice DataMapper tool in the 3.4 version. I have not had a chance to build a relatively complex application with it, and I am not sure whether the community edition is good enough to be used in production.