Monday, September 14, 2020

AWS useful articles

 Lambda with RDS 

     https://www.jeremydaly.com/reuse-database-connections-aws-lambda/

     https://www.jeremydaly.com/manage-rds-connections-aws-lambda/

 Good part:

     Recommendations for managing database connections from Lambda

 Added Value:

    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-proxy.html

   (RDS Proxy seems to be a managed DB connection pool.)

Wednesday, July 22, 2020

SQL vs NOSQL (DynamoDB)

This is not a detailed comparison between SQL and NOSQL.  There are already a lot of articles online on that topic.
 This is about my experience with SQL and NOSQL.
 In the past I primarily used Oracle, MySQL and some DB2.  Nowadays I mainly use the Aurora SQL database and the DynamoDB NOSQL database.
 What I learned:
    Even though a lot of online talks recommend a single-table approach with DynamoDB for microservices, it is not easy to do, especially if a microservice has a relatively complex domain and goes through a lot of changes.
    DynamoDB case-insensitive search is not straightforward; it takes extra effort compared to SQL. When data needs to be both encrypted and searchable, be careful about how the data is stored (for example, store a lower-case copy or a hash of the value to search on).
    DynamoDB throttling can be a pain, depending on how you manage capacity and whether your company is willing to pay more to avoid throttling.
    DynamoDB global secondary indexes can become expensive if you have too many.
    Some DynamoDB tricks mentioned online, for example using a fake hash key to enforce unique key constraints, or using the same column to store many different types of data, can become a nightmare for application maintenance.  Not many developers will be able to understand the code easily.
    DynamoDB transaction APIs can provide great value to developers who come from the SQL world. But the transaction APIs are more expensive and relatively slow.
    Even with DynamoDB, you may still need to manage some kind of relationships between entities, which is basically similar to what you would do with a SQL DB.
    DynamoDB is really just a dumb map.  Some things you can normally do in a SQL DB, for example creation timestamps or a SEQUENCE, become a burden in application code.
    DynamoDB Streams is actually a great feature. But when integrating with Lambda functions, the same record can be processed more than once in some cases.
    SQL databases are not bad at handling large volumes. Facebook's use of MySQL is a good example. It takes good design, tuning, and maybe even customization.
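
To make the "use lower case, use hash" point concrete, here is a minimal sketch of deriving a searchable attribute in application code before writing an item. The class and method names are hypothetical, and plain SHA-256 is only an assumption for illustration; for sensitive data a keyed hash may be a better fit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: builds the value stored in a separate "search" attribute.
class SearchKeySketch {

    // Normalize so that "Alice@Example.com" and "alice@example.com" match.
    static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    // Hash the normalized value so the search attribute does not hold plain text.
    static String hashedSearchKey(String value) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalize(value).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both calls print the same key, so an exact-match query finds the item.
        System.out.println(hashedSearchKey("Alice@Example.com"));
        System.out.println(hashedSearchKey("  alice@example.COM "));
    }
}
```

The same function has to be applied to the search term at query time, and every writer must agree on the normalization rules, which is part of the extra effort compared to SQL.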

 

Friday, December 1, 2017

Machine Learning

Purpose
    Machine Learning is to generalize.

Classic Problem
    Normal Programming: "Hello world"
    Machine Learning:  MNIST

Approach
    Problems --> Tools --> Metrics  (applies to all problems?)
    Data to generalize --> Use different algorithms --> Monitor performance of algorithms and adjust

Key Words
     Classification
        Discrete output
     Regression
        Continuous numeric output
    Clustering
 
     Gradient descent, Backpropagation, Cost function,
     Cross-entropy
           Any loss consisting of a negative log-likelihood between the empirical distribution
           defined by the training set and the probability distribution defined by model. For example,
           Mean Squared Error: cross-entropy between empirical distribution and a Gaussian model

     Activation function
           Step function
                discrete 0, 1
           Sigmoid function
             
           Tanh function

           Rectified Linear function (ReLU)

           Exponential linear unit (ELU)
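
The activation functions listed above can be sketched in a few lines. This is only an illustration: the step threshold at 0 and the ELU parameter name alpha are assumptions, not something taken from a particular library.

```java
// Minimal sketch of common activation functions.
class ActivationSketch {

    // Step: discrete 0 or 1 (threshold at 0 is an assumption).
    static double step(double z) {
        return z >= 0.0 ? 1.0 : 0.0;
    }

    // Sigmoid: squashes z into (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Tanh: squashes z into (-1, 1).
    static double tanh(double z) {
        return Math.tanh(z);
    }

    // ReLU: zero for negative input, identity otherwise.
    static double relu(double z) {
        return Math.max(0.0, z);
    }

    // ELU: identity for z >= 0, alpha * (e^z - 1) for negative input.
    static double elu(double z, double alpha) {
        return z >= 0.0 ? z : alpha * (Math.exp(z) - 1.0);
    }

    public static void main(String[] args) {
        for (double z : new double[] {-2.0, 0.0, 2.0}) {
            System.out.printf("z=%.1f step=%.1f sigmoid=%.4f tanh=%.4f relu=%.1f elu=%.4f%n",
                    z, step(z), sigmoid(z), tanh(z), relu(z), elu(z, 1.0));
        }
    }
}
```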

     Training data set
           Train parameter
     Validation data set
           Tune Hyperparameter
     Test data set
   
     Bias, Variance
         Linked to capacity, underfitting, overfitting

     Closed-form solution

     Parameter
            Learned
            Weight, Bias
     HyperParameter
           Tuned
           Learning rate
           number of layers
           number of neurons in each layer
           number of iterations
       
     Accuracy
     Sensitivity
     Specificity
     F1-score
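
For a binary classifier, these four metrics can all be computed from the confusion-matrix counts. The sketch below assumes the usual definitions; the counts in main are made-up numbers for illustration only.

```java
// Minimal sketch: metrics from true/false positive/negative counts.
class ClassificationMetricsSketch {

    // Fraction of all predictions that are correct.
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    // Sensitivity (recall): fraction of actual positives that are found.
    static double sensitivity(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Specificity: fraction of actual negatives that are correctly rejected.
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    // F1-score: harmonic mean of precision and recall.
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        int tp = 40, tn = 45, fp = 5, fn = 10; // hypothetical counts
        System.out.printf("accuracy=%.4f sensitivity=%.4f specificity=%.4f f1=%.4f%n",
                accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp), f1(tp, fp, fn));
    }
}
```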

    Kernel trick
    Maximum likelihood estimation
             Point estimate of variables

     Bayesian estimation
             Full distribution of variables
 
    Optimization
         Hill Climbing
              One step along one axis at a time
              Achieves the optimal solution for convex problems
              Problems: local maxima, ridges and alleys, plateaus
              Good for functions that are complex and/or not differentiable

         Gradient Descent
             Vanishing/exploding gradients problems
             approaches to solve: He initialization, Batch Normalization

         Momentum

          AdaGrad

          RMSProp

          Adam
       
    Regularization
        Modification to ML algorithms, intending to reduce generalization error, not training error
        Example: weight decay for linear regression
        Early stopping, L1, L2, Dropout, Max-Norm, Data Augmentation

     Generalize
            To have small gap between training error and test error
     Supervised Learning
            features + labels
            Nonprobabilistic SL
                  K-Nearest Neighbor
             Decision Tree
     Unsupervised Learning
            features without labels
     Reinforcement Learning
             Learning by getting feedback from the environment

     Transformer:
         Modify or filter data before feeding it to learning algorithms
         Preprocessing
         Feature selection
         Feature extraction
         Dimension reduction (PCA, manifold learning)
         Kernel approximation

    Cross-validation schemes
         K-fold
         Stratified K-fold
         Leave-one-out (small amount of data)
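
As a rough illustration of K-fold, the sketch below splits sample indices 0..n-1 into k contiguous test folds; each fold is used once for evaluation while the remaining samples are used for training. It does no shuffling and no stratification, and the class and method names are hypothetical. Leave-one-out is the special case k = n.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of plain K-fold index splitting (no shuffling, no stratification).
class KFoldSketch {

    // Returns k arrays of test indices that together cover 0..n-1 exactly once.
    static List<int[]> testFolds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<int[]> folds = new ArrayList<>();
        int start = 0;
        for (int fold = 0; fold < k; fold++) {
            // The first n % k folds get one extra sample.
            int size = n / k + (fold < n % k ? 1 : 0);
            int[] test = new int[size];
            for (int i = 0; i < size; i++) {
                test[i] = start + i;
            }
            folds.add(test);
            start += size;
        }
        return folds;
    }

    public static void main(String[] args) {
        for (int[] test : testFolds(10, 3)) {
            System.out.println("test fold: " + Arrays.toString(test));
        }
    }
}
```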

    Dimension Reduction
         PCA
         KPCA
         LLE
 
Math behind ML
    z = wx + b
    σ(z) = 1 / (1 + e^(-z))
    ...


Concepts
     Model--Train--Evaluate--Predict
     Classification, Regression, Clustering, Dimension reduction

Algorithms
    Linear Regression
        Find optimal weights by solving normal equations
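
For a single feature plus an intercept, solving the normal equations reduces to the familiar closed form: slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x). The sketch below assumes that one-feature case; the data in main is made up so that the exact answer is y = 1 + 2x.

```java
// Minimal sketch: closed-form least squares for y = intercept + slope * x.
class SimpleLinearRegressionSketch {

    // Returns {intercept, slope}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4}; // hypothetical feature values
        double[] y = {3, 5, 7, 9}; // hypothetical targets
        double[] w = fit(x, y);
        System.out.printf("intercept=%.4f slope=%.4f%n", w[0], w[1]);
    }
}
```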

    Logistic Regression
         No closed-form solution. Maximizing the log-likelihood, or minimizing the negative log-likelihood using gradient descent.

    Neural Network
    RNN (Recurrent Neural Network)
    CNN (Convolutional Neural Network)

    Decision Tree

    Identification Tree

    Naive Bayes
           Features independent of each other
           Conditional Probability Model
           Highly scalable, only requires a small amount of training data
           Linear training time
           Generally outperformed by other algorithms, SVM...

    Support Vector Machines
           For both classification and regression
           Widest street to separate instances of different classes

    Random Forest
         Decision Tree ensemble

Test Methodologies
   Leave one out   LOO
       for small amount of data

   Data split (80/20)
     
Practical Guidelines for DNN
   Initialization                        He
   Activation                           ELU
   Normalization                     Batch Normalization
   Regularization                    Dropout
   Optimizer                           Adam
   Learning Rate Schedule     None

Software
    Tensorflow, Scikit-learn
    Spark MLLib, Spark ML,  Weka,


Use cases
    Linear Regression
          House size---> House price in a community
 
    Naive Bayes
          Document classification: separate legitimate emails from spam emails
          For example, based on key words: cheap, free
     
Questions
       When to use which algorithm(s)?

Classic Applications
       AlphaGo vs Lee Sedol
       https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

       Autonomous Car

       Netflix movie recommendations

       Image recognition

       Natural language processing

Summary
     No ML algorithm is universally better than any other algorithm.
     Understand data distribution, and pick proper algorithm(s).

References
 
    Books
          (One of my favorite books, highly recommended)
           Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

          Algorithms for Reinforcement Learning

           Deep Learning (Adaptive Computation and Machine Learning series)

           http://neuralnetworksanddeeplearning.com/

    TensorFlow

    scikit-learn

    https://www.kaggle.com/

    Reinforcement Learning - David Silver

   http://www.wildml.com/

    https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/

    Machine learning series from Luis Serrano  (Very good explanations for beginners)
    https://www.youtube.com/watch?v=aDW44NPhNw0
    https://www.youtube.com/watch?v=BR9h47Jtqyw&t=24s
    https://www.youtube.com/watch?v=2-Ol7ZB0MmU&t=7s
    https://www.youtube.com/watch?v=IpGxLWOIZy4

    http://scikit-learn.org/stable/tutorial/machine_learning_map/

    https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/
         
    https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf

   https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf

    https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf

    (AWS machine learning service)
    https://aws.amazon.com/blogs/aws/sagemaker/

    (Spark MLlib example)
    https://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf

    https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/

    https://biomedical-engineering-online.biomedcentral.com/articles/10.1186/s12938-017-0378-z

    https://iknowfirst.com/rsar-machine-learning-trading-stock-market-and-chaos

    Mastering the game of Go without human knowledge

 

Tuesday, January 31, 2017

String valueOf() pitfalls

What will the console output of this program be?


public class TestStringValueOf {

    public static void main(String[] args) {
        testStringValueOfChar();
    }

    public static void testStringValueOfChar() {
        char a = 'a';
        String str1 = String.valueOf(a);
        String str2 = String.valueOf(a);
        System.out.println("char comparison:" + (str1 == str2));

        double d = 12.3d;
        String str3 = String.valueOf(d);
        String str4 = String.valueOf(d);
        System.out.println("double comparison:" + (str3 == str4));

        boolean b = false;
        String str5 = String.valueOf(b);
        String str6 = String.valueOf(b);
        System.out.println("boolean comparison:" + (str5 == str6));

        Object o = null;
        String str7 = String.valueOf(o);
        String str8 = String.valueOf(o);
        System.out.println("Object null comparison:" + (str7 == str8));

        Object notNull = new Object();
        String str9 = String.valueOf(notNull);
        String str10 = String.valueOf(notNull);
        System.out.println("Object Not null comparison:" + (str9 == str10));
    }
}

See the end of this article for the output.

Overall, string comparison should use 'equals' no matter how the String objects were created.
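
A minimal sketch of the safer comparison, for illustration. The class name and the sameText helper are hypothetical; Objects.equals is used so that null values do not cause a NullPointerException.

```java
import java.util.Objects;

// Minimal sketch: compare String content with equals instead of ==.
class StringEqualsSketch {

    // Null-safe content comparison.
    static boolean sameText(String left, String right) {
        return Objects.equals(left, right);
    }

    public static void main(String[] args) {
        String str1 = String.valueOf('a');
        String str2 = String.valueOf('a');
        // Content comparison does not depend on whether the two
        // references happen to point at the same object.
        System.out.println("char equals:" + str1.equals(str2));

        String str3 = String.valueOf(12.3d);
        String str4 = String.valueOf(12.3d);
        System.out.println("double equals:" + sameText(str3, str4));

        System.out.println("null-safe:" + sameText(null, str1));
    }
}
```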


-------console output----------

char comparison:false
double comparison:false
boolean comparison:true
Object null comparison:true
Object Not null comparison:false

Why: '==' compares object references, not content. String.valueOf(boolean) returns the literal "true" or "false", and String.valueOf(Object) returns the literal "null" for a null argument, so both calls hand back the same object. In the other cases a new String is typically built on every call, so the references differ even though the content is the same.


Monday, July 18, 2016

Spring MVC UTF-8

Key points

web.xml:
    <filter>
        <filter-name>encodingFilter</filter-name>
        <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
        <init-param>
            <param-name>encoding</param-name>
            <param-value>UTF-8</param-value>
        </init-param>
        <init-param>
            <param-name>forceEncoding</param-name>
            <param-value>true</param-value>
        </init-param>
    </filter>
    <filter-mapping>
        <filter-name>encodingFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>


Maven pom.xml:
  <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
       ...
  </properties>

JSP:
    <%@ page language="java" pageEncoding="UTF-8"%>
    <%@ page contentType="text/html;charset=UTF-8" %>
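
Why all of these settings matter: if any layer decodes UTF-8 bytes with a different charset, non-ASCII characters get corrupted. The plain Java sketch below (not Spring-specific; the class name is hypothetical) shows what happens when UTF-8 bytes are read as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Minimal sketch: the same bytes decoded with the right and the wrong charset.
class EncodingMismatchSketch {

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an accented e; the accent takes two bytes in UTF-8.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset round-trips the text.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));

        // Decoding with ISO-8859-1 turns the two-byte accent into two wrong characters.
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1));
    }
}
```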


Friday, June 17, 2016

Compile xsl files and store in cache to improve XSLT performance


The following is the code commonly found online for doing an XSLT transformation (non-essential pieces removed for brevity).

------------------
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer(new StreamSource(new File(xsltPath)));
transformer.transform(new StreamSource(new File(sourceFilePath)), new StreamResult(new File(resultPath)));
----------------

The code works. But if an XSLT file is relatively big and needs to be used over and over again to transform a lot of files, for example in batch mode, it may not perform well.


The following shows a way to cache the compiled version of an xsl file, which is a 'Templates' object. This object is thread safe.

Code snippet to cache the 'Templates' object.

import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Cache of compiled stylesheets, keyed by xsl file path.
static final Map<String, Templates> cacheTemplates = new ConcurrentHashMap<String, Templates>();

static TransformerFactory transformFactory = null;

static {
    init();
}

private static void init() {
    try {
        transformFactory = TransformerFactory.newInstance();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

public static void cacheCompiled(String xsl) {
    try {
        StreamSource source = new StreamSource(new File(xsl));
        // Create this once for a file and save it in the cache.
        Templates templates = transformFactory.newTemplates(source);
        cacheTemplates.put(xsl, templates);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

The 'templates' object above is basically a compiled version of the original xsl file.  If the xsl file is relatively big, for example 20KB, it takes more than 2 seconds on my local machine to transform even a small file.  Without caching the templates, every transformation takes more than 2 seconds.  With caching, each transformation after the first takes about 0.1 seconds.



The basic code is like this:

// Get the Templates object from the cache based on the xsl file name, then get a Transformer object.
Templates templates = cacheTemplates.get(xsltPath);

Transformer transformer = templates.newTransformer();

transformer.transform(new StreamSource(new File(sourceFilePath)),
        new StreamResult(new File(resultPath)));

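The 'transformer' object mentioned above is not thread safe, so create a new one from the cached 'Templates' object for each transformation instead of sharing it between threads.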
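
Putting the pieces together, here is a self-contained sketch of the same idea that reads the stylesheet and the input from strings, so it can be run without any files. The class name, the sample stylesheet, and the use of computeIfAbsent are my own choices for illustration; the factory call is synchronized because TransformerFactory is not guaranteed to be thread safe.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Minimal sketch: compile each stylesheet once, create a Transformer per use.
class TemplatesCacheSketch {

    // Hypothetical sample stylesheet: prints the name attribute of <greeting>.
    static final String SAMPLE_XSL =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\"><xsl:value-of select=\"/greeting/@name\"/></xsl:template>"
            + "</xsl:stylesheet>";

    private static final TransformerFactory FACTORY = TransformerFactory.newInstance();
    private static final Map<String, Templates> CACHE = new ConcurrentHashMap<>();

    // Returns the compiled stylesheet for the key, compiling it on first use only.
    static Templates compiled(String key, String xslText) {
        return CACHE.computeIfAbsent(key, k -> {
            try {
                synchronized (FACTORY) {
                    return FACTORY.newTemplates(new StreamSource(new StringReader(xslText)));
                }
            } catch (TransformerException e) {
                throw new RuntimeException(e);
            }
        });
    }

    // Creates a fresh Transformer for every call because Transformer is not thread safe.
    static String transform(String key, String xslText, String xml) {
        try {
            Transformer transformer = compiled(key, xslText).newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("greeting.xsl", SAMPLE_XSL, "<greeting name=\"world\"/>"));
    }
}
```

In a real batch job the key would be the xsl file path and the sources would be files, as in the snippets above.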

The Saxon XSLT processor seems to be becoming more popular, and the Xalan processor seems to be fading away.

The Home Edition of Saxon (Saxon-HE), which is free, may be good enough for a lot of applications.

Friday, November 15, 2013

First impressions on open source ESBs

I used commercial ESBs and BPM tools for a couple of years, and recently had a chance to evaluate some open source ESBs.

WSO2:  not easy to use; I had difficulty even getting the sample projects to work. No DataMapper tool, which is a big no-no for my projects.

Mulesoft ESB:  Nice documentation, instructions easy to follow, sample projects can be built and run in a couple of minutes, nice DataMapper tool in the 3.4 version.   I have not had a chance to build a relatively complex application with it.  Not sure whether the community edition is good enough for production use.