What we can learn from the modeling part
When our group was fitting models to predict the interest rate from FOMC minutes and statements, we learned three important lessons. The first is that the training procedure randomly selects subsamples from the whole sample. So, to reduce random error, we took repeated measurements and averaged them instead of relying on a single random split. We controlled the number of repetitions through the range parameter of the for loop in our code: with a range of three, the model is trained three times on three different randomly chosen subsamples. Averaging over these runs gives a more stable, more accurate estimate of the model's performance than any single split would.
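Here is a minimal sketch of that idea, not our exact code: the synthetic X and y below are stand-ins for our real features and labels.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Stand-ins for our real data: X would be the document-term matrix
# built from the FOMC texts, y the rate-change labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(120, 50))
y = rng.integers(0, 3, size=120)

scores = []
for seed in range(3):  # three random subsamples, as in our loop
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = MultinomialNB().fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

print("mean accuracy over 3 runs:", np.mean(scores))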
When choosing models, we tried both the Naive Bayes model and the XGBoost model to predict the interest rate. It turns out that the Naive Bayes model runs faster than the XGBoost model. This is likely because Naive Bayes directly applies Bayes' theorem under a conditional-independence assumption, whereas XGBoost builds an ensemble of trees with the Classification And Regression Tree (CART) algorithm, which is slower and more complicated, especially on large data sets.
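A quick way to check the speed difference yourself (this assumes the xgboost package is installed; the synthetic data is again only a stand-in):

import time
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 200)).astype(float)
y = rng.integers(0, 3, size=1000)

for name, model in [("Naive Bayes", MultinomialNB()),
                    ("XGBoost", XGBClassifier(n_estimators=100))]:
    start = time.perf_counter()
    model.fit(X, y)
    print(name, "fit in", round(time.perf_counter() - start, 3), "s")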
The last problem we encountered was installing the XGBoost package for Python, particularly on macOS, where installation is one of the most frequently asked questions. Xcode on the Mac ships a clang compiler that does not support OpenMP, so it cannot be used to compile XGBoost. After some research, we found a solution consisting of the eight steps listed below:
1. In your terminal, install Homebrew if you haven't already: /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)". If you encounter errors in this step, refer to https://computers.tutsplus.com/tutorials/homebrew-demystified-os-xs-ultimate-package-manager--mac-44884.
2. In the terminal, run "brew install gcc --without-multilib"; the --without-multilib flag should help with multi-threading errors we may encounter later. This may take some time. If it doesn't work, on Linux you can try "sudo apt-get install gcc" and "sudo apt-get install g++" instead.
3. cd into wherever you want your directory to be. I simply put mine at my root directory, but make sure it is not inside another repository! (We don't want nested repos.)
4. Clone the repository there: "git clone --recursive https://github.com/dmlc/xgboost"
5. cd into your new directory called xgboost: "cd xgboost"
6. Open make/config.mk. If your system is set up for it, simply typing "open make/config.mk" may work; if not, open it in your favorite text editor through the GUI or another command. Uncomment these two lines: "export CC = gcc" and "export CXX = g++". Depending on which version of gcc you installed, you may need to change them to "export CC = gcc-6" and "export CXX = g++-6". Make sure you save the file, then move on to the next step.
7. cd into xgboost (if you're not already there), copy the config into place with "cp make/config.mk .", and then build with "make -j4".
8. Final step: "cd python-package; sudo python setup.py install"
To sum up, the most time-consuming issues we faced were debugging and model selection. Every decision should be made carefully and logically. We are glad we were able to finish this part of the coding meticulously.
Lessons Learnt from the Web-Scraping and Data-Preprocessing Part
Since our goal is to analyze the Fed's statements and minutes, we needed to get the textual data from the Federal Reserve website. I encountered several difficulties when I attempted to scrape it.
Professor Buehlmaier taught us Python's re module, which turned out to be very useful for matching patterns in the pages I wanted to scrape and for extracting the date information I needed from the data. We used \d to capture the digits in each URL corresponding to the date of the statement/minutes we were after. The date was extracted with: date = re.search(r"(\d{4})(\d{2})(\d{2})", nurl)[0]
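For illustration, applied to a URL of the shape the Fed uses (the exact link below is only an example, not necessarily a live page):

import re

nurl = "https://www.federalreserve.gov/newsevents/pressreleases/monetary20190320a.htm"
match = re.search(r"(\d{4})(\d{2})(\d{2})", nurl)
print(match[0])        # '20190320' -- the full date string
print(match.groups())  # ('2019', '03', '20') -- year, month, day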
Since data from before and after 2015 are stored on two different pages, we needed two sets of URLs for different years. Most of the older data is stored in different formats, so my original code could not capture all of the years. I had to go back to the pre-2015 links and work out the differences in pattern: the href would contain the year information, and the string after the HTML tag 'a' would contain 'Statement.*' (the S is capitalized, and we need to pay close attention to such minor differences when coding). A rough sketch of the approach is shown below.
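This is a hedged reconstruction rather than our exact code: the historical-page URL and the link-text pattern are assumptions based on how the Fed's site looked when we scraped it.

import re
import requests
from bs4 import BeautifulSoup

base = "https://www.federalreserve.gov"
# Pre-2015 meetings live on per-year "historical" pages (URL pattern assumed).
page = requests.get(base + "/monetarypolicy/fomchistorical2014.htm")
soup = BeautifulSoup(page.text, "html.parser")

links = []
for a in soup.find_all("a", href=True):
    # On these pages the link text starts with a capital-S "Statement".
    if a.get_text(strip=True).startswith("Statement"):
        nurl = a["href"]
        date = re.search(r"(\d{4})(\d{2})(\d{2})", nurl)
        if date:
            links.append((date[0], base + nurl))

print(links[:3])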
Another important preprocessing step is to match each document with the subsequent interest-rate change, i.e., whether the textual data is followed by a rate cut, a rate hike, or no change. To do this, we matched the textual data with the federal funds rate at the current meeting and at the next meeting; the difference between the two is the upcoming change we want to predict. We used shift(-1) to move the next meeting's rate up to the current meeting's row and then computed the difference.
data_df[column_diff] = data_df[column_rate].diff(prediction_shift)  # t(n) - t(0)
data_df[column_diff] = data_df[column_diff].shift(-1 * prediction_shift)
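On toy data (the column names and the prediction_shift value below are placeholders, not our real ones), the effect looks like this:

import pandas as pd

# Toy series: the federal funds target rate at four consecutive meetings.
data_df = pd.DataFrame({"rate": [2.00, 2.25, 2.25, 2.00]})
column_rate, column_diff, prediction_shift = "rate", "diff", 1

data_df[column_diff] = data_df[column_rate].diff(prediction_shift)
data_df[column_diff] = data_df[column_diff].shift(-1 * prediction_shift)
print(data_df)
#    rate  diff
# 0  2.00  0.25   <- the next meeting hikes by 0.25
# 1  2.25  0.00
# 2  2.25 -0.25
# 3  2.00   NaN   <- no next meeting left in the sample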