
Using enc_to_utf16() in prepare() and valid() functions is not working right #2252

Closed
5 tasks done
jfoug opened this issue Aug 31, 2016 · 13 comments

jfoug (Collaborator) commented Aug 31, 2016

Reference original issue #2243

Known formats using UTF encoding in their prepare()/valid() functions (note: the GPU versions will/should have the same issue):

  • o3logon (valid)
  • mscash (valid and prepare calls valid)
  • mscash2 (valid and prepare calls valid)
  • as400-ssha1 (Convert)
  • oracle (prepare)

These appear to be the only formats currently in the tree that use the enc_to_utf16*() family of functions (I did not check things like strlen16() or other Unicode usages).

magnumripper (Member) commented Aug 31, 2016

Here are some things the current, messy implementation tries to achieve:

  • There are some defaults in the config file: DefaultEncoding, DefaultMSCodepage and DefaultInternalCodepage.
  • Command-line encoding options override config-file ones. BUT when we parse options, we haven't yet loaded a config. That's a mess already.
  • Some formats (e.g. LM) have different config-file defaults than the rest (i.e. DefaultMSCodepage vs. the legacy default of ISO-8859-1). But at the time we are loading the config, we don't yet know what format we'll be using.
  • By the time we do know that we should use DefaultMSCodepage, we still need to keep track of whether some encoding was set with a command-line option, because if it was, that is what we use.
  • Once all three of input, internal and target encodings are known, there is some logic that kicks in if the format uses UTF-16 internally. For example, if we ended up running NT with UTF-8 -> CP850 -> UTF-8, it will silently change it to UTF-8 -> CP850 -> CP850, because the format doesn't care about that difference and this is a huge performance optimization due to fewer conversions back and forth.

The list will grow as I recall more 😵

magnumripper (Member) commented:

Please note that any core changes related to this issue have to go to a topic branch. That includes my own. This is really, really tricky stuff; I know that from the past. There are so many parameters and ways of starting john (let alone resuming).

jfoug (Collaborator, author) commented Aug 31, 2016

> The list will grow as I recall more 😵

I figured as much. This 'seems' like not that big of a deal, but I know that encoding issues are MESSY.

jfoug (Collaborator, author) commented Aug 31, 2016

I agree on the branch. There was no way I was going to make any change, only to have 9 months go by before obscure stuff shows up totally busted.

jfoug (Collaborator, author) commented Sep 1, 2016

From #2243:

> So I propose that dc632b3 is the only thing we actually do for now.

I think if we make sure that all current formats are properly protected from this issue (the formats listed in the OP of this thread), then at least we buy time to look at this. Encoding is very powerful, but damn, it sucks digging through the complexities. It's kind of like having something obscure in dynamic not work: some little-looking issue can end up being 100 lines changed in 8 places.

magnumripper (Member) commented Sep 1, 2016

IRL this is much less of an issue than it may seem. I believe the problems only show up when using input files that are not encoded in UTF-8; at least that was the case in #2243.

All the affected formats are best used with UTF-8 input files. In the unlikely case a user picks some legacy encoding instead, they really should specify it on the command line and not rely on any heuristics (the heuristics are mainly for LM, which is somewhat of a special case for several reasons).

So possibly this is a "solution":

  1. Change the o3logon fix so it emits an informative warning (once) when we see that "negative length", something along the lines of "Input is not UTF-8, you need to specify --input-encoding".
  2. Ensure the other affected formats (listed in the OP) get the same protection against segfaults, plus the warning output.

magnumripper (Member) commented:

Please test cd4cf0d. It now bails out with an error like this:

    $ ../run/john test.in -form:o3logon -w
    Warning: invalid UTF-8 seen reading test.in
    o3logon: Input file is not UTF-8. Please use --input-enc to specify a codepage.

Similar fixes were made to the mscash and oracle formats.

magnumripper (Member) commented Sep 4, 2016

An even simpler fix made to as400-ssha1 in 33e4c9f.

jfoug (Collaborator, author) commented Sep 9, 2016

Are you sure you want to error() in this valid()? I would think a one-time message, and then returning 0 for the bad hashes, would be more proper for valid(). If the user calls john with no -format= option, each format's valid() gets blindly called. It would suck to have an input file that john bailed out on, especially since the format does NOT know whether it is actually being used, or whether it is just being queried for validity.

jfoug (Collaborator, author) commented Sep 10, 2016

9037a83 and 2684366

magnumripper (Member) commented:

You are right. Thanks!

jfoug (Collaborator, author) commented Sep 10, 2016

OK, this 'buys' us a bandage until we can figure out a real solution, OR we close this with documentation stating that encoding usage is not valid until after init() has been called.

magnumripper added a commit that referenced this issue Sep 11, 2016
